Evaluation Overview

The Easy Translate Evaluation framework is designed to benchmark the performance of machine translation models and large language models across multiple languages, using consistent metrics and datasets.


Goal

To provide a reproducible, flexible, and extensible pipeline for:

  • Evaluating translation quality using standard metrics
  • Comparing traditional NMT models with modern local LLM-based translators
  • Identifying an effective model for local, offline translation

Dataset: WMT-19

The WMT-19 (Workshop on Machine Translation 2019) dataset serves as the primary benchmark for this evaluation. It includes professionally curated parallel corpora across many language pairs and is widely adopted in machine translation research.

Key attributes:

  • High-quality human reference translations in English
  • Standardized formatting, compatible with MT toolkits
  • A sample of 1,000 sentence pairs per language pair was selected to keep computational requirements manageable (see the loading sketch below)
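
The framework's own data-loading code is not reproduced here, but the sampling step can be sketched as follows, assuming the Hugging Face datasets library and the validation split of the de-en configuration (the split choice and random seed are illustrative):

```python
# Minimal sketch: load a WMT-19 subset and sample 1,000 sentence pairs.
# Assumes the Hugging Face `datasets` library; the framework's actual
# loading code may differ.
from datasets import load_dataset

# Each row carries a `translation` dict keyed by language code.
wmt19 = load_dataset("wmt19", "de-en", split="validation")

# A fixed seed keeps the 1,000-pair sample reproducible across runs.
sample = wmt19.shuffle(seed=42).select(range(1000))

sources = [row["translation"]["de"] for row in sample]
references = [row["translation"]["en"] for row in sample]
```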

Language Pairs

Translations were evaluated from six source languages into English, using subsets of the WMT-19 dataset:

  • 🇩🇪 German → English (de-en)
  • 🇫🇮 Finnish → English (fi-en)
  • 🇮🇳 Gujarati → English (gu-en)
  • 🇱🇹 Lithuanian → English (lt-en)
  • 🇷🇺 Russian → English (ru-en)
  • 🇨🇳 Chinese → English (zh-en)

Tools & Frameworks

Two model types were included in the evaluation:

  • Transformer-based machine translation models, accessed through the Hugging Face Transformers library
  • Locally run LLMs such as LLaMA, Mistral, Gemma, and Phi, deployed via Ollama

This setup enables comparison between domain-specific MT systems and general-purpose LLMs in translation tasks.
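
As a rough illustration of the two paths, the sketch below translates one German sentence with a Transformers pipeline and with an Ollama-served LLM. The model identifiers (Helsinki-NLP/opus-mt-de-en, llama3.1) and the prompt wording are illustrative assumptions, not necessarily what the framework uses:

```python
# Sketch of the two model paths; model names and prompt wording are
# illustrative. Assumes a local Ollama server is running and the model
# has already been pulled.
from transformers import pipeline
import ollama

text = "Maschinelle Übersetzung wird immer besser."

# Path 1: a dedicated MT model via the Transformers translation pipeline.
marian = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
print(marian(text)[0]["translation_text"])

# Path 2: a general-purpose LLM served locally by Ollama.
prompt = (
    "Translate the following German sentence to English. "
    f"Reply with the translation only.\n\n{text}"
)
response = ollama.generate(model="llama3.1", prompt=prompt)
print(response["response"].strip())
```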


Models Evaluated

| Model | Type | Size (Parameters) | Description |
| --- | --- | --- | --- |
| mbart50 | MT-specific | ~610M | Facebook’s multilingual encoder-decoder model covering 50+ languages. |
| marian | MT-specific | ~280M | Fast, language-pair-specific model from Helsinki-NLP. |
| nllb | MT-specific | 600M (distilled) | Facebook’s "No Language Left Behind" model (distilled version). |
| m2m100 | MT-specific | ~418M | Facebook’s many-to-many multilingual model supporting 100+ languages. |
| mistral | General LLM | 7B | Decoder-only open LLM (Mistral AI); not trained for translation. |
| llama3.1 | General LLM | 8B | Meta’s LLaMA 3.1 model (8B), with instruction tuning and multilingual potential. |
| llama3.2 | General LLM | 3B | Smaller LLaMA 3.2 variant (3B); weaker performance on translation tasks. |
| gemma | General LLM | 4B | Google’s compact open-source LLM, multilingual-capable. |
| phi3 | General LLM | 3.8B | Microsoft’s small LLM designed for efficiency; weak on translation. |

Language Mappings

Each model architecture requires different language code formats. The following table illustrates the mappings used for the de-en (German to English) direction:

| Model | Source Lang | Target Lang |
| --- | --- | --- |
| nllb | deu_Latn | eng_Latn |
| m2m100 | de | en |
| mbart50 | de_DE | en_XX |
| llama / gemma / phi3 / mistral | German | English |
| marian | de | en |

Equivalent mappings were defined in the evaluation configuration for every language pair.
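
For illustration, the de-en mappings above could be represented as a simple dictionary; the framework's actual configuration format may differ:

```python
# Illustrative structure for the de-en mappings above, not the actual
# configuration file used by the framework.
LANG_CODES = {
    "nllb":    {"source": "deu_Latn", "target": "eng_Latn"},
    "m2m100":  {"source": "de",       "target": "en"},
    "mbart50": {"source": "de_DE",    "target": "en_XX"},
    "marian":  {"source": "de",       "target": "en"},
    # LLMs prompted via Ollama take plain language names instead of codes.
    "llama3.1": {"source": "German",  "target": "English"},
}
```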

Evaluation Results
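
Both metrics compare model outputs against the WMT-19 English references. The framework's scoring code is not reproduced here; the sketch below shows how BLEU and METEOR can be computed for one language pair, assuming the Hugging Face evaluate package and placeholder predictions and references:

```python
# Minimal scoring sketch for one language pair; assumes the Hugging Face
# `evaluate` package. The strings below are placeholders.
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

predictions = ["The cat sits on the mat."]          # model outputs
references = [["The cat is sitting on the mat."]]   # WMT-19 English references

bleu_score = bleu.compute(predictions=predictions, references=references)["bleu"]
meteor_score = meteor.compute(predictions=predictions,
                              references=[r[0] for r in references])["meteor"]
print(f"BLEU: {bleu_score:.4f}  METEOR: {meteor_score:.4f}")
```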

BLEU Scores

| base_model | de-en | fi-en | gu-en | lt-en | ru-en | zh-en |
| --- | --- | --- | --- | --- | --- | --- |
| gemma | 0.2173 | 0.1929 | 0.0980 | 0.1756 | 0.2208 | 0.2253 |
| llama3.1 | 0.2054 | 0.1730 | 0.0951 | 0.1000 | 0.1953 | 0.1957 |
| llama3.2 | 0.1757 | 0.1314 | 0.0215 | 0.0768 | 0.1741 | 0.1806 |
| m2m100 | 0.2223 | 0.2163 | 0.0193 | 0.2484 | 0.2072 | 0.2026 |
| marian | 0.2899 | 0.2946 | NaN | NaN | 0.1930 | 0.2573 |
| mbart50 | 0.2859 | 0.2928 | 0.8241 | 0.5610 | 0.2307 | 0.2721 |
| mistral | 0.1812 | 0.1144 | 0.0103 | 0.0388 | 0.1934 | 0.1803 |
| nllb | 0.2726 | 0.2570 | 0.1026 | 0.2570 | 0.2141 | 0.2476 |
| phi3 | 0.1520 | 0.0318 | 0.0000 | 0.0099 | 0.0951 | 0.1204 |

METEOR Scores

| base_model | de-en | fi-en | gu-en | lt-en | ru-en | zh-en |
| --- | --- | --- | --- | --- | --- | --- |
| gemma | 0.5098 | 0.4902 | 0.3822 | 0.4187 | 0.4517 | 0.5118 |
| llama3.1 | 0.5060 | 0.4724 | 0.3897 | 0.3820 | 0.4291 | 0.4921 |
| llama3.2 | 0.4708 | 0.4113 | 0.3160 | 0.3071 | 0.3952 | 0.4691 |
| m2m100 | 0.5106 | 0.5107 | 0.0724 | 0.5055 | 0.4219 | 0.4725 |
| marian | 0.5754 | 0.5048 | NaN | NaN | 0.4955 | 0.5662 |
| mbart50 | 0.5609 | 0.5384 | 0.6526 | 0.8063 | 0.4937 | 0.5777 |
| mistral | 0.4772 | 0.3782 | 0.0862 | 0.2632 | 0.4135 | 0.4740 |
| nllb | 0.5455 | 0.4375 | 0.3934 | 0.5194 | 0.4784 | 0.5166 |
| phi3 | 0.3825 | 0.2192 | 0.0584 | 0.1779 | 0.3214 | 0.3948 |

Averaged Scores

| base_model | BLEU | METEOR |
| --- | --- | --- |
| mbart50 | 0.4122 | 0.6139 |
| marian | 0.2588 | 0.5465 |
| nllb | 0.2252 | 0.5106 |
| gemma | 0.1883 | 0.4607 |
| m2m100 | 0.1860 | 0.4405 |
| llama3.1 | 0.1608 | 0.4452 |
| llama3.2 | 0.1267 | 0.3949 |
| mistral | 0.1182 | 0.3493 |
| phi3 | 0.0682 | 0.2693 |
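
The averages above collapse the per-pair scores into a single number per model. A minimal sketch of this aggregation, assuming the per-pair scores are held in pandas DataFrames (the file names below are hypothetical) and that NaN entries, such as marian on gu-en and lt-en, are skipped, which is the pandas default:

```python
# Sketch of averaging per-pair scores into a summary table; assumes
# DataFrames with models as rows and language pairs as columns.
# NOTE: the CSV file names are hypothetical placeholders.
import pandas as pd

bleu_df = pd.read_csv("bleu_scores.csv", index_col="base_model")
meteor_df = pd.read_csv("meteor_scores.csv", index_col="base_model")

summary = pd.DataFrame({
    "BLEU": bleu_df.mean(axis=1),      # row-wise mean; NaN cells are skipped
    "METEOR": meteor_df.mean(axis=1),
}).sort_values("BLEU", ascending=False)

print(summary.round(4))
```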