Evaluation Overview
The Easy Translate Evaluation framework is designed to benchmark the performance of machine translation models and large language models across multiple languages, using consistent metrics and datasets.
Goal
To provide a reproducible, flexible, and extensible pipeline for:
- Evaluating translation quality using standard metrics (see the scoring sketch below)
- Comparing traditional NMT models with modern local LLM-based translators
- Identifying an effective model for local, offline translation
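As a minimal sketch of that scoring step, corpus-level BLEU and averaged sentence-level METEOR could be computed as below, assuming sacrebleu and NLTK as the metric backends (the framework may use other implementations, such as Hugging Face `evaluate`):

```python
# Scoring sketch. Assumption: sacrebleu for BLEU and NLTK for METEOR; the
# framework may use different metric implementations.
import nltk
import sacrebleu
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

def score_corpus(hypotheses: list[str], references: list[str]) -> dict:
    """Corpus-level BLEU (scaled to 0-1, as in the tables below) and mean METEOR."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score / 100
    meteor = sum(
        meteor_score([ref.split()], hyp.split())
        for hyp, ref in zip(hypotheses, references)
    ) / len(hypotheses)
    return {"bleu": bleu, "meteor": meteor}

print(score_corpus(["the cat sits on the mat"], ["the cat sat on the mat"]))
```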
Dataset: WMT-19
The WMT-19 (Workshop on Machine Translation 2019) dataset serves as the primary benchmark for this evaluation. It includes professionally curated parallel corpora across many language pairs and is widely adopted in machine translation research.
Key attributes:
- High-quality human references for English translations
- Standardized formatting, compatible with MT toolkits
- A sample of 1,000 sentence pairs per language pair was used to keep computational requirements manageable (see the loading sketch below)
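A minimal sketch of loading one WMT-19 subset with the Hugging Face `datasets` library and drawing the 1,000-pair sample; the use of the validation split and the shuffle-then-take sampling are assumptions, not necessarily the framework's exact procedure:

```python
# Data-loading sketch using Hugging Face `datasets`. The validation split,
# fixed seed, and shuffle-then-take sampling are assumptions about how the
# 1,000 pairs were selected.
from datasets import load_dataset

def load_wmt19_sample(pair: str = "de-en", n: int = 1000, seed: int = 42):
    ds = load_dataset("wmt19", pair, split="validation")
    ds = ds.shuffle(seed=seed).select(range(min(n, len(ds))))
    src, tgt = pair.split("-")
    sources = [ex["translation"][src] for ex in ds]
    references = [ex["translation"][tgt] for ex in ds]
    return sources, references

sources, references = load_wmt19_sample("de-en")
print(len(sources), sources[0])
```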
Language Pairs
Translations were evaluated from six source languages into English, using subsets of the WMT-19 dataset:
- 🇩🇪 German → English (`de-en`)
- 🇫🇮 Finnish → English (`fi-en`)
- 🇮🇳 Gujarati → English (`gu-en`)
- 🇱🇹 Lithuanian → English (`lt-en`)
- 🇷🇺 Russian → English (`ru-en`)
- 🇨🇳 Chinese → English (`zh-en`)
Tools & Frameworks
Two model types were included in the evaluation:
- Transformer-based machine translation models, accessed through the Hugging Face Transformers library
- Locally-run LLMs such as LLaMA, Mistral, Gemma, and Phi, deployed via Ollama
This setup enables comparison between domain-specific MT systems and general-purpose LLMs in translation tasks.
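As a rough sketch of the two access paths for a single German sentence; the checkpoint name, Ollama model tag, and prompt wording below are illustrative assumptions rather than the framework's exact configuration:

```python
# Sketch of the two access paths: a Hugging Face translation pipeline vs. an
# LLM served locally by Ollama. Model names and the prompt template are
# illustrative assumptions, not the framework's exact configuration.
import requests
from transformers import pipeline

def translate_with_marian(text: str) -> str:
    # Example MarianMT checkpoint for German -> English.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
    return translator(text)[0]["translation_text"]

def translate_with_ollama(text: str, model: str = "llama3.1") -> str:
    # Ollama's local REST API (default port 11434).
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Translate the following German sentence to English. "
                      f"Reply with the translation only.\n\n{text}",
            "stream": False,
        },
        timeout=120,
    )
    return resp.json()["response"].strip()

print(translate_with_marian("Das Wetter ist heute schön."))
print(translate_with_ollama("Das Wetter ist heute schön."))
```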
Models Evaluated
Model | Type | Size (Parameters) | Description |
---|---|---|---|
`mbart50` | MT-specific | ~610M | Facebook’s multilingual encoder-decoder model for 50+ languages. |
`marian` | MT-specific | ~280M | Fast, language-pair–specific model from Helsinki-NLP. |
`nllb` | MT-specific | 600M (distilled) | Facebook’s "No Language Left Behind" model (distilled version). |
`m2m100` | MT-specific | ~418M | Facebook’s many-to-many multilingual model supporting 100+ languages. |
`mistral` | General LLM | 7B | Decoder-only open LLM (Mistral AI); not trained for translation. |
`llama3.1` | General LLM | 8B | Meta’s LLaMA 3.1 model (8B), with instruction tuning and multilingual potential. |
`llama3.2` | General LLM | 3B | Smaller LLaMA 3.2 variant (3B); weaker performance on translation tasks. |
`gemma` | General LLM | 4B | Google’s compact open-source LLM, multilingual-capable. |
`phi3` | General LLM | 3.8B | Microsoft’s small LLM designed for efficiency; weak on translation. |
Language Mappings
Each model architecture requires different language code formats. The following table illustrates the mappings used for the `de-en` (German to English) direction:
Model | Source Lang | Target Lang |
---|---|---|
`nllb` | `deu_Latn` | `eng_Latn` |
`m2m100` | `de` | `en` |
`mbart50` | `de_DE` | `en_XX` |
`marian` | `de` | `en` |
`llama/gemma/phi3/mistral` | `German` | `English` |
Equivalent mappings for the remaining language pairs are defined in the evaluation configurations.
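One possible way to represent these mappings in code; the dictionary layout and names below are illustrative assumptions, not the framework's actual configuration format:

```python
# Illustrative per-model language-code mapping for the de-en direction.
# Structure and names are assumptions about the configuration format.
LANG_CODES = {
    "de-en": {
        "nllb":    {"src": "deu_Latn", "tgt": "eng_Latn"},
        "m2m100":  {"src": "de",       "tgt": "en"},
        "mbart50": {"src": "de_DE",    "tgt": "en_XX"},
        "marian":  {"src": "de",       "tgt": "en"},
        "llm":     {"src": "German",   "tgt": "English"},  # llama / gemma / phi3 / mistral
    },
    # Equivalent entries exist for fi-en, gu-en, lt-en, ru-en, and zh-en.
}

def get_lang_codes(pair: str, model: str) -> tuple[str, str]:
    codes = LANG_CODES[pair].get(model, LANG_CODES[pair]["llm"])
    return codes["src"], codes["tgt"]
```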
Evaluation Results
BLEU Scores
base_model | de-en | fi-en | gu-en | lt-en | ru-en | zh-en |
---|---|---|---|---|---|---|
gemma | 0.2173 | 0.1929 | 0.0980 | 0.1756 | 0.2208 | 0.2253 |
llama3.1 | 0.2054 | 0.1730 | 0.0951 | 0.1000 | 0.1953 | 0.1957 |
llama3.2 | 0.1757 | 0.1314 | 0.0215 | 0.0768 | 0.1741 | 0.1806 |
m2m100 | 0.2223 | 0.2163 | 0.0193 | 0.2484 | 0.2072 | 0.2026 |
marian | 0.2899 | 0.2946 | NaN | NaN | 0.1930 | 0.2573 |
mbart50 | 0.2859 | 0.2928 | 0.8241 | 0.5610 | 0.2307 | 0.2721 |
mistral | 0.1812 | 0.1144 | 0.0103 | 0.0388 | 0.1934 | 0.1803 |
nllb | 0.2726 | 0.2570 | 0.1026 | 0.2570 | 0.2141 | 0.2476 |
phi3 | 0.1520 | 0.0318 | 0.0000 | 0.0099 | 0.0951 | 0.1204 |
METEOR Scores
base_model | de-en | fi-en | gu-en | lt-en | ru-en | zh-en |
---|---|---|---|---|---|---|
gemma | 0.5098 | 0.4902 | 0.3822 | 0.4187 | 0.4517 | 0.5118 |
llama3.1 | 0.5060 | 0.4724 | 0.3897 | 0.3820 | 0.4291 | 0.4921 |
llama3.2 | 0.4708 | 0.4113 | 0.3160 | 0.3071 | 0.3952 | 0.4691 |
m2m100 | 0.5106 | 0.5107 | 0.0724 | 0.5055 | 0.4219 | 0.4725 |
marian | 0.5754 | 0.5048 | NaN | NaN | 0.4955 | 0.5662 |
mbart50 | 0.5609 | 0.5384 | 0.6526 | 0.8063 | 0.4937 | 0.5777 |
mistral | 0.4772 | 0.3782 | 0.0862 | 0.2632 | 0.4135 | 0.4740 |
nllb | 0.5455 | 0.4375 | 0.3934 | 0.5194 | 0.4784 | 0.5166 |
phi3 | 0.3825 | 0.2192 | 0.0584 | 0.1779 | 0.3214 | 0.3948 |
Averaged Scores
base_model | BLEU | METEOR |
---|---|---|
mbart50 | 0.4122 | 0.6139 |
marian | 0.2588 | 0.5465 |
nllb | 0.2252 | 0.5106 |
gemma | 0.1883 | 0.4607 |
m2m100 | 0.1860 | 0.4405 |
llama3.1 | 0.1608 | 0.4452 |
llama3.2 | 0.1267 | 0.3949 |
mistral | 0.1182 | 0.3493 |
phi3 | 0.0682 | 0.2693 |
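The averaged scores appear to be simple per-model means over the available language pairs (NaN entries are skipped). A sketch of that aggregation with pandas, where the results-file path and column names are assumptions:

```python
# Aggregation sketch with pandas. The file path and column names
# ("base_model", "bleu", "meteor") are assumptions about the results layout;
# NaN cells (e.g. marian on gu-en and lt-en) are ignored by .mean().
import pandas as pd

results = pd.read_csv("results.csv")  # hypothetical: one row per (model, language pair)
averaged = (
    results.groupby("base_model")[["bleu", "meteor"]]
    .mean()
    .sort_values("bleu", ascending=False)
    .round(4)
)
print(averaged)
```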