Evaluation Metrics

Translation quality was evaluated using two widely accepted automatic metrics: BLEU and METEOR. Each captures different aspects of translation performance and helps compare outputs from traditional NMT models and modern LLMs.


BLEU (Bilingual Evaluation Understudy)

BLEU is a precision-based metric that measures n-gram overlap between a model’s translation and a reference translation.

  • Best suited for literal translations with consistent word choices
  • Performs reliably for high-resource language pairs with relatively fixed word order
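
For concreteness, here is a minimal sentence-level BLEU computation, a sketch assuming Python with NLTK installed; the sentences and the smoothing choice are illustrative, not taken from the evaluation itself.

    # Minimal sentence-level BLEU with NLTK; sentences are illustrative.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "the cat sat on the mat".split()   # tokenized reference
    candidate = "the cat is on the mat".split()    # tokenized model output

    # Smoothing avoids zero scores when a higher-order n-gram never matches.
    smooth = SmoothingFunction().method1
    score = sentence_bleu([reference], candidate, smoothing_function=smooth)
    print(f"BLEU: {score:.3f}")  # greater surface overlap -> higher score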

Scoring:

  • Range: 0.0 to 1.0 (expressed as a decimal or percentage)
  • Higher scores indicate closer surface-level matches.

  Score        Interpretation
  0.40+        Excellent (near-human, fluent output)
  0.30–0.40    Good quality
  0.20–0.30    Understandable but flawed
  0.10–0.20    Low quality or partially incorrect
  < 0.10       Very poor translation or off-topic

Limitations:

  • Gives no credit for synonyms and captures word order only through exact n-gram matches
  • Penalizes legitimate paraphrasing, which may bias comparisons against LLMs (illustrated in the sketch below)
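
The paraphrasing penalty is easy to demonstrate: a faithful rewording shares few exact n-grams with the reference, so BLEU scores it far below a literal rendering. The sketch below relies on the same NLTK assumption as above, again with illustrative sentences.

    # A faithful paraphrase scores low under BLEU despite preserving meaning.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference  = "the meeting was postponed until next week".split()
    literal    = "the meeting was postponed until next week".split()
    paraphrase = "they delayed the meeting to the following week".split()

    smooth = SmoothingFunction().method1
    for name, hyp in [("literal", literal), ("paraphrase", paraphrase)]:
        score = sentence_bleu([reference], hyp, smoothing_function=smooth)
        print(f"{name}: {score:.3f}")  # expect the paraphrase to score far lower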

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR offers a more nuanced evaluation by considering:

  • Stemming (e.g., “run” vs. “running”)
  • Synonyms (e.g., “child” vs. “kid”)
  • Word reordering
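
A minimal METEOR computation with NLTK is sketched below. Synonym matching relies on the WordNet corpus, so the download step is assumed; the sentence pair is illustrative.

    # Minimal METEOR with NLTK; WordNet powers the synonym matching.
    import nltk
    from nltk.translate.meteor_score import meteor_score

    nltk.download("wordnet", quiet=True)  # some NLTK versions also need "omw-1.4"

    reference = "the children are playing outside".split()
    candidate = "the kids are playing outside".split()

    # "kids" can match "children" via WordNet synonymy, so METEOR gives
    # credit where BLEU's exact n-gram matching would not.
    score = meteor_score([reference], candidate)
    print(f"METEOR: {score:.3f}")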

Scoring:

  • Range: 0.0 to 1.0
  • METEOR may score lower than BLEU on strictly literal translations, but it is often higher for LLM-generated outputs because it credits synonyms, stems, and flexible word order

  Score        Interpretation
  0.60+        Very strong (fluent and faithful)
  0.50–0.60    Good quality
  0.40–0.50    Mixed (may contain awkward phrasing)
  < 0.40       Often disfluent or semantically incorrect

Strengths:

  • Captures semantic similarity, making it well suited to evaluating LLM outputs
  • More aligned with human judgment of fluency and meaning

Notes on Usage

  • Both metrics are used for relative comparison across models
  • Neither is perfect on its own; human evaluation is often needed for nuanced distinctions
  • METEOR tends to favor natural fluency, while BLEU rewards surface-level accuracy, as the combined sketch below illustrates
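
Putting the two together, a relative comparison scores each model's output with both metrics side by side. The sketch below reuses the NLTK setup from the earlier examples; the model names and sentences are illustrative assumptions, not results from this evaluation.

    # Score competing outputs with BLEU and METEOR side by side.
    import nltk
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.meteor_score import meteor_score

    nltk.download("wordnet", quiet=True)

    reference = "the child ran quickly to the store".split()
    outputs = {
        "nmt": "the child ran quickly to the store".split(),  # literal
        "llm": "the kid rushed to the shop".split(),          # paraphrased
    }

    smooth = SmoothingFunction().method1
    for name, hyp in outputs.items():
        bleu = sentence_bleu([reference], hyp, smoothing_function=smooth)
        meteor = meteor_score([reference], hyp)
        # Expect the literal output to lead on BLEU, while METEOR narrows
        # the gap for the paraphrase via stems and synonyms.
        print(f"{name}: BLEU={bleu:.3f}  METEOR={meteor:.3f}")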