Evaluation Metrics
Translation quality was evaluated using two widely accepted automatic metrics: BLEU and METEOR. Each captures different aspects of translation performance and helps compare outputs from traditional NMT models and modern LLMs.
BLEU (Bilingual Evaluation Understudy)
BLEU is a precision-based metric that measures n-gram overlap between a model’s translation and a reference translation.
- Best suited for literal translations with consistent word choices
- Performs most reliably for high-resource language pairs with relatively fixed word order
Scoring:
- Range: 0.0 to 1.0 (often reported as a percentage from 0 to 100)
- Higher scores indicate closer surface-level matches.
| Score | Interpretation |
|---|---|
| 0.40+ | Excellent (near-human, fluent output) |
| 0.30–0.40 | Good quality |
| 0.20–0.30 | Understandable but flawed |
| 0.10–0.20 | Low quality or partially incorrect |
| < 0.10 | Very poor translation or off-topic |
Limitations:
- Gives no credit for synonyms or semantically equivalent phrasing
- Penalizes legitimate paraphrasing and reordering, which may bias scores against LLM outputs
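As a concrete reference point, the sketch below computes a corpus-level BLEU score with NLTK. The toy sentences, the choice of NLTK over alternatives such as sacrebleu, and the smoothing method are illustrative assumptions rather than part of the evaluation setup described here.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Tokenized references and system outputs for a tiny two-sentence corpus.
references = [
    [["the", "cat", "is", "on", "the", "mat"]],           # reference(s) for sentence 1
    [["there", "is", "a", "book", "on", "the", "table"]], # reference(s) for sentence 2
]
hypotheses = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "book", "is", "on", "the", "table"],
]

# Smoothing avoids zero scores when a higher-order n-gram has no match,
# which is common for very short segments.
bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")  # on the 0.0-1.0 scale used in the table above
```

Corpus-level BLEU is generally preferred over averaging sentence-level scores, because the brevity penalty and n-gram counts are defined over the whole test set.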
METEOR (Metric for Evaluation of Translation with Explicit ORdering)
METEOR offers a more nuanced evaluation by considering:
- Stemming (e.g., “run” vs. “running”)
- Synonyms (e.g., “child” vs. “kid”)
- Word reordering
Scoring:
- Range: 0.0 to 1.0
- Because METEOR gives partial credit for stems, synonyms, and reordered words, it tends to rank paraphrase-heavy outputs (typical of LLMs) more favorably than BLEU does; for strictly literal translations, the two metrics agree more closely
| Score | Interpretation |
|---|---|
| 0.60+ | Very strong (fluent and faithful) |
| 0.50–0.60 | Good quality |
| 0.40–0.50 | Mixed (may contain awkward phrasing) |
| < 0.40 | Often disfluent or semantically incorrect |
Strengths:
- Captures semantic similarity, useful for LLM outputs
- More aligned with human judgment of fluency and meaning
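A minimal sketch of a sentence-level METEOR computation with NLTK is shown below. The example sentences are made up purely to exercise the stem and synonym matches, and recent NLTK releases expect pre-tokenized input plus a local WordNet copy.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

# METEOR's synonym matching relies on WordNet; download once if not present.
nltk.download("wordnet", quiet=True)

reference  = "the child likes to run in the park".split()
hypothesis = "the kid likes running in the park".split()

# Recent NLTK releases require pre-tokenized input (lists of tokens).
# Exact matches ("the", "likes", "in", "park"), the stem match
# ("running"/"run"), and the synonym match ("kid"/"child") all
# contribute to the score.
score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.3f}")  # 0.0-1.0, interpreted per the table above
```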
Notes on Usage
- Both metrics are used for relative comparison across models (see the sketch after this list)
- Neither is perfect on its own; human evaluation is often needed for nuanced distinctions
- METEOR tends to favor natural fluency, while BLEU rewards surface-level accuracy
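To make the relative-comparison point concrete, the sketch below scores two hypothetical systems against the same references with both metrics. The system names, the sentences, and the simple per-sentence METEOR average are assumptions for illustration only, and WordNet is assumed to be available as in the earlier sketch; only the relative differences between systems are meaningful, not the absolute numbers.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

def evaluate(outputs, references):
    """Corpus-level BLEU and mean sentence-level METEOR for tokenized
    outputs against a single tokenized reference per sentence."""
    bleu = corpus_bleu([[ref] for ref in references], outputs,
                       smoothing_function=SmoothingFunction().method1)
    meteor = sum(meteor_score([ref], out)
                 for ref, out in zip(references, outputs)) / len(outputs)
    return bleu, meteor

refs = [r.split() for r in ["the child ran to the park",
                            "she closed the door quietly"]]
nmt_out = [h.split() for h in ["the child ran to the park",
                               "she closed the door quiet"]]
llm_out = [h.split() for h in ["the kid ran over to the park",
                               "quietly she shut the door"]]

# Report both metrics side by side for each system.
for name, outputs in [("NMT", nmt_out), ("LLM", llm_out)]:
    bleu, meteor = evaluate(outputs, refs)
    print(f"{name}: BLEU={bleu:.3f}  METEOR={meteor:.3f}")
```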