Configuration

This section describes the global configuration used in the translation evaluation framework. The configuration defines available models, evaluation targets, dataset settings, and language-specific code mappings for each model type.

Overview

All configuration parameters are stored in config.py and configs/language_mappings.yaml. These define:

The available translation models
The models to be evaluated in the current run
Dataset split settings
Output directory paths
Language code mappings by model and language pair

Model Registry

Translation models are registered in the MODEL_REGISTRY dictionary. Each key is the model name used in evaluation scripts, and each value is either a translator class or a functools.partial that pre-binds the model_name argument of LLMTranslator.

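from functools import partial

# NllbTranslator, M2M100Translator, MBartTranslator, MarianTranslator and
# LLMTranslator are the project's translator classes (imports not shown).
MODEL_REGISTRY = {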
    "nllb":     NllbTranslator,
    "m2m100":   M2M100Translator,
    "mbart50":  MBartTranslator,
    "marian":   MarianTranslator,
    "llama3.2": partial(LLMTranslator, model_name="llama3.2:3b"),
    "llama3.1": partial(LLMTranslator, model_name="llama3.1:8b"),
    "gemma":    partial(LLMTranslator, model_name="gemma3:4b"),
    "phi3":     partial(LLMTranslator, model_name="phi3:3.8b"),
    "mistral":  partial(LLMTranslator, model_name="mistral:7b"),
}
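
The sketch below illustrates how a registry that mixes classes and partial objects can be used uniformly: every value is a callable that returns a translator when called without arguments. The two translator classes are simplified stand-ins, their constructor signatures are assumptions, and build_translator is a hypothetical helper that does not appear in the original configuration.

```python
from functools import partial


# Stand-in classes for illustration only; the real translators live in the project.
class NllbTranslator:
    def __init__(self):
        self.model_name = "nllb"


class LLMTranslator:
    def __init__(self, model_name):
        self.model_name = model_name


MODEL_REGISTRY = {
    "nllb": NllbTranslator,
    "mistral": partial(LLMTranslator, model_name="mistral:7b"),
}


def build_translator(name):
    """Look up a registry entry and call it to obtain a translator instance."""
    try:
        factory = MODEL_REGISTRY[name]
    except KeyError:
        known = sorted(MODEL_REGISTRY)
        raise ValueError(f"Unknown model {name!r}; known models: {known}") from None
    # A class and a partial are both callables, so no special-casing is needed.
    return factory()


translator = build_translator("mistral")
print(type(translator).__name__, translator.model_name)  # LLMTranslator mistral:7b
```

Because partial fixes model_name in advance, one LLMTranslator class can back several registry entries without a subclass per model.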

Models to Evaluate

The MODELS_TO_EVALUATE list determines which models are run during evaluation. Each entry should match a key in MODEL_REGISTRY:

MODELS_TO_EVALUATE = [
    "mistral",
    "m2m100",
    "marian",
    "nllb",
    "mbart50",
    "llama3.1",
    "llama3.2",
    "gemma",
    "phi3",
]
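
Since the list refers to registry entries by name, a typo would otherwise surface only when the misspelled model is reached during a run. A minimal sketch of an up-front check follows; select_models is a hypothetical helper and the registry values are placeholders.

```python
def select_models(registry, names):
    """Return (name, factory) pairs in list order, rejecting unknown or repeated names."""
    unknown = [n for n in names if n not in registry]
    if unknown:
        raise KeyError(f"Not in registry: {unknown}")
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"Listed more than once: {duplicates}")
    return [(n, registry[n]) for n in names]


# Placeholder registry; in the framework this would be MODEL_REGISTRY.
registry = {"nllb": object, "marian": object, "mistral": object}

selected = select_models(registry, ["mistral", "marian"])
print([name for name, _ in selected])  # ['mistral', 'marian']
```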

Dataset Configuration

Evaluation uses a fixed subset of the WMT19 dataset:

DATASET_NAME = "wmt19"
DATASET_SPLIT = "train[:1000]"

DATASET_NAME identifies the source dataset.
DATASET_SPLIT selects the first 1,000 sentence pairs of the train split for each language pair. The slice is taken from the start of the split, so it is not a random sample.
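
The following sketch shows one way these two settings could be consumed. It assumes the data is loaded with the Hugging Face datasets library, which the split syntax suggests but the configuration does not state. load_eval_subset and to_sentence_pairs are hypothetical helper names.

```python
DATASET_NAME = "wmt19"
DATASET_SPLIT = "train[:1000]"


def load_eval_subset(lang_pair):
    """Load the configured slice for one language pair, e.g. "de-en"."""
    # Imported lazily so the module can be imported without the library installed.
    from datasets import load_dataset

    # The language pair is passed as the dataset configuration name.
    return load_dataset(DATASET_NAME, lang_pair, split=DATASET_SPLIT)


def to_sentence_pairs(rows, lang_pair):
    """Convert rows shaped like {"translation": {"de": ..., "en": ...}} to (source, target) tuples."""
    src, tgt = lang_pair.split("-")
    return [(row["translation"][src], row["translation"][tgt]) for row in rows]


# Usage (downloads data, so it is not executed here):
# pairs = to_sentence_pairs(load_eval_subset("de-en"), "de-en")
```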

Output Directory

Generated evaluation reports are saved to:

OUTPUT_DIR = Path("reports")
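
A relative path such as this one resolves against the current working directory, and the directory may not exist on a first run. The sketch below creates it on demand. The report_path helper and the file naming scheme are assumptions made for illustration.

```python
from pathlib import Path

OUTPUT_DIR = Path("reports")


def report_path(model_name, lang_pair, output_dir=OUTPUT_DIR):
    """Return the report file path for one model and language pair, creating the directory if needed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <model>_<pair>.json
    return output_dir / f"{model_name}_{lang_pair}.json"
```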

Language Code Mappings

Each translation model expects source and target languages in its own code format. The mappings are defined in an external YAML file:

LANGUAGE_MAPPING_PATH = Path("configs/language_mappings.yaml")

Sample structure from language_mappings.yaml:

de-en:
  nllb:    {source: deu_Latn,  target: eng_Latn}
  mbart50: {source: de_DE,     target: en_XX}
  marian:  {source: de,        target: en}
  mistral: {source: German,    target: English}

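The same structure is repeated for each language pair (fi-en, gu-en, etc.), with one entry per model. Each model therefore receives language identifiers in exactly the format its tokenizer or prompt expects: in the sample above, nllb uses deu_Latn, marian uses de, and the LLM-based mistral entry uses the plain language name German.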
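
As an illustration of how the nested mapping can be read, the sketch below looks up the source and target codes for one model and language pair. It assumes the file is parsed with PyYAML. load_language_mappings and get_language_codes are hypothetical helper names, and the inline dictionary mirrors what the sample YAML above would parse to.

```python
from pathlib import Path

LANGUAGE_MAPPING_PATH = Path("configs/language_mappings.yaml")


def load_language_mappings(path=LANGUAGE_MAPPING_PATH):
    """Parse the YAML file into a nested dict: pair -> model -> {source, target}."""
    import yaml  # PyYAML; imported lazily so the lookup below works without it

    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


def get_language_codes(mappings, lang_pair, model_name):
    """Return (source_code, target_code) for one model and language pair."""
    try:
        entry = mappings[lang_pair][model_name]
    except KeyError:
        raise KeyError(
            f"No language codes for model {model_name!r} and pair {lang_pair!r}"
        ) from None
    return entry["source"], entry["target"]


# Equivalent of the parsed sample, used here instead of reading the file.
mappings = {
    "de-en": {
        "nllb": {"source": "deu_Latn", "target": "eng_Latn"},
        "mistral": {"source": "German", "target": "English"},
    }
}
print(get_language_codes(mappings, "de-en", "nllb"))  # ('deu_Latn', 'eng_Latn')
```

Raising an explicit error for a missing pair or model makes gaps in the YAML file visible before any translation is attempted.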

Summary

Config Parameter	Purpose
`MODEL_REGISTRY`	Maps model names to translator classes or pre-configured partial constructors
`MODELS_TO_EVALUATE`	List of models to run
`DATASET_NAME`	Defines which dataset to use
`DATASET_SPLIT`	Limits the dataset to the first 1,000 training pairs per language pair
`OUTPUT_DIR`	Destination for reports and results
`LANGUAGE_MAPPING_PATH`	Path to language code definitions (YAML)