Commit 85ef370

Add LLM running code and docs (#1147)
1 parent efe27e0 commit 85ef370

21 files changed: +4399 −0 lines

experiments/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
# Very experimental code lives here

experiments/llmaat/README.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# LLM as a teacher (llmaat)

The goal is to produce high-quality parallel translation datasets with LLMs.
This will allow fine-tuning NMT models to improve quality, and possibly replacing the teacher training stage by using the LLM-produced data directly.

This work follows the paper [Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data](https://arxiv.org/pdf/2408.06537).

It also uses the evaluation dataset and the prompt from [WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects](https://arxiv.org/html/2502.12404v1).

## Selecting a corpus

The idea is to have a diverse monolingual dataset to translate with an LLM.
It's more efficient to cluster a sample first and then assign clusters to the full corpus based on the centroids.
This part is not fully automated.

Steps:
1. Find a big monolingual corpus (100+M sentences). It can be a part of HPLT and NewsCrawl or just one side of our typical merged parallel corpus. It should be deduplicated.
2. Sample a part of it using `shuf -n 1000000`.
3. Calculate and save embeddings for the sample (see [notebooks/Select corpus.ipynb]()). We use https://huggingface.co/intfloat/multilingual-e5-small. To speed it up and utilize all GPUs on a machine, we split the sample with `split` and run [scripts/emb_corpus_ddp.py]() with `torchrun --nproc_per_node=8 emb_corpus_ddp.py`.
4. Load the embeddings, cluster them with K-Means (5000 clusters) and save the centroids to a file.
5. Go through the whole corpus and assign clusters based on the closest centroids by doing a nearest-neighbor search. Run `torchrun --nproc_per_node=8 cluster_corpus_ddp.py`; the cluster IDs are saved to a file.
6. Select 1M, 10M and 50M lines by sampling uniformly from the clusters.

The diverse samples are located here: `gs://releng-translations-dev/data/mono-llm/diverse_sample.{1,10,50}M.en.zst`

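The cluster-assignment and uniform-sampling logic (steps 4–6) can be sketched as follows. `assign_clusters` and `sample_uniformly` are hypothetical helpers for illustration, not the actual DDP scripts; the real runs shard this work across GPUs with `torchrun`.

```python
import numpy as np


def assign_clusters(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each embedding to its nearest centroid (brute-force NN search)."""
    # Pairwise squared Euclidean distances, shape (n_sentences, n_clusters)
    dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)


def sample_uniformly(cluster_ids: np.ndarray, n_total: int, rng=None) -> np.ndarray:
    """Pick roughly n_total line indices, spread evenly over the clusters."""
    rng = rng or np.random.default_rng(0)
    clusters = np.unique(cluster_ids)
    per_cluster = max(1, n_total // len(clusters))
    picked = []
    for c in clusters:
        idx = np.flatnonzero(cluster_ids == c)
        take = min(per_cluster, len(idx))
        picked.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(picked)
```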
## Evaluating LLMs

Run [flows/llm_eval_flow.py]() on Mozilla Outerbounds Metaflow:

```bash
export HUGGING_FACE_HUB_TOKEN=...
export WANDB_API_KEY=...
python llm_eval_flow.py --environment=pypi --config config ./configs/config.vllm.json run --experiment greedy --model gemma-3-27b-vllm
```

The evaluation results are available on Weights and Biases: https://wandb.ai/moz-translations/llm-evals?nw=nwuserepavlov

It's possible to add more LLMs and inference methods to [flows/llm_runner.py](). `--model gemma-3-27b-vllm` points to one of the available implementations.

The decoding config can be modified in [flows/configs/config.vllm.json]().

The prompt can be set in the config. Available prompt templates are in [flows/prompts.py]().

The flow can run evaluation for multiple language pairs in one run; just add more languages to the config. All pairs are en-xx.

The translations produced by an LLM during evaluation are uploaded to `gs://releng-translations-dev/data/llm-evals/wmt24pp/`.

We calculate COMET-22 and MetricX-24 scores. The size of the MetricX model is set in the `eval_metricx` step.

It's preferable to use vLLM as it has up to 10x higher throughput than naive inference with HF Transformers.

vLLM config:

```python
{
    "batch_size": 1024,  # Should be big enough to get the most out of the vLLM optimizations
    "langs": ["ru_RU"],  # Languages to evaluate
    "max_tok_alpha": 2.0,  # A factor to multiply the number of input tokens by to get the maximum number of output tokens. It might depend on the output language. An optimization.
    "prompt": "noomit_fewshot",  # Prompt template key
    "llm": {
        "max_model_len": 1024,  # The model context size (maximum total of input and output tokens)
        "tensor_parallel_size": 1  # The number of GPUs
    },
    "decoding": {
        "temperature": 0,  # Temperature 0 means greedy decoding; change to activate sampling
        "n": 1  # Produce only 1 candidate; increase for QE reranking
    }
}
```

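As an illustration of how `max_tok_alpha` interacts with `max_model_len`, a hypothetical helper (not the flow's actual code) could compute the per-request output budget like this:

```python
def output_token_budget(n_input_tokens: int, max_tok_alpha: float, max_model_len: int) -> int:
    """Allow up to alpha * input-length output tokens, while keeping
    input + output within the model's context window."""
    budget = int(n_input_tokens * max_tok_alpha)
    return max(1, min(budget, max_model_len - n_input_tokens))
```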
## Generating datasets

Run [flows/llm_run_flow.py]() on Mozilla Outerbounds Metaflow:

```bash
export HUGGING_FACE_HUB_TOKEN=...
python llm_run_flow.py \
  --environment=pypi --config config ./configs/config.vllm.json run --experiment finetune10M \
  --model gemma-3-27b-vllm --data_size 10 --lang ru_RU --part_size 500000 --max-workers 4
```

`--data_size 10` - use the 10M dataset to produce 10M translations

`--part_size 500000` - how many lines to process in one Metaflow task

`--max-workers 4` - run at most 4 tasks simultaneously (the current limitation on the number of GPUs)

The translations will be uploaded to `gs://releng-translations-dev/data/llm/`.

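For illustration, sharding the corpus into `--part_size` chunks (one per Metaflow task) amounts to something like the following hypothetical helper; the flow's actual sharding code may differ:

```python
def split_into_parts(n_lines: int, part_size: int) -> list[tuple[int, int]]:
    """Return (start, end) line ranges covering the corpus, one range per task."""
    return [
        (start, min(start + part_size, n_lines))
        for start in range(0, n_lines, part_size)
    ]
```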
## Quality-aware decoding (QE reranking)

Following the NewsPaLM paper, it's possible to replace regular greedy decoding with sampling multiple candidates and choosing the best one using the MetricX-24-Hybrid quality estimation model.

It requires activating the code branch with the `pick_best` Metaflow step and changing the decoding config (for vLLM, set `decoding.n` > 1, e.g. `decoding.n: 32`).
Decoding will become significantly slower as the model needs to generate N samples instead of one.

Also, the activated `pick_best` step that runs the MetricX model is currently unoptimized and quite slow.

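The core of QE reranking is simple: generate N candidates per source sentence and keep the one MetricX scores best. A minimal sketch of the selection step (MetricX-24 scores are error estimates, so lower is better):

```python
def pick_best_candidate(candidates: list[str], qe_scores: list[float]) -> str:
    """Keep the candidate with the lowest (best) MetricX-24 QE score."""
    return candidates[qe_scores.index(min(qe_scores))]
```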
## Language codes

We use the WMT24++ format of language codes, which includes a reference to a country, because some prompts require specifying it.

See all available codes in [flows/langs.py]().
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{
  "batch_size": 8,
  "langs": ["ru"],
  "max_tok_alpha": 2.0,
  "decoding": {
    "num_beams": 5,
    "do_sample": true,
    "temperature": 0.6,
    "top_p": 0.9
  }
}
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{
  "batch_size": 64,
  "langs": ["ru"],
  "max_tok_alpha": 2.0,
  "decoding": {
    "num_beams": 1,
    "do_sample": false,
    "temperature": 0
  }
}
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
{
  "batch_size": 4,
  "langs": ["ru"],
  "max_tok_alpha": 2.0,
  "decoding": {
    "num_beams": 1,
    "do_sample": true,
    "temperature": 1.0,
    "top_p": 0.9,
    "num_return_sequences": 16
  }
}
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{
  "batch_size": 64,
  "langs": ["ru"],
  "max_tok_alpha": 2.0,
  "decoding": {
    "num_beams": 1,
    "do_sample": true,
    "temperature": 0.6,
    "top_p": 0.9
  }
}
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
{
  "batch_size": 1024,
  "langs": ["ru_RU"],
  "max_tok_alpha": 2.0,
  "prompt": "noomit_fewshot",
  "llm": {
    "max_model_len": 1024,
    "tensor_parallel_size": 1
  },
  "decoding": {
    "temperature": 0,
    "n": 1
  }
}

experiments/llmaat/flows/evals.py

Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
from typing import List

EVAL_PAIRS = (
    "en-ar_EG",
    "en-ar_SA",
    "en-bg_BG",
    "en-bn_IN",
    "en-ca_ES",
    "en-cs_CZ",
    "en-da_DK",
    "en-de_DE",
    "en-el_GR",
    "en-es_MX",
    "en-et_EE",
    "en-fa_IR",
    "en-fi_FI",
    "en-fil_PH",
    "en-fr_CA",
    "en-fr_FR",
    "en-gu_IN",
    "en-he_IL",
    "en-hi_IN",
    "en-hr_HR",
    "en-hu_HU",
    "en-id_ID",
    "en-is_IS",
    "en-it_IT",
    "en-ja_JP",
    "en-kn_IN",
    "en-ko_KR",
    "en-lt_LT",
    "en-lv_LV",
    "en-ml_IN",
    "en-mr_IN",
    "en-nl_NL",
    "en-no_NO",
    "en-pa_IN",
    "en-pl_PL",
    "en-pt_BR",
    "en-pt_PT",
    "en-ro_RO",
    "en-ru_RU",
    "en-sk_SK",
    "en-sl_SI",
    "en-sr_RS",
    "en-sv_SE",
    "en-sw_KE",
    "en-sw_TZ",
    "en-ta_IN",
    "en-te_IN",
    "en-th_TH",
    "en-tr_TR",
    "en-uk_UA",
    "en-ur_PK",
    "en-vi_VN",
    "en-zh_CN",
    "en-zh_TW",
    "en-zu_ZA",
)


lang_map = {
    pair.split("_")[0].split("-")[1]: pair
    for pair in EVAL_PAIRS
    if pair.split("_")[1] not in {"TW", "PT", "CA", "EG", "TZ"}
}


def load_data(lang):
    from datasets import load_dataset

    # if lang not in lang_map:
    #     raise ValueError(f"Language {lang} is not supported")

    # Login using e.g. `huggingface-cli login` to access this dataset
    print(f"Downloading dataset for {lang}")
    lp = f"en-{lang}"
    ds = load_dataset("google/wmt24pp", lp)
    filtered = ds.filter(lambda ex: not ex["is_bad_source"] and ex["lp"] == lp)["train"]
    return filtered["source"], filtered["target"]


def eval_comet(source_texts, target_translations, target_references):
    import comet

    comet_checkpoint = comet.download_model("Unbabel/wmt22-comet-da")
    comet_model = comet.load_from_checkpoint(comet_checkpoint)
    comet_data = []
    for source, target, target_ref in zip(source_texts, target_translations, target_references):
        comet_data.append({"src": source, "mt": target, "ref": target_ref})
    comet_results = comet_model.predict(comet_data, gpus=1)
    return round(comet_results.system_score * 100, 2)


def eval_metricx(
    source_texts,
    target_translations,
    target_references,
    model_size="xl",
    fp16=True,
    batch_size=8,
):
    """
    https://huggingface.co/google/metricx-24-hybrid-xxl-v2p6

    Available model sizes: "large" (1.2B), "xl" (3.7B), "xxl" (13B)
    """

    import json
    from statistics import mean

    from metricx.predict import predict

    with open("input.jsonl", "w") as in_file:
        for source, target, target_ref in zip(
            source_texts, target_translations, target_references
        ):
            ex_dict = {"source": source, "reference": target_ref, "hypothesis": target}
            in_file.write(json.dumps(ex_dict) + "\n")

    model_name = f"google/metricx-24-hybrid-{model_size}-v2p6"
    if fp16:
        model_name += "-bfloat16"

    # batch size is divided by the number of GPUs, set it equal or higher
    print(f"Running evaluation with {model_name} reference based")
    predict(
        tokenizer=f"google/mt5-{model_size}",
        model_name_or_path=model_name,
        max_input_length=1536,
        batch_size=batch_size,
        input_file="input.jsonl",
        output_file="output.ref.jsonl",
        qe=False,
    )

    print(f"Running evaluation with {model_name} reference free QE")
    predict(
        tokenizer=f"google/mt5-{model_size}",
        model_name_or_path=model_name,
        max_input_length=1536,
        batch_size=batch_size,
        input_file="input.jsonl",
        output_file="output.qe.jsonl",
        qe=True,
    )

    with open("output.qe.jsonl") as out_qe:
        qe_score = mean([float(json.loads(line)["prediction"]) for line in out_qe])
    with open("output.ref.jsonl") as out_ref:
        ref_score = mean([float(json.loads(line)["prediction"]) for line in out_ref])

    return {f"metricx24-{model_size}-qe": qe_score, f"metricx24-{model_size}": ref_score}


def select_best(
    source: List[str], translations: List[List[str]], model_size="xl", fp16=True, batch_size=8
) -> List[str]:
    import json

    from metricx.predict import predict

    with open("input.jsonl", "w") as in_file:
        for src, tr_candidates in zip(source, translations):
            for translation in tr_candidates:
                ex_dict = {"source": src, "hypothesis": translation}
                in_file.write(json.dumps(ex_dict) + "\n")

    model_name = f"google/metricx-24-hybrid-{model_size}-v2p6"
    if fp16:
        model_name += "-bfloat16"

    print(f"Running evaluation with {model_name} reference free QE")
    predict(
        tokenizer=f"google/mt5-{model_size}",
        model_name_or_path=model_name,
        max_input_length=1536,
        batch_size=batch_size,
        input_file="input.jsonl",
        output_file="output.qe.jsonl",
        qe=True,
    )

    with open("output.qe.jsonl") as out_qe:
        scores = [json.loads(line)["prediction"] for line in out_qe]

    num_candidates = len(translations[0])

    best = []
    for i, candidates in enumerate(translations):
        start = i * num_candidates
        candidate_scores = scores[start : start + num_candidates]
        # MetricX scores are error estimates, so lower is better
        best_idx = candidate_scores.index(min(candidate_scores))
        best.append(candidates[best_idx])
    return best


def _run_cmd(cmd):
    import subprocess

    try:
        subprocess.run(cmd, check=True, capture_output=True, shell=True)
    except subprocess.CalledProcessError as e:
        print("STDOUT:", e.stdout.decode("utf-8", errors="replace"))
        print("STDERR:", e.stderr.decode("utf-8", errors="replace"))
        raise
