This project demonstrates fine-tuning a multilingual Transformer (e.g., XLM-RoBERTa) on a subset of SQuAD v1.1 (English), then evaluating zero-shot performance on the XQuAD dataset (Spanish). By training only on English data, we test how well the model transfers its knowledge to another language without further fine-tuning.
**Load and Subset SQuAD**
- SQuAD v1.1 is a large English QA dataset.
- For demonstration and to keep training time low, we use a small subset (e.g., 200 examples for training, 50 for validation).
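With the Hugging Face `datasets` library, this subsetting takes only a couple of lines. A minimal sketch (the variable names are illustrative and mirror the customization snippet further below):

```python
from datasets import load_dataset

# Load the full English SQuAD v1.1 dataset from the Hugging Face Hub.
squad_dataset = load_dataset("squad")

# Keep only a small slice so that fine-tuning finishes quickly.
train_dataset = squad_dataset["train"].select(range(200))
valid_dataset = squad_dataset["validation"].select(range(50))
```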
**Load and Subset XQuAD**
- XQuAD is a multilingual variant of SQuAD, translated into multiple languages.
- We focus on the Spanish split and again use a small subset (e.g., 50 examples) to reduce runtime.
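Loading the Spanish data could look like the sketch below (the `xquad.es` configuration name follows the dataset card on the Hugging Face Hub; the variable names are illustrative):

```python
from datasets import load_dataset

# XQuAD ships one configuration per language; "xquad.es" is the Spanish one.
xquad_es = load_dataset("xquad", "xquad.es")

# XQuAD only provides a validation split; take a small subset for fast evaluation.
xquad_subset = xquad_es["validation"].select(range(50))
```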
**Preprocess**
- Tokenize `(context, question)` pairs and compute start/end positions for answers.
- Keep crucial fields (`offset_mapping`, `example_id`, `context_text`) to reconstruct answers during post-processing.
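The training-side preprocessing follows the standard SQuAD recipe: tokenize with character offsets, then map each answer span to token start/end positions. A simplified sketch (no sliding window over long contexts; the function name and hyperparameters are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def preprocess_train(examples):
    """Tokenize (question, context) pairs and map answer spans to token positions."""
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",   # only truncate the context, never the question
        max_length=384,
        padding="max_length",
        return_offsets_mapping=True,
    )

    start_positions, end_positions = [], []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        answer = examples["answers"][i]
        answer_start = answer["answer_start"][0]
        answer_end = answer_start + len(answer["text"][0])

        # sequence_ids marks question tokens with 0 and context tokens with 1.
        sequence_ids = tokenized.sequence_ids(i)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        # Default to (0, 0) if the answer was truncated out of the context window.
        start_pos = end_pos = 0
        if offsets[context_start][0] <= answer_start and offsets[context_end][1] >= answer_end:
            for idx in range(context_start, context_end + 1):
                if offsets[idx][0] <= answer_start < offsets[idx][1]:
                    start_pos = idx
                if offsets[idx][0] < answer_end <= offsets[idx][1]:
                    end_pos = idx
        start_positions.append(start_pos)
        end_positions.append(end_pos)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

# Typically applied with datasets.map in batched mode, e.g.:
# tokenized_train = train_dataset.map(preprocess_train, batched=True,
#                                     remove_columns=train_dataset.column_names)
```

For evaluation, a similar function additionally keeps `offset_mapping`, `example_id`, and the raw context so predicted spans can be converted back to text.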
**Train**
- We use Hugging Face’s `Trainer` API to fine-tune `xlm-roberta-base` for a few epochs on the English subset of SQuAD.
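A minimal `Trainer` setup could look like this sketch (the `output_dir`, hyperparameter values, and the `tokenized_train`/`tokenized_valid` names are illustrative, not necessarily what `main.py` uses):

```python
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")

training_args = TrainingArguments(
    output_dir="xlm-r-squad-en",       # illustrative output directory
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,     # output of the preprocessing step above
    eval_dataset=tokenized_valid,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)

trainer.train()
```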
**Evaluate**
- Evaluate the model on (a) the English SQuAD validation subset, and (b) zero-shot on the Spanish XQuAD subset.
- Post-process predictions to convert logits into text answers, and compute Exact Match (EM) and F1 with the official SQuAD metric.
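Once the start/end logits have been post-processed into text answers, the SQuAD metric expects predictions and references in the shape below (using the `evaluate` library here is an assumption; the script may compute the metric differently):

```python
import evaluate

squad_metric = evaluate.load("squad")

# Dummy example in the format the metric expects; in practice the ids and
# prediction_text come from post-processing the model's start/end logits.
predictions = [{"id": "example-0", "prediction_text": "Madrid"}]
references = [
    {"id": "example-0", "answers": {"text": ["Madrid"], "answer_start": [42]}}
]

results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # {'exact_match': ..., 'f1': ...}
```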
A typical structure might look like this:
```
.
├── main.py              <- Main script
├── train.py
├── test.py
├── consts.py
├── README.md            <- (this file)
└── requirements.txt     <- optional, if you want to list exact dependencies
```
- `main.py`: The main code script with all the data loading, preprocessing, training, and evaluation logic.
- `README.md`: This documentation.
- `requirements.txt`: (Optional) A list of Python dependencies (e.g., `transformers`, `datasets`, `torch`).
1. Create and activate a virtual environment (recommended, but optional):

   ```bash
   python -m venv env
   source env/bin/activate
   ```

2. Install the requirements:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the script:

   ```bash
   python main.py
   ```
**Expected output:**
- The script will load a small subset of SQuAD (200 training examples, 50 validation examples).
- It will tokenize and compute start/end positions, then train for a small number of epochs (e.g., 1 epoch).
- It will evaluate on the English validation subset, printing out Exact Match and F1 scores.
- Finally, it will predict on the Spanish XQuAD subset (50 examples) and compute the same metrics zero-shot.
**Training Data Size**
- In `train.py`, you’ll see something like:

  ```python
  train_dataset = squad_dataset["train"].select(range(200))
  valid_dataset = squad_dataset["validation"].select(range(50))
  ```

  Increase (or decrease) these values for more (or fewer) examples.
**Number of Epochs**
- Adjust `num_train_epochs` in `TrainingArguments` to train longer on the small subset (or fewer epochs if you just want a quick test); see the combined sketch after this section.
**Batch Size**
- Change `per_device_train_batch_size` and `per_device_eval_batch_size` to accommodate your GPU memory.
**Different Languages**
- The script loads Spanish from XQuAD by specifying `"es"`.
- You can try `"ar"`, `"hi"`, `"zh"`, etc., if available in XQuAD or other multilingual QA datasets.
**Larger Models**
- Swap `model_name = "xlm-roberta-base"` for `"xlm-roberta-large"` or `"bert-base-multilingual-cased"` if you have sufficient resources.
**Using the Full Dataset**
- Remove or adjust the `.select(...)` lines if you want to train on all of SQuAD. But note that training time will increase significantly.
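Taken together, these knobs live in a handful of lines. An illustrative combined sketch (variable names and values are examples only, assuming the snippet style shown above):

```python
from transformers import TrainingArguments

# Illustrative customization knobs (names and values are examples only).
model_name = "xlm-roberta-base"     # or "xlm-roberta-large", "bert-base-multilingual-cased"
xquad_config = "xquad.es"           # or "xquad.ar", "xquad.hi", "xquad.zh", ...
train_size, valid_size = 200, 50    # raise these (or drop .select) to use more data

training_args = TrainingArguments(
    output_dir="xlm-r-squad-en",
    num_train_epochs=1,              # train longer for a better fit
    per_device_train_batch_size=8,   # shrink if you run out of GPU memory
    per_device_eval_batch_size=8,
)
```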
- Hugging Face Transformers
- Hugging Face Datasets (SQuAD)
- Hugging Face Datasets (XQuAD)
- SQuAD v1.1 Paper
- XQuAD Paper
This code references open-source datasets (SQuAD, XQuAD) and builds on Hugging Face Transformers, which is released under the Apache 2.0 license. Ensure compliance with any dataset-specific licenses if you plan to use them for production or commercial research.
Feel free to open issues or discussions if you have questions about multilingual QA, zero-shot transfer, or fine-tuning larger language models. Contributions and ideas for improvement are always welcome!