This repository accompanies a study on mitigating catastrophic forgetting in code-switched (CoS) multitask continual finetuning. The work consists of two primary experimental components:
- Empirical validation of catastrophic forgetting in a vanilla finetuning pipeline.
- Evaluation of forgetting mitigation strategies, with a focus on adapter-based methods combined with data replay and/or knowledge distillation, to improve knowledge retention across incremental CoS tasks.
The repository provides all necessary scripts to collect, preprocess, and generate CoS datasets, as well as to replicate the training and evaluation of all experimental configurations reported in the study.
This directory contains scripts for generating training datasets for CoS experiments.
Some scripts are adapted from the Lost in the Mix (LIM) project:
https://github.com/amr-mohamedd/Lost-in-the-Mix
- load_raw_train_data.py: Downloads the original datasets, including MLQA, MMLU/MMMLU, and XNLI.
- mlqa_mcq_gen.py: Generates distractor (incorrect) answers for MLQA multiple-choice questions.
- training_data_gen.py: Wrapper script for generating CoS training data from the original datasets.
- training_cos_gen.py: Implements the LIM-CoS and T-CoS generation procedures.
- async_llama_query.py: Provides concurrent querying support for large language models (a minimal concurrency sketch follows this list).
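For reference, bulk LLM querying of this kind is typically bounded with an asyncio semaphore against an OpenAI-compatible endpoint. The sketch below illustrates that pattern only; the endpoint URL, model ID, and function names are assumptions, not the actual interface of async_llama_query.py.

```python
# Minimal concurrent-querying sketch. ENDPOINT, MODEL, and the function
# names are illustrative assumptions, not async_llama_query.py's interface.
import asyncio
import aiohttp

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # e.g. a local vLLM server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"              # hypothetical model ID

async def query_one(session: aiohttp.ClientSession,
                    sem: asyncio.Semaphore, prompt: str) -> str:
    """Send a single chat request, bounded by the shared semaphore."""
    payload = {"model": MODEL,
               "messages": [{"role": "user", "content": prompt}]}
    async with sem:  # cap the number of in-flight requests
        async with session.post(ENDPOINT, json=payload) as resp:
            resp.raise_for_status()
            data = await resp.json()
            return data["choices"][0]["message"]["content"]

async def query_all(prompts: list[str], max_concurrency: int = 16) -> list[str]:
    """Fan out all prompts concurrently; results keep the input order."""
    sem = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [query_one(session, sem, p) for p in prompts]
        return await asyncio.gather(*tasks)

# answers = asyncio.run(query_all(["Translate ...", "Paraphrase ..."]))
```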
This directory contains scripts for generating evaluation and benchmark datasets for CoS experiments.
Some scripts are adapted from the Lost in the Mix (LIM) project:
https://github.com/amr-mohamedd/Lost-in-the-Mix
- prepare_dataset.py: Downloads the original benchmark datasets, including Belebele, MMLU/MMMLU, and XNLI (see the download sketch after this list).
- eval_data_gen.py: Wrapper script for generating CoS evaluation data from the original benchmark datasets.
- training_cos_gen.py: Implements the LIM-CoS and T-CoS generation procedures for evaluation data.
- async_llama_query.py: Provides concurrent querying support for large language models.
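For orientation, all three benchmark sources are available on the Hugging Face Hub, so a minimal download sketch looks like the following. The hub IDs, configurations, and splits shown are the public defaults and may differ from those hard-coded in prepare_dataset.py.

```python
# Minimal sketch of pulling the benchmark sources with Hugging Face
# `datasets`; hub IDs/configs below are the public ones, which may not
# match the exact identifiers used in prepare_dataset.py.
from datasets import load_dataset

belebele = load_dataset("facebook/belebele", "eng_Latn", split="test")
mmlu = load_dataset("cais/mmlu", "all", split="test")
xnli = load_dataset("facebook/xnli", "en", split="test")

print(len(belebele), len(mmlu), len(xnli))
```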
This directory contains code for model training, continual learning configurations, evaluation, and Pareto dominance ranking of the resulting configurations (a minimal ranking sketch appears at the end of this section).
- qwen_base_head_training.py: Pre-trains the classifier head for the Qwen3-0.6B base model.
- adp_lora_normal_sft_3_save.py: Main training script supporting multiple training configurations, including:
  - Vanilla fine-tuning
  - Pfeiffer adapters
  - LoRA adapters (a minimal adapter-attachment sketch follows at the end of this section)
  - Raw data replay
  - Learning without Forgetting (LwF) knowledge distillation (a sketch of the LwF loss also follows below)
- model_eval.py: Model evaluation script.
- eval_utils.py: Evaluation utilities, including accuracy computation and the (deprecated) knowledge entropy metric.
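For the adapter-based configurations, the sketch below shows one way LoRA adapters can be attached with the peft library, assuming a sequence-classification head on Qwen3-0.6B (as qwen_base_head_training.py suggests). The rank, scaling factor, dropout, target modules, and label count are illustrative, not the study's hyperparameters.

```python
# Minimal LoRA-attachment sketch with `peft`; all hyperparameters and the
# number of labels are illustrative assumptions, not the study's values.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-0.6B", num_labels=4  # hypothetical number of answer choices
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapters + head remain trainable
```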
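The LwF configuration augments the current-task loss with a distillation term that penalizes divergence from the frozen previous-stage model. Below is a minimal sketch of the standard formulation (cross-entropy plus temperature-scaled KL); the weighting and temperature actually used in the training script may differ.

```python
# Minimal LwF-style loss sketch: standard formulation, not necessarily
# the exact loss implemented in adp_lora_normal_sft_3_save.py.
import torch
import torch.nn.functional as F

def lwf_loss(new_logits: torch.Tensor,
             old_logits: torch.Tensor,
             labels: torch.Tensor,
             T: float = 2.0,
             alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy on the new task plus a KL term toward the old model."""
    ce = F.cross_entropy(new_logits, labels)
    # KL(old || new) on temperature-softened distributions, scaled by T^2
    kl = F.kl_div(F.log_softmax(new_logits / T, dim=-1),
                  F.softmax(old_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kl
```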
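Pareto dominance ranking orders configurations by their per-task metric vectors: a configuration is dominated if another is at least as good on every metric and strictly better on at least one. A minimal sketch, with hypothetical configuration names and accuracy vectors; the exact ranking scheme in the repository may differ.

```python
# Minimal non-dominated (Pareto) ranking sketch over per-task metric vectors.
def dominates(a: list[float], b: list[float]) -> bool:
    """a dominates b if it is >= on every metric and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(points: dict[str, list[float]]) -> dict[str, int]:
    """Rank 0 = non-dominated front; peel successive fronts iteratively."""
    remaining = dict(points)
    ranks: dict[str, int] = {}
    rank = 0
    while remaining:
        front = [k for k, v in remaining.items()
                 if not any(dominates(w, v)
                            for j, w in remaining.items() if j != k)]
        for k in front:
            ranks[k] = rank
            del remaining[k]
        rank += 1
    return ranks

# Hypothetical example: accuracy on (task A, task B) per configuration
# pareto_ranks({"lora": [0.81, 0.74], "replay": [0.79, 0.78], "vanilla": [0.83, 0.61]})
```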