nju-websoft/CTO

Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization, IJCAI 2026

📌 Introduction

This repository provides a training pipeline for CTO, including supervised fine-tuning, semantic model training, and preference optimization. Follow the steps below to set up your environment and train the models.

📃 File Structure

├── data/                    # Datasets
│   ├── humanevalx
│   ├── transcoder-test
│   ├── xlcost
│   ├── reward_data          # Data for the semantic reward model
│   └── preference_dataset   # Data for preference learning
├── dataset/                 # Dataset reader pipeline
│   ├── base_dataset
│   ├── humanevalx.py
│   ├── transcoder
│   ├── xlcost.py
│   └── utils.py
├── env/                     # Files required for Java execution
├── evaluation/              # Evaluation pipeline
│   ├── lang_executor/       # Executors for the different programming languages
│   ├── temp/                # Temporary files created during evaluation
│   └── tests/               # Tests for lang_executor
├── semantic/                # Module for semantic model training
│   ├── train_reward.sh      # Script to train the semantic model
│   └── configs              # Training configs
├── trainer                  # Model trainer
├── requirements.txt         # Dependency file
├── cto_trainer.py           # Script for preference optimization
├── sft_trainer.py           # Script for supervised fine-tuning
├── merge_peft.py            # Script to merge LoRA weights
├── run.py                   # Script for evaluation
└── README.md

🔧 Environment Setup

First, install the required dependencies by running:

pip install -r requirements.txt

Download data here.

📚 Supervised Fine-tuning & Preference Dataset Construction

To perform supervised fine-tuning, execute the following command:

python sft_trainer.py \
    --model_name_or_path codellama/CodeLlama-7b-hf \
    --output_dir <YOUR_LORA_OUTPUT_DIR> \
    --dataset_path <YOUR_SFT_DATASET_PATH>

Replace <YOUR_LORA_OUTPUT_DIR> and <YOUR_SFT_DATASET_PATH> with the appropriate paths; the SFT dataset is located in ./data/xlcost. Then merge the LoRA weights into the original CodeLlama-7B model:

python merge_peft.py \
    --adapter_dir <YOUR_LORA_OUTPUT_DIR>/final_checkpoint \
    --output_dir <YOUR_SFT_MODEL_OUTPUT_DIR>
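The internals of merge_peft.py are not shown in this README. As a conceptual sketch only, merging a LoRA adapter means folding the low-rank update into the base weight: W_merged = W + (alpha / r) * B @ A. The function names, the toy matrices, and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
# Conceptual sketch of a LoRA merge: W_merged = W + (alpha / rank) * B @ A.
# Plain Python lists stand in for tensors; all names here are illustrative.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold the low-rank update into the base weight and return a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out, in), same as the base weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (rank, in)
lora_b = [[0.5], [1.0]]    # shape (out, rank)
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, the adapter is no longer needed at inference time, which is why the merged directory can be passed directly as a model path in the later steps.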

🎯 Semantic Model Training

  • Navigate to the semantic directory.

We provide the training data in reward_data.zip.

Then, train the semantic model by running:

bash ./train_reward.sh

Preference Optimization

Run the following command to perform preference optimization:

python cto_trainer.py \
    --model_path <YOUR_SFT_MODEL_OUTPUT_DIR>/final_checkpoint \
    --src_lang java \
    --tgt_lang cpp \
    --output_dir <CTO_LORA_CHECKPOINT> \
    --preference_dataset_file <PREFERENCE_FILE>

Replace the placeholders with the actual paths to your models and dataset.

We provide the preference dataset in data/preference_dataset.zip.
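The archive has to be unpacked before its contents can be passed to --preference_dataset_file. A minimal sketch using only the Python standard library follows; the helper name extract_archive and the destination directory are assumptions, and the names of the files inside the archive are not documented here.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())


# Example call (the archive path is the one named above; the destination
# directory is an assumption):
# extract_archive("data/preference_dataset.zip", "data/preference_dataset")
```

The returned list of member names can be used to pick the file to pass as <PREFERENCE_FILE>.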

Then merge the LoRA checkpoint:

python merge_peft.py \
    --adapter_dir <CTO_LORA_CHECKPOINT> \
    --output_dir <YOUR_CTO_MODEL_OUTPUT_DIR>

Evaluation

To evaluate the model's CA@1 (computational accuracy with one sample per problem), run:

python run.py \
    --model_path <YOUR_CTO_MODEL_OUTPUT_DIR>/final_checkpoint \
    --dataset_name transcoder \
    --src_lang java \
    --tgt_lang cpp \
    --sample_k 1 \
    --save_path <TRANSLATION_JSON_FILE>

--dataset_name accepts transcoder or humanevalx; --src_lang and --tgt_lang accept java, cpp, or python.
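To evaluate several translation directions in one go, the run.py invocation can be generated for each dataset and language pair. The sketch below only builds the argument lists and does not execute anything. It uses only the flags shown above; the helper name, the results directory, and the output file naming scheme are assumptions, as is the premise that every language pair is available for both datasets.

```python
from itertools import permutations

DATASETS = ["transcoder", "humanevalx"]
LANGUAGES = ["java", "cpp", "python"]


def build_eval_commands(model_path, save_dir, sample_k=1):
    """Return one run.py argument list per (dataset, src_lang, tgt_lang)."""
    commands = []
    for dataset in DATASETS:
        # permutations(..., 2) yields every ordered pair of distinct languages.
        for src, tgt in permutations(LANGUAGES, 2):
            # Hypothetical naming scheme for the output JSON file.
            save_path = f"{save_dir}/{dataset}_{src}2{tgt}.json"
            commands.append([
                "python", "run.py",
                "--model_path", model_path,
                "--dataset_name", dataset,
                "--src_lang", src,
                "--tgt_lang", tgt,
                "--sample_k", str(sample_k),
                "--save_path", save_path,
            ])
    return commands


commands = build_eval_commands(
    "<YOUR_CTO_MODEL_OUTPUT_DIR>/final_checkpoint", "results"
)
print(len(commands))
```

Each argument list can then be passed to subprocess.run(cmd, check=True) from the repository root once the placeholder path has been replaced.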

Citation

@inproceedings{cto2026,
  title = {Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization},
  booktitle = {IJCAI},
  year = {2026}
}
