DPO + QLoRA Fine-Tuning for Oil & Gas Abbretivation Grounding

Cloudera AMP: https://cloudera.github.io/Applied-ML-Prototypes/#/community

Overview

This project fine-tunes a small language model (SLM) using Direct Preference Optimization (DPO) with QLoRA adapters to ground responses in the Oil and Gas domain. Unlike other projects, you can post-train the SLM locally with the given small domain dataset for hands-on-experiment. This AMP is a great starting point to train your private AI with DPO on a laptop or desktop without a remote-GPU. Start with your private model quickly and extend the AMP with larger models, larger datasets, and additional GPUs. This project require GPU for Compute Unified Device Architecture (CUDA) usage and Hugging Face (HF) to load an LLM.

Why DPO?

When should you use DPO instead of Supervised Fine-Tuning (SFT)?

Method	Best For	Data Format
SFT	Teaching the model what to say	Input → Correct output
DPO	Teaching the model which response is better	Input → Preferred vs Rejected

Use DPO when:

You want to correct terminology preferences ("use 'DP' for Dynamic Positioning, not Demographic Parity")
You need to adjust tone, style, or formatting behavior
You have examples of good vs bad responses, not just good responses
You're aligning model behavior rather than adding new knowledge

Use SFT when:

You're teaching the model new domain knowledge
You have high-quality instruction-response pairs
There's a single "correct" answer for each input

Why DPO over RLHF?

Traditional RLHF requires training a separate reward model, then using reinforcement learning (PPO) to optimize against it. This is complex, unstable, and compute-intensive. DPO achieves equivalent results with a simple classification loss on preference pairs—no reward model, no RL loop, and significantly easier to tune.

This makes DPO ideal for on-device fine-tuning where computational resources are limited.

Project Workflow

Edit configurations in 00_configs/dpo.json.
Prepare a dataset for preference pairs in JSONL (Prepared; instruction, chosen, rejected) in 01_data/dpo/train.jsonl.
Run DPO training and merge the adapter with dpo_training.ipynb.
Optional: run the training on 02_src/train_dpo.py, which serves for the back-end of (3).

Key Features

On-device DPO training with QLoRA adapters (4-bit quantization supported; faster) that does not need a remote GPU.
Hands-on workflow on Jupyter Notebook in dpo_training.ipynb.

Hardware check

Please consider the hardware condition using 01_data/gpu_check.py or the GPU check block of the dpo_training.ipynb. You need enough memory to run this code (e.g.,mem_free/total_GB: 3.5/4.3). Important: you need enough GPU (e.g. 2 GB or greater) and CPU memory (e.g, 8 GB or greater)!!

Repository Layout

00_configs/ - training configs and secrets.
01_data/ - training data.
02_src/ - training, inference, and utilities
03_scripts/ - helper scripts.
04_models/ - adapter and merged model outputs.
05_logs/ - training logs.

Getting Started

Prerequisites

Python 3.10+ (match your PyTorch/CUDA build).
CUDA-capable GPU recommended (8GB VRAM typical for LLM of 1B + QLoRA).
HF token for gated models (set HF_TOKEN or 00_configs/secrets.toml).
Policy consent on HF to use meta-llama/Llama-3.2-1B-Instruct model

Project Setup

pip install -r requirements.txt

Data Format

Training data is JSONL with preference pairs:

{"instruction": "...", "chosen": "...", "rejected": "..."}

In this project, we use the paired dataset for domain vocabulary training. For example, 'DP' from Dynamic Positioning', that is defined as a computer-controlled system that automatically maintains a vessel's position and heading using its own propellers and thrusters, eliminating the need for anchors. In contrast, the term 'DP' is Demographic Parity in AI fairness research and the abbreviation has a different meaning.

Configuration

You need to get approval from the model usage agreement. For example, to use Llama family, you need to get approval via "LLAMA {number} COMMUNITY LICENSE AGREEMENT".

We suggest to use the token file 00_configs/secrets.toml for HF token:

[huggingface]
token = "hf_betrue123..."

you can start with the given 00_configs/dpo.json based on your needs:

model_name: base model (default Llama-3.2-1B-Instruct. If you would like to use a larger model, you may need remote GPUs).
dataset_hf: local JSONL path. This could be switched to another dataset.
output_dir: adapter output path.
load_in_4bit: enable QLoRA quantization.
dpo_beta, learning_rate, num_train_epochs, and more hyperparameters are adjustable.

Usage Guide

Evaluation

Since this project focuses on fine-tuning for a domain abbreviation and SLM that shows a worse performance compared to LLM, we did not integrate a general evaluation metrics. If you plan to expand the domain with greater scale with larger datasets, you should bring up evaluation metrics to measure an overall performance.

Advanced Customization

Swap Base Models

Set model_name in 00_configs/dpo.json for training.

Example (config snippet):

{
  "model_name": "microsoft/Phi-4-mini-instruct"
}

Common alternatives:

Qwen/Qwen3-4B
google/gemma-2b-it

Notes:

Gated models require an HF token and accepted model terms.
Larger models typically need smaller per_device_train_batch_size or max_seq_length; keep load_in_4bit: true for VRAM.
If the model uses different module names, update lora_target_modules accordingly.

Swap Datasets

Point dataset_hf at another JSONL or Hugging Face dataset. Ensure it has instruction, chosen, and rejected.

Prompt Templates

Adjust templates in 02_src/utils/formatting.py for a tailored formatting.

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
00_configs		00_configs
01_data		01_data
02_src		02_src
03_scripts		03_scripts
04_installer		04_installer
.gitignore		.gitignore
.project-metadata.yaml		.project-metadata.yaml
LICENSE		LICENSE
README.md		README.md
catalog-entry.yaml		catalog-entry.yaml
dpo_training.ipynb		dpo_training.ipynb
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DPO + QLoRA Fine-Tuning for Oil & Gas Abbretivation Grounding

Overview

Why DPO?

Project Workflow

Key Features

Hardware check

Repository Layout

Getting Started

Prerequisites

Project Setup

Data Format

Configuration

Usage Guide

Evaluation

Advanced Customization

Swap Base Models

Swap Datasets

Prompt Templates

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DPO + QLoRA Fine-Tuning for Oil & Gas Abbretivation Grounding

Overview

Why DPO?

Project Workflow

Key Features

Hardware check

Repository Layout

Getting Started

Prerequisites

Project Setup

Data Format

Configuration

Usage Guide

Evaluation

Advanced Customization

Swap Base Models

Swap Datasets

Prompt Templates

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages