
Commit 9ec145f

Adds fine tuning contrib example
This module shows how one can perform fine-tuning using Hamilton. It is a basic example that uses the transformers library to pull a FLAN model from Hugging Face and fine-tune it. One should be able to adapt this code to their needs.
1 parent bab44d3 commit 9ec145f

7 files changed: +826, -0 lines changed
Dockerfile

FROM python:3.10-slim-bullseye

WORKDIR /app

# install graphviz backend
RUN apt-get update \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY example-support-dataset.json .

COPY . .

EXPOSE 8080:8080

# run the module
CMD python __init__.py
README.md

# Purpose of this module
This module shows you how to fine-tune an LLM. This code is inspired by this [fine-tuning code](https://github.com/dagster-io/dagster_llm_finetune/tree/main).

Specifically, the code here shows Supervised Fine-Tuning (SFT) for dialogue. This approach trains the model to respond usefully to a direct question, rather than optimizing over an entire dialogue. SFT is the most common type of fine-tuning, as the other two options, pre-training for completion and RLHF, require more to work: pre-training requires more computational power, while RLHF requires higher-quality dialogue data.

This code should work on a regular CPU (in a Docker container), which will allow you to test the code locally without any additional setup. The specific technique this code uses is [LoRA](https://arxiv.org/abs/2106.09685) (low-rank adaptation of large language models), which means that only a subset of the LLM's parameters is tweaked, which helps prevent over-fitting.
## What is fine-tuning?
Fine-tuning is when a pre-trained model, in this context a foundational model, is customized with additional data to adjust its responses for a specific task. This is a good way to adjust an off-the-shelf, i.e. pretrained, model so that it provides responses that are more contextually relevant to your use case.

## FLAN LLM
This example is based on using [Google's Fine-tuned LAnguage Net (FLAN) models hosted on HuggingFace](https://huggingface.co/docs/transformers/model_doc/flan-t5). The larger the model, the longer it will take to fine-tune and the more memory you'll need for it. The code here is set up by default (which you can easily change) to run on Docker using the smallest FLAN model.
## What type of functionality is in this module?

The module uses libraries such as numpy, pandas, plotly, torch, sklearn, peft, evaluate, datasets, and transformers.

It shows a basic process of:

a. Loading the data, tokenizing it, and setting up some tokenization parameters.

b. Splitting the data into training, validation, and inference sets.

c. Fine-tuning the model using LoRA.

d. Evaluating the fine-tuned model using the [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)).

You should be able to read the module top to bottom, which corresponds roughly to the order of execution.
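To make steps (a) and (c) concrete, here is a minimal sketch of what tokenizing an example and wrapping a model with a LoRA adapter typically look like with the transformers and peft libraries. This is illustrative only, not this module's exact code; the model name and hyperparameters are placeholders you would adjust:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-small"  # placeholder; any AutoModelForSeq2SeqLM-compatible model works
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# (a) tokenize one question/answer pair
model_inputs = tokenizer("What is Hamilton?", max_length=512, truncation=True, return_tensors="pt")
labels = tokenizer(text_target="A dataflow framework...", max_length=128, truncation=True, return_tensors="pt")

# (c) wrap the base model with a LoRA adapter so only small low-rank matrices get trained
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,            # rank of the adapter matrices (illustrative)
    lora_alpha=32,  # scaling factor (illustrative)
    lora_dropout=0.05,
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # prints the (small) fraction of weights that are trainable
```

From there, the adapter-wrapped model can be trained and evaluated with the usual transformers training utilities, e.g. `Seq2SeqTrainer`.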
## How might I use this module?
To use this module you'll need to do the following:

- Data. The data set should be a list of JSON objects, where each entry in the list is an object that has the "question" and the "answer" to that question. You will provide the names of the keys for these fields as input to run the code. I.e. you should be able to do `json.load(f)` and get back a list of dictionaries, something like this:

```python
[
    {
        "question": "What is the meaning of life?",
        "reply": "42"
    },
    {
        "question": "What is Hamilton?",
        "reply": "..."
    },
    ...
]
```

You would then pass in as _inputs_ to execution `"data_path"=PATH_TO_THIS_FILE` as well as `"input_text_key"="question"` and `"output_text_key"="reply"`. (A quick sanity check for this file is sketched after this list.)
- Instantiate the driver. Use `{"start": "base"}` as the configuration to fine-tune a raw base LLM (a minimal driver setup is sketched after the execution example below).
- Pick your LLM. `model_id_tokenizer="google/mt5-small"` is the default, but you can change it to any of the models that the transformers library supports for `AutoModelForSeq2SeqLM` models.
- Run the code.
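Before running, an optional sanity check that your JSON file has the expected shape might look like this sketch (the file name and key names are the example's; substitute your own):

```python
import json

# placeholder path: whatever you will pass as "data_path"
with open("example-support-dataset.json") as f:
    records = json.load(f)

assert isinstance(records, list) and records, "expected a non-empty list of objects"
# the key names here match the example above; use the keys your data actually has
assert all("question" in r and "reply" in r for r in records), "missing expected keys"
print(f"loaded {len(records)} question/answer pairs")
```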
Because there's no configuration that changes the shape of the DAG, you can run the code like this:

```python
# instantiate the driver with this module however you want
result = dr.execute(
    [  # some suggested outputs
        "save_best_models",
        "inference_set_predictions",
        "training_and_validation_set_metrics",
        "finetuned_model_on_validation_set",
    ],
    inputs={
        "data_path": "example-support-dataset.json",  # the path to your dataset
        "input_text_key": "question",  # the key in the json object that has the input text
        "output_text_key": "gpt4_replies_target",  # the key in the json object that has the target output text
    },
)
```
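For reference, `dr` above is a Hamilton driver. One way to construct it might look like the following sketch, assuming you've saved this module locally and can import it as `fine_tuning` (a placeholder name):

```python
from hamilton import driver

import fine_tuning  # placeholder: the import path for wherever you saved this module

dr = (
    driver.Builder()
    .with_modules(fine_tuning)
    .with_config({"start": "base"})  # see Configuration Options below
    .build()
)
```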
### Running the code in a docker container
The easiest way to run this code is to use a Docker container, unless you have experience with GPUs. After writing a module that uses the code here, e.g. `YOUR_RUN_FILE.py`, you can create a Dockerfile like the one below to execute your fine-tuning code. Note: replace `example-support-dataset.json` with the dataset that you want to fine-tune on.
```docker
FROM python:3.10-slim-bullseye

WORKDIR /app

# install graphviz backend
RUN apt-get update \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY example-support-dataset.json .

COPY . .

EXPOSE 8080:8080

# run the code that you wrote that invokes this module
CMD python YOUR_RUN_FILE.py
```
Then to run this it's just:

```bash
docker build -t YOUR_IMAGE_NAME .
docker run YOUR_IMAGE_NAME
```
# Configuration Options
- `{"start": "base"}`: the suggested configuration; fine-tunes a raw base LLM.
- `{"start": "presaved"}`: use this if you want to load an already fine-tuned model and then just evaluate it.
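With `{"start": "presaved"}` the main work is evaluation. As a rough illustration (not this module's exact code), scoring predictions with the ROUGE metric via the `evaluate` library typically looks like this:

```python
import evaluate

rouge = evaluate.load("rouge")  # needs the rouge_score package installed
scores = rouge.compute(
    predictions=["the answer is 42"],               # model outputs (illustrative)
    references=["the answer to everything is 42"],  # target texts (illustrative)
)
print(scores)  # dict of rouge1 / rouge2 / rougeL / rougeLsum scores
```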
# Limitations
The code here cannot guarantee groundbreaking performance for your specific use case, but it can show you how to fine-tune an LLM using parameter-efficient techniques such as LoRA.

This code is currently set up to work with the datasets and transformers libraries. It could be modified to work with other libraries.

The code here is all in a single module; it could be split out to be more modular, e.g. data loading vs. tokenization vs. fine-tuning vs. evaluation.
