Commit 4fa02a1

Adds fine tuning contrib example (#667)
* Adds fine tuning contrib example. This module shows how one can perform fine tuning using Hamilton. It is a basic example using the transformers library that connects to huggingface to pull and fine tune a FLAN model. One should be able to adapt this code to their needs.
* Updates to PR from feedback.
* Adds extra comments for first tokenization functions.
* Updates docker file comment so that people know to change it to match their dataset.
* Bumps version to get 0.0.7 out with latest contrib additions.
1 parent 6475418 commit 4fa02a1

File tree

8 files changed (+881, -1 lines)

```docker
FROM python:3.10-slim-bullseye

WORKDIR /app

# install graphviz backend
RUN apt-get update \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# change this to your data set if you want to load
# it into the container
COPY example-support-dataset.json .

COPY . .

EXPOSE 8080

# run the module
CMD python __init__.py
```
# Purpose of this module
This module shows you how to fine-tune an LLM. The code is inspired by this [fine-tuning code](https://github.com/dagster-io/dagster_llm_finetune/tree/main).

Specifically, the code here shows Supervised Fine-Tuning (SFT) for dialogue. This approach trains the model to respond directly to a question, rather than optimizing over an entire dialogue. SFT is the most common type of fine-tuning, as the other two options, pre-training for completion and RLHF, require more to work: pre-training requires more computational power, while RLHF requires higher-quality dialogue data.

This code should work on a regular CPU (in a Docker container), which will allow you to test the code locally without any additional setup. The specific approach this code uses is [LoRA](https://arxiv.org/abs/2106.09685) (low-rank adaptation of large language models), which means that only a subset of the LLM's parameters is tweaked; this also helps prevent over-fitting.
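The arithmetic behind that claim is easy to check; here is a minimal, library-free sketch (the layer shape and rank below are illustrative, not the module's actual values):

```python
# LoRA replaces a full update of a d x k weight matrix W with two small
# trainable matrices, B (d x r) and A (r x k), applied as W + (alpha/r) * B @ A.
# Only B and A are trained, so the trainable-parameter count drops sharply.
d, k, r = 512, 512, 8  # illustrative layer shape and low rank

full_update_params = d * k       # parameters touched by full fine-tuning: 262144
lora_params = d * r + r * k      # parameters LoRA trains instead: 8192
fraction = lora_params / full_update_params  # ~3% of the layer's parameters
```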
Note: if you have issues running this on MacOS, reach out; we might be able to help.

## What is fine-tuning?
Fine-tuning is when a pre-trained model, in this context a foundational model, is customized with additional data to adjust its responses for a specific task. It is a good way to adjust an off-the-shelf, i.e. pre-trained, model so that it provides responses that are more contextually relevant to your use case.
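Concretely, SFT data preparation usually flattens each (question, answer) record into an (input, target) text pair before tokenization. The module's exact prompt template isn't shown on this page, so the template and function below are a hypothetical illustration:

```python
# Hypothetical prompt formatting for seq2seq SFT; the real module may use a
# different template, and the key names are whatever your dataset uses.
def to_pair(record: dict, input_key: str = "question", output_key: str = "reply"):
    # FLAN-style seq2seq models take a plain instruction string as input
    # and are trained to emit the target string.
    return f"answer the question: {record[input_key]}", record[output_key]

src, tgt = to_pair({"question": "What is Hamilton?", "reply": "..."})
# src == "answer the question: What is Hamilton?"
```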
## FLAN LLM
This example is based on [Google's Fine-tuned LAnguage Net (FLAN) models hosted on Hugging Face](https://huggingface.co/docs/transformers/model_doc/flan-t5).
The larger the model, the longer it will take to fine-tune and the more memory you'll need for it. The code here is set up by default (which you can easily change) to run on Docker using the smallest FLAN model.
## What type of functionality is in this module?

The module uses libraries such as numpy, pandas, plotly, torch, sklearn, peft, evaluate, datasets, and transformers.

It shows a basic process of:

a. Loading data, tokenizing it, and setting up some tokenization parameters.

b. Splitting data into training, validation, and hold-out sets.

c. Fine-tuning the model using LoRA.

d. Evaluating the fine-tuned model using the [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)).

You should be able to read the module top to bottom, which corresponds roughly to the order of execution.
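The module relies on the `evaluate` library for ROUGE; for intuition only, here is a self-contained sketch of the ROUGE-1 F1 idea (unigram overlap between a prediction and a reference):

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    # ROUGE-1 scores the overlap of unigrams between prediction and reference.
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())  # multiset intersection of word counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat", "the cat sat down")  # P=1.0, R=0.75 -> F1 = 6/7
```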
## How might I use this module?
To use this module you'll need to do the following:

- Data. The dataset should be a list of JSON objects, where each entry in the list is an object that has both the "question" and the "answer" to that question. You will provide the names of the keys for these fields as inputs when running the code. That is, you should be able to do `json.load(f)` and get back a list of dictionaries, e.g. something like this:

```python
[
    {
        "question": "What is the meaning of life?",
        "reply": "42"
    },
    {
        "question": "What is Hamilton?",
        "reply": "..."
    },
    ...
]
```

You would then pass in as _inputs_ to execution `"data_path"=PATH_TO_THIS_FILE` as well as `"input_text_key"="question"` and `"output_text_key"="reply"`.
- Instantiate the driver. Use `{"start": "base"}` as the configuration to fine-tune a raw base LLM.
- Pick your LLM. `model_id_tokenizer="google/mt5-small"` is the default, but you can change it to any of the models that the transformers library supports for `AutoModelForSeq2SeqLM`.
- Run the code.

```python
# instantiate the driver with this module however you want
result = dr.execute(
    [  # some suggested outputs
        "save_best_models",
        "hold_out_set_predictions",
        "training_and_validation_set_metrics",
        "finetuned_model_on_validation_set",
    ],
    inputs={
        "data_path": "example-support-dataset.json",  # the path to your dataset
        "input_text_key": "question",  # the key in the json object that has the input text
        "output_text_key": "gpt4_replies_target",  # the key in the json object that has the target output text
    },
)
```
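Before executing, it can help to sanity-check that `data_path` points at a file of the expected shape. A minimal stdlib sketch (the function name and checks are illustrative, not part of the module):

```python
import json

def validate_records(records, input_key: str, output_key: str) -> int:
    # Return the record count; raise if the dataset is not shaped as expected.
    if not isinstance(records, list):
        raise TypeError("dataset must be a JSON list of objects")
    for r in records:
        if input_key not in r or output_key not in r:
            raise KeyError(f"record missing '{input_key}' or '{output_key}': {r}")
    return len(records)

# In practice: validate_records(json.load(open("example-support-dataset.json")), ...)
raw = '[{"question": "What is Hamilton?", "reply": "..."}]'
n = validate_records(json.loads(raw), "question", "reply")  # n == 1
```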

### Running the code in a Docker container
The easiest way to run this code is to use a Docker container, unless you have experience with GPUs. After writing a module that uses the code here, e.g. `YOUR_RUN_FILE.py`, you can create a Dockerfile that looks like this to execute your fine-tuning code. Note: replace `example-support-dataset.json` with the dataset that you want to fine-tune on.

```docker
FROM python:3.10-slim-bullseye

WORKDIR /app

# install graphviz backend
RUN apt-get update \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# change this to your data set if you want to load
# it into the container
COPY example-support-dataset.json .

COPY . .

EXPOSE 8080

# run the code that you wrote that invokes this module
CMD python YOUR_RUN_FILE.py
```
Then to run it:
```bash
docker build -t YOUR_IMAGE_NAME .
docker run YOUR_IMAGE_NAME
```

# Configuration Options
- `{"start": "base"}`: the suggested configuration; fine-tunes a raw base LLM.
- `{"start": "presaved"}`: use this if you want to load an already fine-tuned model and just evaluate it.

# Limitations
The code here will likely not solve all your LLM troubles, but it can show you how to fine-tune an LLM using parameter-efficient techniques such as LoRA.

This code is currently set up to work with the datasets and transformers libraries. It could be modified to work with other libraries.

The code here is all in a single module; it could be split out to be more modular, e.g. data loading vs. tokenization vs. fine-tuning vs. evaluation.
