# Purpose of this module
This module shows you how to fine-tune an LLM. The code is inspired by this [fine-tuning code](https://github.com/dagster-io/dagster_llm_finetune/tree/main).

Specifically, the code here shows Supervised Fine-Tuning (SFT) for dialogue. This approach trains the model to
respond directly to a question, rather than optimizing over an entire dialogue. SFT is the most common type of fine-tuning,
as the other two options, pre-training for completion and RLHF, require more to work: pre-training requires more computational power,
while RLHF requires higher-quality dialogue data.

This code should work on a regular CPU (in a Docker container), which will allow you to test the code out locally without
any additional setup. The specific approach this code uses is [LoRA](https://arxiv.org/abs/2106.09685) (low-rank adaptation of large language models),
meaning that only a small subset of the LLM's parameters are tweaked; this also helps prevent over-fitting.
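
To give a concrete (if simplified) sense of what LoRA looks like in code, here is a minimal sketch using the
[peft](https://github.com/huggingface/peft) library; the model checkpoint and hyperparameter values below are
illustrative placeholders, not necessarily what this module uses.

```python
# Sketch: wrap a seq2seq model with LoRA adapters via peft so that only the
# small adapter matrices are trained, not the full set of model weights.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # encoder-decoder (seq2seq) fine-tuning
    r=8,                # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor applied to the LoRA updates
    lora_dropout=0.05,  # dropout applied within the LoRA layers
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
```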

Note: if you have issues running this on macOS, reach out; we might be able to help.

## What is fine-tuning?
Fine-tuning is when a pre-trained model, in this context a foundation model, is customized with additional data to
adjust its responses for a specific task. This is a good way to adjust an off-the-shelf, i.e. pre-trained, model to provide
responses that are more contextually relevant to your use case.

## FLAN LLM
This example is based on [Google's Fine-tuned LAnguage Net (FLAN) models hosted on HuggingFace](https://huggingface.co/docs/transformers/model_doc/flan-t5).
The larger the model, the longer it will take to fine-tune and the more memory you'll need for it. By default (and
you can easily change this), the code here is set up to run in Docker using the smallest FLAN model.
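
As a rough sketch of what loading one of these checkpoints looks like with the transformers library (the model id
below is the smallest FLAN-T5 variant; swap in a larger one if you have the memory and time for it):

```python
# Sketch: load a FLAN-T5 checkpoint from the Hugging Face hub and run a single
# generation to sanity-check the base model before any fine-tuning.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-small"  # e.g. flan-t5-base or flan-t5-large for bigger models
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("What is fine-tuning?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```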

## What type of functionality is in this module?

The module uses libraries such as numpy, pandas, plotly, torch, sklearn, peft, evaluate, datasets, and transformers.

It shows a basic process of:

a. Loading data, tokenizing it, and setting up some tokenization parameters.

b. Splitting data into training, validation, and hold-out sets.

c. Fine-tuning the model using LoRA.

d. Evaluating the fine-tuned model using the [rouge metric](https://en.wikipedia.org/wiki/ROUGE_(metric)).

You should be able to read the module top to bottom, which corresponds roughly to the order of execution.
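
As an illustration of the evaluation step (d), here is a minimal sketch of computing ROUGE with the `evaluate`
library; the predictions and references are made up purely for demonstration.

```python
# Sketch: score generated answers against reference answers with ROUGE.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["the meaning of life is 42"]
references = ["42"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # dict with rouge1, rouge2, rougeL, rougeLsum scores
```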

## How might I use this module?
To use this module, you'll need to do the following:

- Data. The dataset should be a list of JSON objects, where each entry in the list is an object that has a "question"
and the corresponding "answer" to that question. You will provide the names of the keys for these fields as inputs when running the code,
e.g. you should be able to do `json.load(f)` and get back a list of dictionaries, something like this:

```python
[
    {
        "question": "What is the meaning of life?",
        "reply": "42"
    },
    {
        "question": "What is Hamilton?",
        "reply": "..."
    },
    ...
]
```

You would then pass in as _inputs_ to execution `"data_path"=PATH_TO_THIS_FILE` as well as `"input_text_key"="question"` and `"output_text_key"="reply"`.
- Instantiate the driver. Use `{"start": "base"}` as the configuration to fine-tune a raw base LLM.
- Pick your LLM. `model_id_tokenizer="google/mt5-small"` is the default, but you can change it to any of the models
that the transformers library supports for `AutoModelForSeq2SeqLM`.
- Run the code.

```python
# instantiate the driver with this module however you want
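# e.g. a minimal sketch, assuming Hamilton's Builder API; `finetuning` is a
# placeholder for whatever you named the module containing this code
from hamilton import driver
import finetuning

dr = driver.Builder().with_modules(finetuning).with_config({"start": "base"}).build()
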
result = dr.execute(
    [  # some suggested outputs
        "save_best_models",
        "hold_out_set_predictions",
        "training_and_validation_set_metrics",
        "finetuned_model_on_validation_set",
    ],
    inputs={
        "data_path": "example-support-dataset.json",  # the path to your dataset
        "input_text_key": "question",  # the key in the json object that has the input text
        "output_text_key": "gpt4_replies_target",  # the key in the json object that has the target output text
    },
)
```
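
Before kicking off a run, it can also be worth sanity-checking that your dataset file loads and contains the keys you
plan to pass as `input_text_key`/`output_text_key`. A quick sketch (the file name and keys follow the example above):

```python
# Sketch: verify the dataset is a list of dicts with the expected keys.
import json

with open("example-support-dataset.json") as f:
    data = json.load(f)

assert isinstance(data, list)
assert all("question" in d and "gpt4_replies_target" in d for d in data)
print(f"{len(data)} question/answer pairs loaded")
```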

### Running the code in a docker container
Unless you have experience with GPUs, the easiest way to run this code is in a Docker container. After writing a
module that uses the code here, e.g. `YOUR_RUN_FILE.py`, you can create a Dockerfile like the one below to execute
your fine-tuning code. Note: replace `example-support-dataset.json` with the dataset that you want to fine-tune on.

```docker
FROM python:3.10-slim-bullseye

WORKDIR /app

# install graphviz backend
RUN apt-get update \
    && apt-get install -y --no-install-recommends graphviz \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# change this to your data set if you want to load
# it into the container
COPY example-support-dataset.json .

COPY . .

EXPOSE 8080

# run the code that you wrote that invokes this module
CMD python YOUR_RUN_FILE.py
```
Then, to build and run it:
```bash
docker build -t YOUR_IMAGE_NAME .
docker run YOUR_IMAGE_NAME
```

# Configuration Options
 - `{"start": "base"}` Suggested configuration for fine-tuning a raw base LLM.
 - `{"start": "presaved"}` Use this if you want to load an already fine-tuned model and then just evaluate it.

# Limitations
The code here will likely not solve all your LLM troubles,
but it can show you how to fine-tune an LLM using parameter-efficient techniques such as LoRA.

This code is currently set up to work with the datasets and transformers libraries. It could be modified to work with other libraries.

The code here is all in a single module; it could be split out to be more modular, e.g. data loading vs. tokenization vs.
fine-tuning vs. evaluation.