# NVIDIA Forum Scraper & Fine-tuning Pipeline

Tools to scrape NVIDIA Developer Forum questions, enrich them with a local LLM, and fine-tune GPT-OSS-20B on NVIDIA DGX Spark hardware.

## Features

- **Scraper**: Downloads all questions, with complete thread data, from the NVIDIA Developer Forum.
- **Dataset Creation**: Enriches raw forum threads with a local LLM to create high-quality Q&A pairs.
- **Fine-tuning**: Scripts and Docker configuration to fine-tune GPT-OSS-20B with Unsloth on DGX Spark.
- **Analysis**: Tools to analyze the downloaded forum data.

## Requirements

- Python 3.8+
- Docker (for fine-tuning)
- Access to a local LLM server (e.g., via LM Studio) for dataset enrichment; a quick connectivity check is sketched below
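
If you are unsure whether your local LLM server is reachable, LM Studio (like most local servers) exposes an OpenAI-compatible API, so querying the models endpoint confirms connectivity before you start enrichment. The port below is LM Studio's default and may differ in your setup:

```python
import json, urllib.request

# Connectivity check: an OpenAI-compatible server lists its loaded models
# at /v1/models. Adjust host/port to match the "url" entries in llm_config.json.
with urllib.request.urlopen("http://localhost:1234/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```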

## Installation

1. **Create a virtual environment:**

   ```bash
   python3 -m venv .venv
   source .venv/bin/activate
   ```

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### 1. Scrape Forum Questions

Download questions from the NVIDIA Developer Forum (DGX Spark GB10 category).

```bash
# Basic usage (downloads to all_questions/)
python download_nvidia_forum.py

# Custom output directory
python download_nvidia_forum.py -o my_questions

# Adjust rate limiting (delay in seconds)
python download_nvidia_forum.py -d 2.0
```
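
After a run completes, a quick way to sanity-check the output is to count the downloaded files. This assumes the scraper writes one JSON file per thread into the output directory, which matches the default layout above but is otherwise an assumption:

```python
from pathlib import Path

# Assumes one JSON file per downloaded thread in all_questions/ (see note above).
files = sorted(Path("all_questions").glob("*.json"))
print(f"{len(files)} question files downloaded")
```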

### 2. Create & Enrich Dataset

Convert the downloaded questions into a ShareGPT-style JSON dataset. This step uses a local LLM to clean and summarize the threads.

1. **Configure LLM Servers:**
   Ensure `llm_config.json` is configured with your local LLM endpoints (e.g., LM Studio).

   ```json
   {
     "servers": [
       {
         "url": "http://localhost:1234/v1/chat/completions",
         "model": "gpt-oss-20b",
         "timeout": 300,
         "max_tokens": 3000,
         "temperature": 0.5
       }
     ]
   }
   ```

2. **Run Dataset Creation:**

   ```bash
   python create_dataset.py
   ```

   This will process the JSON files in `all_questions/` and save the enriched dataset to `dataset/nvidia_solved_questions_enriched_llm.json`.
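
ShareGPT-style datasets are conventionally a list of records, each holding a `conversations` array of role-tagged turns. The exact fields emitted by `create_dataset.py` may differ; the snippet below illustrates the general shape, with placeholder values:

```json
[
  {
    "conversations": [
      { "from": "human", "value": "Question text extracted from the forum thread..." },
      { "from": "gpt", "value": "LLM-cleaned answer based on the accepted solution..." }
    ]
  }
]
```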

### 3. Analyze Data (Optional)

Analyze the downloaded questions using `analyze_questions.py`.

```bash
# Show statistics
python analyze_questions.py -s

# Search questions
python analyze_questions.py -q "GPU"
```

## Fine-tuning on DGX Spark

This section explains how to fine-tune GPT-OSS-20B with Unsloth on NVIDIA DGX Spark hardware, using the dataset generated above.

### 1. Build the Docker Image

Use the provided `Dockerfile.dgx_spark` to build the image:

```bash
docker build -f Dockerfile.dgx_spark -t unsloth-dgx-spark .
```

### 2. Launch the Container

Run the container with GPU access and volume mounts (the working directory and your Hugging Face cache are mounted so code, outputs, and downloaded model weights persist across runs):

```bash
docker run -it \
  --gpus=all \
  --net=host \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v $(pwd):$(pwd) \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -w $(pwd) \
  unsloth-dgx-spark
```
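
Before starting a long training run, it is worth confirming that the GPU is actually visible inside the container. This is a standard PyTorch check, not something the repository ships:

```python
import torch

# Should print True and the device name if --gpus=all took effect.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
```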

### 3. Run Fine-tuning

Inside the container, run the fine-tuning script:

```bash
python3 finetune_gpt_oss_spark.py
```

This script will:
1. Load the `unsloth/gpt-oss-20b` model.
2. Load the `dataset/nvidia_solved_questions_enriched_llm.json` dataset.
3. Fine-tune the model using LoRA.
4. Save the fine-tuned adapters to `lora_model/` (a condensed sketch of this flow follows the list).
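
For orientation, the steps above condense to roughly the following. This is a minimal sketch of LoRA fine-tuning with Unsloth and TRL, not the actual contents of `finetune_gpt_oss_spark.py`; the rank, target modules, and trainer settings are illustrative assumptions:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# 1. Load the base model (4-bit to keep the memory footprint manageable).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=2048,
    load_in_4bit=True,
)

# 2. Attach LoRA adapters; rank and target modules here are assumed values.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# 3. Load the enriched dataset and flatten each ShareGPT conversation
#    into a single training string via the model's chat template.
dataset = load_dataset(
    "json",
    data_files="dataset/nvidia_solved_questions_enriched_llm.json",
    split="train",
)

def to_text(example):
    msgs = [{"role": "user" if turn["from"] == "human" else "assistant",
             "content": turn["value"]}
            for turn in example["conversations"]]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

dataset = dataset.map(to_text)

# 4. Train, then save only the adapter weights to lora_model/.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="outputs",
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
    ),
)
trainer.train()
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```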

## Export to Ollama

After fine-tuning, you can export the model to GGUF format and run it locally with Ollama.

### 1. Export to GGUF

Inside the Docker container (where you ran the fine-tuning), run the export script:

```bash
python3 export_to_ollama.py
```

This will:
1. Merge the LoRA adapters with the base model.
2. Convert the merged model to GGUF format (quantized to q4_k_m).
3. Generate a `Modelfile` (a minimal example is shown after this list).
4. Save the output (e.g., `gpt-oss-20b.MXFP4.gguf`) in the current directory.
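
For reference, a Modelfile can be as small as a single `FROM` line pointing at the GGUF file; real ones often add a chat `TEMPLATE` and sampling `PARAMETER`s. The exact file written by `export_to_ollama.py` may differ:

```
FROM ./gpt-oss-20b.MXFP4.gguf
PARAMETER temperature 0.7
```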

### 2. Import to LM Studio

You can also import the GGUF model directly into LM Studio:

```bash
lms import gpt-oss-20b.MXFP4.gguf
```

### 3. Create Ollama Model

Once the GGUF file and Modelfile are generated, you can create the Ollama model, either inside the container (if Ollama is installed) or on your host machine:

```bash
./create_ollama_model.sh
```

This script will:
1. Detect the generated GGUF file and Modelfile.
2. Run `ollama create gpt-oss-spark -f Modelfile`.

### 4. Run the Model

You can now chat with your fine-tuned model:

```bash
ollama run gpt-oss-spark
```
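
Besides the interactive CLI, the model is also reachable through Ollama's local REST API (port 11434 by default); the prompt below is just an example:

```python
import json, urllib.request

# Non-streaming generation request against the locally served model.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "gpt-oss-spark",
        "prompt": "How do I check GPU memory usage on DGX Spark?",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```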

## Cleanup

To remove build artifacts, generated models, and temporary files (while preserving datasets), run:

```bash
./cleanup.sh
```

## License

This project is provided for educational and research purposes. Please respect NVIDIA's terms of service.