═══════════════════════════════════════════════════════════════════════════════
                    🎉 GENUINE TRAINING DATASETS READY! 🎉
═══════════════════════════════════════════════════════════════════════════════

✅ Downloaded and ready in: testtraindata/

📦 WHAT YOU GOT:

1. Code Alpaca (1,000 examples)
   • Python code generation from natural language
   • Example: "Write a function to reverse a string" → Python code
   • Perfect for: Coding assistants, code completion, programming tutors

2. SQL Generation (500 examples)
   • Convert English questions to SQL queries
   • Example: "Find users older than 30" → SELECT * FROM users WHERE age > 30
   • Perfect for: Database assistants, BI tools, query helpers

3. Practical Coding (5 curated examples)
   • High-quality algorithm implementations
   • Perfect for: Teaching specific patterns, validation sets
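
Quick sanity check before training (a minimal sketch, not part of LlamaForge itself):
the helper below counts the records in a JSONL file and tallies which field names
appear. The function name summarize_jsonl is hypothetical, and the sketch makes no
assumption about the field names; it only reports what it finds, assuming one JSON
object per line.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Count records and tally which field names appear in a JSONL file."""
    n_records = 0
    field_counts = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)  # assumes each line is one JSON object
            n_records += 1
            field_counts.update(record.keys())
    return n_records, dict(field_counts)


if __name__ == "__main__":
    # Path taken from this file; the check is skipped if the file is missing.
    path = Path("testtraindata/code_alpaca.jsonl")
    if path.exists():
        count, fields = summarize_jsonl(path)
        print(f"{count} records; fields seen: {fields}")
```

If the record count does not match the number listed above, or some fields appear in
only a few records, it is worth inspecting the file before starting a long run.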

═══════════════════════════════════════════════════════════════════════════════

🚀 START TRAINING NOW:

Easiest way (one command):
    ./train_code_assistant.sh

Interactive mode:
    python llamaforge_interactive.py
    # Select codellama:latest
    # Choose testtraindata/code_alpaca.jsonl

Command-line mode:
    python llamaforge.py \
        --model codellama/CodeLlama-7b-hf \
        --data testtraindata/code_alpaca.jsonl \
        --epochs 3
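
Optional: hold out a few examples for checking the result afterwards. This is a
minimal sketch, not a LlamaForge feature. The function name split_jsonl, the output
file names, and the 10% fraction are all assumptions chosen for illustration.

```python
import random
from pathlib import Path


def split_jsonl(src, train_out, val_out, val_fraction=0.1, seed=0):
    """Shuffle the non-empty lines of a JSONL file and write train/validation files."""
    lines = [ln for ln in Path(src).read_text(encoding="utf-8").splitlines() if ln.strip()]
    random.Random(seed).shuffle(lines)  # fixed seed keeps the split reproducible
    n_val = int(len(lines) * val_fraction)
    val_lines, train_lines = lines[:n_val], lines[n_val:]
    Path(val_out).write_text("".join(ln + "\n" for ln in val_lines), encoding="utf-8")
    Path(train_out).write_text("".join(ln + "\n" for ln in train_lines), encoding="utf-8")
    return len(train_lines), len(val_lines)


if __name__ == "__main__":
    src = Path("testtraindata/code_alpaca.jsonl")
    if src.exists():
        # Output file names are hypothetical; pick whatever suits your layout.
        n_train, n_val = split_jsonl(
            src,
            "testtraindata/code_alpaca_train.jsonl",
            "testtraindata/code_alpaca_val.jsonl",
        )
        print(f"train: {n_train} examples, validation: {n_val} examples")
```

The training file written this way can be passed to --data in place of the full
dataset; the held-out file is then available for manual spot checks.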

═══════════════════════════════════════════════════════════════════════════════

💡 WHAT YOU'LL CREATE:

A fine-tuned model that can:
✓ Write Python functions from descriptions
✓ Generate SQL queries from English
✓ Implement algorithms and data structures
✓ Solve coding problems
✓ Create boilerplate code
✓ Explain code snippets

═══════════════════════════════════════════════════════════════════════════════

📚 FULL GUIDE:
   Read: testtraindata/README.md

🎯 RECOMMENDED MODEL:
   codellama:latest or qwen2.5-coder:7b
   (You have both in your Ollama collection!)

⏱️ TRAINING TIME:
   ~2-4 hours on your 20-core CPU

💾 OUTPUT SIZE:
   ~4-5 GB GGUF file
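
Where a figure like this can come from (a back-of-the-envelope sketch under stated
assumptions): file size is roughly parameter count times bits per weight, divided
by 8. The 7 billion parameter count is the nominal size implied by the model name
above; the bits-per-weight values are assumptions, since this file does not say
which quantization is used, and metadata overhead is ignored.

```python
def estimated_gguf_size_gb(n_params, bits_per_weight):
    """Rough model file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # nominal "7B"; the exact count differs per model
    for bits in (4.5, 5.5):  # assumed effective bits per weight
        size = estimated_gguf_size_gb(n_params, bits)
        print(f"{bits} bits/weight -> about {size:.1f} GB")
```

Under these assumptions, 4.5 to 5.5 bits per weight gives roughly 3.9 to 4.8 GB,
which is in line with the estimate above.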

═══════════════════════════════════════════════════════════════════════════════

Ready to create a genuinely useful AI assistant! 🔥