A comprehensive framework for fine-tuning and evaluating Qwen models using Marimo.
Inspired by Eric Livesay's talk on Fine Tuning Qwen at the UTA Python meetup (October 16, 2025).
For more details, see the original notebook: https://github.com/elivesay/elivesay.github.io/blob/main/qwen_finetuning_updated.ipynb
- Interactive Marimo Interface: Modern, reactive UI for fine-tuning and evaluation
- Comprehensive Evaluation: Multiple metrics including ROUGE, BLEU, BERTScore, and custom metrics
- Advanced Training: LoRA fine-tuning with quantization support
- Data Processing: Flexible data loading from files, URLs, and HuggingFace datasets
- Visualization: Training curves, evaluation metrics, and performance analysis
- Model Comparison: Side-by-side comparison of different models
- Automated Reporting: Generate detailed evaluation reports
- Python 3.11+
- PyTorch 2.0+
- Transformers 4.45+
- Marimo 0.8+
- Hugging Face Token (for model downloads and uploads)
- Clone the repository:

```bash
git clone git@github.com:jcamier/qwen-tuning-evals.git
cd QwenEvals
```

- Create and activate a virtual environment:

```bash
uv venv
source .venv/bin/activate
```

- Install dependencies:

```bash
# Using uv (recommended)
uv sync
```

- Set up environment variables:

```bash
cp env.example .env
# Edit .env with your API keys and configuration
```

You'll need a Hugging Face token to download models and potentially upload your fine-tuned models. Here's how to get one:
1. Create a Hugging Face Account:
   - Go to huggingface.co
   - Click "Sign Up" and create a free account

2. Generate an Access Token:
   - Log into your Hugging Face account
   - Go to Settings → Access Tokens
   - Click "New token"
   - Choose "Write" permission (needed for uploading models)
   - Give it a name like "Qwen Fine-tuning"
   - Click "Generate a token"
   - Copy the token immediately (you won't see it again!)

3. Add Token to Your Environment:

   ```bash
   # Edit your .env file
   nano .env

   # Add your token:
   HUGGINGFACE_HUB_TOKEN=hf_your_token_here
   ```

4. Login via CLI (Alternative):

   ```bash
   # Install huggingface_hub if not already installed
   pip install huggingface_hub

   # Login with your token
   huggingface-cli login
   # Enter your token when prompted
   ```
Why you need this token:
- Download Qwen models from Hugging Face Hub
- Upload your fine-tuned models (optional)
- Access gated models if needed
- Avoid rate limits on model downloads
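Once the token is in your `.env`, it can be picked up at runtime. A minimal sketch using `python-dotenv` and `huggingface_hub` (assuming both packages are installed; this mirrors, but may not match, how the framework authenticates internally):

```python
import os

from dotenv import load_dotenv
from huggingface_hub import login

# Read HUGGINGFACE_HUB_TOKEN (and other keys) from .env into the environment
load_dotenv()

# Authenticate this session against the Hugging Face Hub
login(token=os.environ["HUGGINGFACE_HUB_TOKEN"])
```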
Weights & Biases (wandb) is used for experiment tracking, logging training metrics, and visualizing results. Here's how to set it up:
1. Create a Weights & Biases Account:
   - Go to wandb.ai
   - Click "Sign Up" and create a free account
   - Verify your email address

2. Get Your API Key:
   - Log into your wandb account
   - Go to Settings → API Keys
   - Click "Create new key"
   - Give it a name like "Qwen Fine-tuning"
   - Copy the API key immediately (you won't see it again!)

3. Add API Key to Your Environment:

   ```bash
   # Edit your .env file
   nano .env

   # Add your API key:
   WANDB_API_KEY=your_api_key_here
   WANDB_PROJECT=qwen-finetuning-evals
   ```

4. Login via CLI (Alternative):

   ```bash
   # Login with your API key
   wandb login
   # Enter your API key when prompted
   ```
Why you need wandb:
- Track training progress and metrics in real-time
- Visualize loss curves and evaluation metrics
- Compare different model runs
- Share results with team members
- Automatic logging of hyperparameters and results
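For reference, a minimal sketch of what wandb logging looks like in a run (the project name and metric values here are illustrative only):

```python
import wandb

# Start a run; WANDB_API_KEY / WANDB_PROJECT are read from the environment if set
run = wandb.init(project="qwen-finetuning-evals", config={"learning_rate": 2e-5, "epochs": 3})

# Log metrics at each training step
for step in range(3):
    wandb.log({"train/loss": 1.0 / (step + 1)}, step=step)

run.finish()
```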
Create sample data or use your own:
```python
from data_preparation import create_sample_training_data, create_sample_evaluation_data

# Create sample training data
create_sample_training_data("data/training_data.txt")

# Create sample evaluation data
create_sample_evaluation_data("data/evaluation_data.json")
```

Launch the notebook:

```bash
# Easy way - launches everything
python launch.py

# Or run directly
marimo edit qwen_notebook.py
```

The notebook is organized in steps - run each cell one at a time:
- Imports & Setup: Load all required libraries
- Configuration: Review the model and training parameters
- Load Data: Load and chunk your training data
- Load Model: Load Qwen model with LoRA configuration (see the LoRA sketch after this list)
- Tokenize Dataset: Prepare data for training
- Training: Fine-tune the model (takes several minutes)
- Evaluation: Test the fine-tuned model with sample prompts
- Summary: Review your results
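The "Load Model" step pairs the base model with a LoRA adapter. A minimal sketch of that configuration using `peft` (the `target_modules` list is an assumption for Qwen-style attention layers; the notebook's actual settings may differ):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# LoRA hyperparameters mirroring the TrainingConfig defaults (r=8, alpha=16, dropout=0.05)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the small adapter weights are trained
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```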
The framework provides comprehensive evaluation capabilities:
- ROUGE Scores: ROUGE-1, ROUGE-2, ROUGE-L
- BLEU Score: Bilingual Evaluation Understudy
- BERTScore: Contextual embedding-based similarity
- Exact Match: Perfect string matching
- Generation Speed: Tokens per second, generation time
- Single Sample Evaluation: Evaluate individual prompt-response pairs
- Batch Evaluation: Process multiple samples efficiently
- Dataset Evaluation: Comprehensive evaluation on full datasets
- Model Comparison: Compare different models side-by-side
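These metrics can also be computed directly with Hugging Face's `evaluate` library, a common backend for ROUGE/BLEU/BERTScore. A minimal sketch (not necessarily how the framework computes them internally):

```python
import evaluate

predictions = ["Machine learning is a subset of AI."]
references = ["Machine learning is a subset of artificial intelligence."]

# ROUGE-1 / ROUGE-2 / ROUGE-L
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# Corpus-level BLEU (references are a list of reference lists, one per prediction)
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```

The framework exposes the same metrics through `QwenEvaluator`: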
```python
from evaluation_framework import QwenEvaluator, EvaluationConfig

# Initialize evaluator
evaluator = QwenEvaluator("./outputs")

# Configure evaluation
config = EvaluationConfig(
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9
)

# Evaluate single sample
metrics = evaluator.evaluate_single(
    prompt="What is machine learning?",
    reference="Machine learning is a subset of AI...",
    config=config
)

print(f"ROUGE-1: {metrics.rouge_1:.4f}")
print(f"BLEU: {metrics.bleu:.4f}")
```

The framework supports multiple data sources:
- Text Files: Plain text files
- URLs: Web pages and online content
- HuggingFace Datasets: Direct integration with HF datasets
- JSON/CSV: Structured data formats
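For the HuggingFace source, loading typically goes through the `datasets` library. A minimal sketch (the dataset name is only an example):

```python
from datasets import load_dataset

# Load a public dataset from the Hub (any text dataset works)
ds = load_dataset("ag_news", split="train[:100]")
print(ds[0])

# Local JSON/CSV files load through the same API
local = load_dataset("json", data_files="data/processed_data.json")
```

Processing with the framework's own `DataPreprocessor` then looks like this: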
```python
from data_preparation import DataPreprocessor, DataConfig

# Configure data processing
config = DataConfig(
    chunk_size=2000,
    chunk_overlap=200,
    use_chat_template=True
)
preprocessor = DataPreprocessor(config)

# Load and process data
text = preprocessor.load_text_file("data/training_data.txt")
chunks = preprocessor.chunk_text(text)
dataset = preprocessor.create_training_dataset(chunks)

# Validate and save
stats = preprocessor.validate_dataset(dataset)
preprocessor.save_dataset(dataset, "data/processed_data.json")
```

Compare different models or configurations:
```python
# Compare two models
comparison = evaluator.compare_models(
    other_model_path="./other_model",
    prompts=test_prompts,
    references=test_references
)

print(f"ROUGE-1 improvement: {comparison['rouge_1_improvement']:.4f}")
```

Add your own evaluation metrics:
```python
def custom_metric(prediction: str, reference: str) -> float:
    # Your custom metric implementation; here, token-level Jaccard similarity as an example
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)

# Use in evaluation
metrics.custom_score = custom_metric(prediction, reference)
```

Generate comprehensive evaluation reports:
```python
# Generate report
report = evaluator.generate_report(evaluation_results, "evaluation_report.md")

# Create visualizations
plot_files = evaluator.create_visualizations(
    evaluation_results,
    "evaluation_plots/"
)
```

Training defaults are defined in `TrainingConfig`:

```python
@dataclass
class TrainingConfig:
    model_name: str = "Qwen/Qwen3-0.6B"
    epochs: int = 3
    learning_rate: float = 2e-5
    batch_size: int = 2
    max_length: int = 1024
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    use_4bit: bool = False
    use_8bit: bool = False
```

Generation settings for evaluation are defined in `EvaluationConfig`:

```python
@dataclass
class EvaluationConfig:
    max_new_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.9
    top_k: int = 50
    repetition_penalty: float = 1.1
    num_beams: int = 1
    do_sample: bool = True
```

A typical fine-tuning workflow:

```python
# 1. Prepare data
preprocessor = DataPreprocessor()
text = preprocessor.load_text_file("my_data.txt")
chunks = preprocessor.chunk_text(text)
dataset = preprocessor.create_training_dataset(chunks)

# 2. Configure training
config = TrainingConfig(
    model_name="Qwen/Qwen3-0.6B",
    epochs=3,
    learning_rate=2e-5
)

# 3. Train model (using Marimo interface)
# The Marimo app handles the training process
```

A typical evaluation workflow:

```python
# 1. Load evaluation data
eval_data = load_evaluation_dataset("evaluation_data.json")

# 2. Initialize evaluator
evaluator = QwenEvaluator("./fine_tuned_model")

# 3. Run evaluation
results = evaluator.evaluate_dataset(eval_data)

# 4. Generate report
report = evaluator.generate_report(results, "evaluation_report.md")
```

- CUDA Out of Memory: Reduce batch size or use gradient accumulation (see the sketch after this list)
- Model Loading Errors: Ensure you have the correct model path and sufficient disk space
- Evaluation Timeout: Reduce max_new_tokens or use smaller evaluation sets
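For the out-of-memory case, gradient accumulation keeps the effective batch size while lowering per-step memory. A minimal sketch with `transformers.TrainingArguments` (the values are illustrative, and the notebook's trainer setup may differ):

```python
from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps = 8
args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=1,   # small per-step batch to fit in memory
    gradient_accumulation_steps=8,   # accumulate gradients over 8 steps before each update
    learning_rate=2e-5,
    num_train_epochs=3,
)
```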
- Use Quantization: Enable 4-bit or 8-bit quantization for memory efficiency
- Batch Processing: Process evaluations in batches for better performance
- Caching: Cache tokenized data to avoid reprocessing
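A minimal sketch of 4-bit loading with `BitsAndBytesConfig`, which is the usual way a `use_4bit` flag is wired up with transformers + bitsandbytes (the framework's internals may differ):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with bf16 compute, a common memory-saving setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available devices automatically
)
```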
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Eric Livesay for the inspiring talk on Fine Tuning Qwen at the UTA Python meetup
- Qwen Team for the excellent Qwen models
- Marimo Team for the innovative notebook framework
- Hugging Face for the transformers library
For questions and support:
- Open an issue on GitHub
- Check the documentation
- Join our community discussions