3 changes: 3 additions & 0 deletions .gitignore
@@ -148,3 +148,6 @@ dmypy.json
.cache/

.DS_Store

AGENTS.md
CLAUDE.md
143 changes: 140 additions & 3 deletions README.md
@@ -36,6 +36,24 @@ Why xTuring:
pip install xturing
```

### Development Installation

If you want to contribute to xTuring or run from source:

```bash
# Clone the repository
git clone https://github.com/stochasticai/xturing.git
cd xturing

# Install in editable mode with development dependencies
pip install -e .
pip install -r requirements-dev.txt

# Set up pre-commit hooks (required before contributing)
pre-commit install
pre-commit install --hook-type commit-msg
```

<br>

## 🚀 Quickstart
@@ -158,7 +176,7 @@ dataset = InstructionDataset('../llama/alpaca_data')
model = GenericLoraKbitModel('tiiuae/falcon-7b')

# Generate outputs on desired prompts
outputs = model.generate(dataset=dataset, batch_size=10)

```

@@ -173,6 +191,16 @@ model.finetune(dataset=dataset)
```
> See `examples/models/qwen3/qwen3_lora_finetune.py` for a runnable script.

8. __Qwen3-Omni dataset generation__ – Run the multimodal checkpoint locally (download from Hugging Face) to bootstrap instruction corpora without leaving your machine.
```python
from xturing.datasets import InstructionDataset
from xturing.model_apis.qwen import Qwen3OmniTextGenerationAPI

# Download `Qwen/Qwen3-Omni-30B-A3B-Instruct` (or another HF variant) ahead of time
engine = Qwen3OmniTextGenerationAPI(model_name_or_path="Qwen/Qwen3-Omni-30B-A3B-Instruct")
dataset = InstructionDataset.generate_dataset("./tasks.jsonl", engine=engine)
```

We recommend working through the [Llama LoRA INT4 working example](examples/features/int4_finetuning/LLaMA_lora_int4.ipynb) to understand how INT4 fine-tuning is applied in practice.

For further insight, see the [GenericModel working example](examples/features/generic/generic_model.py) in the repository.
@@ -182,9 +210,17 @@ For an extended insight, consider examining the [GenericModel working example](e
## CLI playground
<img src=".github/cli-playground.gif" width="80%" style="margin: 0 1%;"/>

The `xturing` CLI provides interactive tools for working with fine-tuned models:

```bash
$ xturing chat -m "<path-to-model-folder>"
# Chat with a fine-tuned model
xturing chat -m "<path-to-model-folder>"

# Launch the UI playground (alternative to programmatic Playground)
xturing ui

# Get help and see all available commands
xturing --help
```

## UI playground
@@ -210,6 +246,8 @@ Playground().launch() ## launches localhost UI

## 📚 Tutorials
- [Preparing your dataset](examples/datasets/preparing_your_dataset.py)
- [SIFT-50M dataset helpers](examples/datasets/README.md)
- [Qwen3-Omni HF/PEFT template (A100/H100)](examples/models/qwen3_omni/README.md)
- [Cerebras-GPT fine-tuning with LoRA and INT8](examples/models/cerebras/cerebras_lora_int8.ipynb) &ensp; [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eKq3oF7dnK8KuIfsTE70Gvvniwr1O9D0?usp=sharing)
- [Cerebras-GPT fine-tuning with LoRA](examples/models/cerebras/cerebras_lora.ipynb) &ensp; [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VjqQhstm5pT4EjPjx4Je7b3W2X1V3vDo?usp=sharing)
- [LLaMA fine-tuning with LoRA and INT8](examples/models/llama/llama_lora_int8.py) &ensp; [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SQUXq1AMZPSLD4mk3A3swUIc6Y2dclme?usp=sharing)
@@ -250,13 +288,27 @@ Contribute to this by submitting your performance results on other GPUs by creat

## 📎 Fine‑tuned model checkpoints
We have already fine-tuned some models that you can use as a base or start experimenting with.
Here is how to load them:

### Loading Models

**Load from xTuring hub:**
```python
from xturing.models import BaseModel
model = BaseModel.load("x/distilgpt2_lora_finetuned_alpaca")
```

**Load from local directory:**
```python
model = BaseModel.load("/path/to/saved/model")
```

**Create a new model for fine-tuning:**
```python
model = BaseModel.create("llama_lora")
```

### Available Pre-trained Models

| model | dataset | Path |
|---------------------|--------|---------------|
| DistilGPT-2 LoRA | alpaca | `x/distilgpt2_lora_finetuned_alpaca` |
@@ -281,6 +333,7 @@ Below is a list of all the supported models via `BaseModel` class of `xTuring` a
|LLaMA2 | llama2|
|MiniMaxM2 | minimax_m2|
|OPT-1.3B | opt|
|Qwen3-0.6B | qwen3_0_6b|

The above are the base variants. Use these templates for `LoRA`, `INT8`, and `INT8 + LoRA` versions:
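The variant keys follow the template mechanically, so a base key expands into four names. As a sketch (the helper below is illustrative and not part of xTuring's API):

```python
# Illustrative helper expanding a base key into the variant names that follow
# the `<base>`, `<base>_lora`, `<base>_int8`, `<base>_lora_int8` template.
# Not part of xTuring; it only demonstrates the naming scheme.
def variant_keys(base: str) -> list:
    return [base, f"{base}_lora", f"{base}_int8", f"{base}_lora_int8"]

print(variant_keys("llama"))
# ['llama', 'llama_lora', 'llama_int8', 'llama_lora_int8']
```

Any of these keys can then be passed to `BaseModel.create()`.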

@@ -314,17 +367,101 @@ Replace `<model_path>` with a local directory or a Hugging Face model like `face

<br>

## 🧪 Running Tests

The project uses pytest for testing. Test files are located in the `tests/` directory.

Run all tests:
```bash
pytest
```

Run a specific test file:
```bash
pytest tests/xturing/models/test_qwen_model.py
```

Skip slow tests:
```bash
pytest -m "not slow"
```

Skip GPU tests (for CPU-only environments):
```bash
pytest -m "not gpu"
```

Test markers used in this project:
- `@pytest.mark.slow` - Tests that take significant time to run
- `@pytest.mark.gpu` - Tests requiring GPU hardware
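
A minimal sketch of how these markers are applied in a test module (the test names are illustrative, not tests from the xTuring suite):

```python
import pytest

# Illustrative marker usage; the test names are made up for this sketch.
@pytest.mark.slow
def test_full_finetune_run():
    assert True

@pytest.mark.gpu
def test_int8_loading_on_gpu():
    assert True
```

With these markers in place, `pytest -m "not slow"` deselects `test_full_finetune_run` while still running the rest of the suite.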

<br>

## 🤝 Help and Support
If you have any questions, you can create an issue on this repository.

You can also join our [Discord server](https://discord.gg/TgHXuSJEk6) and start a discussion in the `#xturing` channel.

<br>

## 🏗️ Project Structure

Understanding the codebase organization:

```
src/xturing/
├── models/ # Model classes and registry (BaseModel, LLaMA, GPT-2, etc.)
├── engines/ # Low-level model loading, tokenization, and operations
├── datasets/ # Dataset loaders (InstructionDataset, TextDataset)
├── trainers/ # Training loops (LightningTrainer with DeepSpeed support)
├── preprocessors/ # Data preprocessing and tokenization
├── config/ # YAML configurations for finetuning and generation
├── cli/ # CLI commands (chat, ui, api)
├── ui/ # Gradio UI playground
├── self_instruct/ # Dataset generation utilities
└── utils/ # Shared utilities

tests/xturing/ # Test suite mirroring src structure
examples/ # Example scripts organized by model and feature
```

**Key architectural patterns:**
- **Registry Pattern**: Models and engines use a registry-based factory pattern via `BaseModel.create()` and `BaseEngine.create()`
- **Model Variants**: Each model family has multiple variants following the naming template `<base>_[lora]_[int8|kbit]`
- Example: `llama`, `llama_lora`, `llama_int8`, `llama_lora_int8`
- **Configuration**: Training and generation parameters are defined in YAML files per model in `src/xturing/config/`
- **Engines**: Handle the low-level operations (loading weights, tokenization, DeepSpeed integration)
- **Models**: Provide high-level API (`finetune()`, `generate()`, `evaluate()`, `save()`, `load()`)
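
The registry-based factory pattern described above can be sketched as follows (the class names are illustrative and do not reproduce xTuring's real `BaseModel`/`BaseEngine` internals):

```python
# Minimal sketch of a registry-based factory: subclasses register under a
# string key, and create() instantiates by key. Class names are illustrative.
class BaseParent:
    registry = {}

    @classmethod
    def add_to_registry(cls, name, subclass):
        cls.registry[name] = subclass

    @classmethod
    def create(cls, name, *args, **kwargs):
        try:
            return cls.registry[name](*args, **kwargs)
        except KeyError:
            raise ValueError(f"Unknown key: {name!r}") from None

class LlamaLora:
    config_name = "llama_lora"

BaseParent.add_to_registry("llama_lora", LlamaLora)
model = BaseParent.create("llama_lora")
print(model.config_name)  # llama_lora
```

This is why a single string like `"llama_lora"` is enough to select a model family, its adapter, and its quantization variant.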

<br>

## 📝 License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

<br>

## 🌎 Contributing
As an open source project in a rapidly evolving field, we welcome contributions of all kinds, including new features and better documentation. Please read our [contributing guide](CONTRIBUTING.md) to learn how you can get involved.

### Quick Contribution Guidelines

**Important:** All pull requests should target the `dev` branch, not `main`.

The project uses pre-commit hooks to enforce code quality:
- **black** - Code formatting
- **isort** - Import sorting (black profile)
- **autoflake** - Remove unused imports
- **absolufy-imports** - Convert relative to absolute imports
- **gitlint** - Commit message linting

You can manually format code:
```bash
black src/ tests/
isort src/ tests/
```

Pre-commit hooks will automatically run these checks when you commit. Make sure to install them:
```bash
pre-commit install
pre-commit install --hook-type commit-msg
```
10 changes: 10 additions & 0 deletions docs/docs/advanced/generate.md
@@ -41,6 +41,16 @@ engine = Davinci("your-api-key")
engine = ClaudeSonnet("your-api-key")
```

</TabItem>
<TabItem value="qwen" label="Qwen3-Omni (local)">

Download the desired checkpoint from [Hugging Face](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) (or point to a local directory) and load it directly.

```python
from xturing.model_apis.qwen import Qwen3OmniTextGenerationAPI
engine = Qwen3OmniTextGenerationAPI(model_name_or_path="Qwen/Qwen3-Omni-30B-A3B-Instruct")
```

</TabItem>
</Tabs>

4 changes: 4 additions & 0 deletions examples/README.md
@@ -16,6 +16,10 @@ examples/

### datasets/
This directory consists of multiple ways to generate your custom dataset from a given set of examples.
Also includes SIFT-50M helpers:
- `examples/datasets/sift50m_subset_builder.py` builds a small English subset.
- `examples/datasets/sift50m_audio_mapper.py` resolves `audio_path` to local files.
- `examples/datasets/README.md` contains full CLI recipes.

### features/
This directory consists of files with examples highlighting specific major features of the library, which can be replicated with any LLM you want.
58 changes: 58 additions & 0 deletions examples/datasets/README.md
@@ -0,0 +1,58 @@
# Datasets

This folder includes dataset helpers and recipes used by xTuring examples.

## SIFT-50M helpers (English subsets)

### 1) Build a small English subset

Filters `amazon-agi/SIFT-50M` to English examples in the following categories:
- `closed_ended_content_level`
- `open_ended`
- optional `controllable_generation`

```bash
python examples/datasets/sift50m_subset_builder.py \
--output-dir ./data/sift50m_en_small \
--max-examples 100000 \
--include-controllable-generation \
--jsonl
```

Notes:
- Use `--language-col` or `--category-col` if the dataset schema changes.
- Set `--max-examples 0` to keep all rows after filtering.

### 2) Resolve audio paths to local files

SIFT-50M includes `audio_path` and (often) `data_source`. This script adds a
resolved `audio_file` column and can drop rows with missing files.

```bash
python examples/datasets/sift50m_audio_mapper.py \
--input-dir ./data/sift50m_en_small \
--output-dir ./data/sift50m_en_small_mapped \
--audio-root mls=/data/mls \
--audio-root cv15=/data/commonvoice15 \
--audio-root vctk=/data/vctk \
--verify-exists \
--drop-missing \
--jsonl
```

If your dataset uses different columns:

```bash
python examples/datasets/sift50m_audio_mapper.py \
--input-dir ./data/sift50m_en_small \
--output-dir ./data/sift50m_en_small_mapped \
--audio-path-col audio_path \
--data-source-col data_source
```

## Outputs

Each script writes:
- a Hugging Face dataset directory (via `save_to_disk`)
- `subset.jsonl` (if `--jsonl` is set)
- a `*_meta.json` file with the filter settings and detected columns
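
The `subset.jsonl` output is plain JSON Lines, so it can be inspected with the standard library alone. A self-contained round-trip sketch (the record fields below are made up for illustration, not the dataset's actual schema):

```python
import json
import os
import tempfile

# Write and re-read a tiny JSON Lines file to show the round-trip;
# the record content is invented for this example.
records = [{"instruction": "Transcribe the clip", "audio_path": "mls/123.flac"}]
path = os.path.join(tempfile.mkdtemp(), "subset.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded == records)  # True
```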