3 changes: 3 additions & 0 deletions .gitignore
@@ -148,3 +148,6 @@ dmypy.json
.cache/

.DS_Store

AGENTS.md
CLAUDE.md
143 changes: 140 additions & 3 deletions README.md
@@ -36,6 +36,24 @@ Why xTuring:
pip install xturing
```

### Development Installation

If you want to contribute to xTuring or run from source:

```bash
# Clone the repository
git clone https://github.com/stochasticai/xturing.git
cd xturing

# Install in editable mode with development dependencies
pip install -e .
pip install -r requirements-dev.txt

# Set up pre-commit hooks (required before contributing)
pre-commit install
pre-commit install --hook-type commit-msg
```

<br>

## 🚀 Quickstart
@@ -158,7 +176,7 @@ dataset = InstructionDataset('../llama/alpaca_data')
model = GenericLoraKbitModel('tiiuae/falcon-7b')

# Generate outputs on desired prompts
outputs = model.generate(dataset=dataset, batch_size=10)

```

@@ -173,6 +191,16 @@ model.finetune(dataset=dataset)
```
> See `examples/models/qwen3/qwen3_lora_finetune.py` for a runnable script.

8. __Qwen3-Omni dataset generation__ – Run the multimodal checkpoint locally (download from Hugging Face) to bootstrap instruction corpora without leaving your machine.
```python
from xturing.datasets import InstructionDataset
from xturing.model_apis.qwen import Qwen3OmniTextGenerationAPI

# Download `Qwen/Qwen3-Omni-30B-A3B-Instruct` (or another HF variant) ahead of time
engine = Qwen3OmniTextGenerationAPI(model_name_or_path="Qwen/Qwen3-Omni-30B-A3B-Instruct")
dataset = InstructionDataset.generate_dataset("./tasks.jsonl", engine=engine)
```

We recommend working through the [Llama LoRA INT4 working example](examples/features/int4_finetuning/LLaMA_lora_int4.ipynb) to understand how INT4 fine-tuning is applied in practice.

For further insight, see the [GenericModel working example](examples/features/generic/generic_model.py) in the repository.
@@ -182,9 +210,17 @@ For an extended insight, consider examining the [GenericModel working example](e
## CLI playground
<img src=".github/cli-playground.gif" width="80%" style="margin: 0 1%;"/>

The `xturing` CLI provides interactive tools for working with fine-tuned models:

```bash
$ xturing chat -m "<path-to-model-folder>"
# Chat with a fine-tuned model
xturing chat -m "<path-to-model-folder>"

# Launch the UI playground (alternative to programmatic Playground)
xturing ui

# Get help and see all available commands
xturing --help
```

## UI playground
@@ -210,6 +246,8 @@ Playground().launch() ## launches localhost UI

## 📚 Tutorials
- [Preparing your dataset](examples/datasets/preparing_your_dataset.py)
- [SIFT-50M dataset helpers](examples/datasets/README.md)
- [Qwen3-Omni HF/PEFT template (A100/H100)](examples/models/qwen3_omni/README.md)
- [Cerebras-GPT fine-tuning with LoRA and INT8](examples/models/cerebras/cerebras_lora_int8.ipynb) &ensp; [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eKq3oF7dnK8KuIfsTE70Gvvniwr1O9D0?usp=sharing)
- [Cerebras-GPT fine-tuning with LoRA](examples/models/cerebras/cerebras_lora.ipynb) &ensp; [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VjqQhstm5pT4EjPjx4Je7b3W2X1V3vDo?usp=sharing)
- [LLaMA fine-tuning with LoRA and INT8](examples/models/llama/llama_lora_int8.py) &ensp; [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SQUXq1AMZPSLD4mk3A3swUIc6Y2dclme?usp=sharing)
@@ -250,13 +288,27 @@ Contribute to this by submitting your performance results on other GPUs by creat

## 📎 Fine‑tuned model checkpoints
We have already fine-tuned some models that you can use as a base or start experimenting with.
Here is how to load them:

### Loading Models

**Load from xTuring hub:**
```python
from xturing.models import BaseModel
model = BaseModel.load("x/distilgpt2_lora_finetuned_alpaca")
```

**Load from local directory:**
```python
model = BaseModel.load("/path/to/saved/model")
```

**Create a new model for fine-tuning:**
```python
model = BaseModel.create("llama_lora")
```

### Available Pre-trained Models

| model | dataset | Path |
|---------------------|--------|---------------|
| DistilGPT-2 LoRA | alpaca | `x/distilgpt2_lora_finetuned_alpaca` |
@@ -281,6 +333,7 @@ Below is a list of all the supported models via `BaseModel` class of `xTuring` a
|LLaMA2 | llama2|
|MiniMaxM2 | minimax_m2|
|OPT-1.3B | opt|
|Qwen3-0.6B | qwen3_0_6b|

The above are the base variants. Use these templates for `LoRA`, `INT8`, and `INT8 + LoRA` versions:
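The variant keys follow the template mechanically, so a base key expands into four names. As a sketch (the helper below is illustrative and not part of xTuring's API):

```python
# Illustrative helper expanding a base key into the variant names that follow
# the `<base>`, `<base>_lora`, `<base>_int8`, `<base>_lora_int8` template.
# Not part of xTuring; it only demonstrates the naming scheme.
def variant_keys(base: str) -> list:
    return [base, f"{base}_lora", f"{base}_int8", f"{base}_lora_int8"]

print(variant_keys("llama"))
# ['llama', 'llama_lora', 'llama_int8', 'llama_lora_int8']
```

Any of these keys can then be passed to `BaseModel.create()`.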

@@ -314,17 +367,101 @@ Replace `<model_path>` with a local directory or a Hugging Face model like `face

<br>

## 🧪 Running Tests

The project uses pytest for testing. Test files are located in the `tests/` directory.

Run all tests:
```bash
pytest
```

Run a specific test file:
```bash
pytest tests/xturing/models/test_qwen_model.py
```

Skip slow tests:
```bash
pytest -m "not slow"
```

Skip GPU tests (for CPU-only environments):
```bash
pytest -m "not gpu"
```

Test markers used in this project:
- `@pytest.mark.slow` - Tests that take significant time to run
- `@pytest.mark.gpu` - Tests requiring GPU hardware
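
A minimal sketch of how these markers are applied in a test module (the test names are illustrative, not tests from the xTuring suite):

```python
import pytest

# Illustrative marker usage; the test names are made up for this sketch.
@pytest.mark.slow
def test_full_finetune_run():
    assert True

@pytest.mark.gpu
def test_int8_loading_on_gpu():
    assert True
```

With these markers in place, `pytest -m "not slow"` deselects `test_full_finetune_run` while still running the rest of the suite.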

<br>

## 🤝 Help and Support
If you have any questions, you can create an issue on this repository.

You can also join our [Discord server](https://discord.gg/TgHXuSJEk6) and start a discussion in the `#xturing` channel.

<br>

## 🏗️ Project Structure

Understanding the codebase organization:

```
src/xturing/
├── models/ # Model classes and registry (BaseModel, LLaMA, GPT-2, etc.)
├── engines/ # Low-level model loading, tokenization, and operations
├── datasets/ # Dataset loaders (InstructionDataset, TextDataset)
├── trainers/ # Training loops (LightningTrainer with DeepSpeed support)
├── preprocessors/ # Data preprocessing and tokenization
├── config/ # YAML configurations for finetuning and generation
├── cli/ # CLI commands (chat, ui, api)
├── ui/ # Gradio UI playground
├── self_instruct/ # Dataset generation utilities
└── utils/ # Shared utilities

tests/xturing/ # Test suite mirroring src structure
examples/ # Example scripts organized by model and feature
```

**Key architectural patterns:**
- **Registry Pattern**: Models and engines use a registry-based factory pattern via `BaseModel.create()` and `BaseEngine.create()`
- **Model Variants**: Each model family has multiple variants following the naming template `<base>_[lora]_[int8|kbit]`
- Example: `llama`, `llama_lora`, `llama_int8`, `llama_lora_int8`
- **Configuration**: Training and generation parameters are defined in YAML files per model in `src/xturing/config/`
- **Engines**: Handle the low-level operations (loading weights, tokenization, DeepSpeed integration)
- **Models**: Provide high-level API (`finetune()`, `generate()`, `evaluate()`, `save()`, `load()`)
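
The registry-based factory pattern described above can be sketched as follows (the class names are illustrative and do not reproduce xTuring's real `BaseModel`/`BaseEngine` internals):

```python
# Minimal sketch of a registry-based factory: subclasses register under a
# string key, and create() instantiates by key. Class names are illustrative.
class BaseParent:
    registry = {}

    @classmethod
    def add_to_registry(cls, name, subclass):
        cls.registry[name] = subclass

    @classmethod
    def create(cls, name, *args, **kwargs):
        try:
            return cls.registry[name](*args, **kwargs)
        except KeyError:
            raise ValueError(f"Unknown key: {name!r}") from None

class LlamaLora:
    config_name = "llama_lora"

BaseParent.add_to_registry("llama_lora", LlamaLora)
model = BaseParent.create("llama_lora")
print(model.config_name)  # llama_lora
```

This is why a single string like `"llama_lora"` is enough to select a model family, its adapter, and its quantization variant.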

<br>

## 📝 License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

<br>

## 🌎 Contributing
As an open source project in a rapidly evolving field, we welcome contributions of all kinds, including new features and better documentation. Please read our [contributing guide](CONTRIBUTING.md) to learn how you can get involved.

### Quick Contribution Guidelines

**Important:** All pull requests should target the `dev` branch, not `main`.

The project uses pre-commit hooks to enforce code quality:
- **black** - Code formatting
- **isort** - Import sorting (black profile)
- **autoflake** - Remove unused imports
- **absolufy-imports** - Convert relative to absolute imports
- **gitlint** - Commit message linting

You can manually format code:
```bash
black src/ tests/
isort src/ tests/
```

Pre-commit hooks will automatically run these checks when you commit. Make sure to install them:
```bash
pre-commit install
pre-commit install --hook-type commit-msg
```
10 changes: 10 additions & 0 deletions docs/docs/advanced/generate.md
@@ -41,6 +41,16 @@ engine = Davinci("your-api-key")
engine = ClaudeSonnet("your-api-key")
```

</TabItem>
<TabItem value="qwen" label="Qwen3-Omni (local)">

Download the desired checkpoint from [Hugging Face](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) (or point to a local directory) and load it directly.

```python
from xturing.model_apis.qwen import Qwen3OmniTextGenerationAPI
engine = Qwen3OmniTextGenerationAPI(model_name_or_path="Qwen/Qwen3-Omni-30B-A3B-Instruct")
```

</TabItem>
</Tabs>

4 changes: 4 additions & 0 deletions examples/README.md
@@ -16,6 +16,10 @@ examples/

### datasets/
This directory consists of multiple ways to generate your custom dataset from a given set of examples.
Also includes SIFT-50M helpers:
- `examples/datasets/sift50m_subset_builder.py` builds a small English subset.
- `examples/datasets/sift50m_audio_mapper.py` resolves `audio_path` to local files.
- `examples/datasets/README.md` contains full CLI recipes.

### features/
This directory consists of files with examples highlighting specific major features of the library, which can be replicated with any LLM you want.
58 changes: 58 additions & 0 deletions examples/datasets/README.md
@@ -0,0 +1,58 @@
# Datasets

This folder includes dataset helpers and recipes used by xTuring examples.

## SIFT-50M helpers (English subsets)

### 1) Build a small English subset

Filters `amazon-agi/SIFT-50M` to English examples in the following categories:
- `closed_ended_content_level`
- `open_ended`
- optional `controllable_generation`

```bash
python examples/datasets/sift50m_subset_builder.py \
--output-dir ./data/sift50m_en_small \
--max-examples 100000 \
--include-controllable-generation \
--jsonl
```

Notes:
- Use `--language-col` or `--category-col` if the dataset schema changes.
- Set `--max-examples 0` to keep all rows after filtering.

### 2) Resolve audio paths to local files

SIFT-50M includes `audio_path` and (often) `data_source`. This script adds a
resolved `audio_file` column and can drop rows with missing files.

```bash
python examples/datasets/sift50m_audio_mapper.py \
--input-dir ./data/sift50m_en_small \
--output-dir ./data/sift50m_en_small_mapped \
--audio-root mls=/data/mls \
--audio-root cv15=/data/commonvoice15 \
--audio-root vctk=/data/vctk \
--verify-exists \
--drop-missing \
--jsonl
```

If your dataset uses different columns:

```bash
python examples/datasets/sift50m_audio_mapper.py \
--input-dir ./data/sift50m_en_small \
--output-dir ./data/sift50m_en_small_mapped \
--audio-path-col audio_path \
--data-source-col data_source
```

## Outputs

Each script writes:
- a Hugging Face dataset directory (via `save_to_disk`)
- `subset.jsonl` (if `--jsonl` is set)
- a `*_meta.json` file with the filter settings and detected columns
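
The `subset.jsonl` output is plain JSON Lines, so it can be inspected with the standard library alone. A self-contained round-trip sketch (the record fields below are made up for illustration, not the dataset's actual schema):

```python
import json
import os
import tempfile

# Write and re-read a tiny JSON Lines file to show the round-trip;
# the record content is invented for this example.
records = [{"instruction": "Transcribe the clip", "audio_path": "mls/123.flac"}]
path = os.path.join(tempfile.mkdtemp(), "subset.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded == records)  # True
```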