Commit de2e095: Restructure task
1 parent b884e89 · 22 files changed · 3,607 additions, 352 deletions

README.md: 34 additions, 43 deletions
@@ -189,7 +189,7 @@ filtered = extractor.extract(
print(filtered)  # Only the relevant lines
```

Both model types use the same `extract()` API: the generative model returns the relevant lines in XML tags, while the encoder classifies each line directly. Both return filtered text.

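The XML-tag return format can be post-processed with the standard library alone. A minimal sketch, assuming the generative model wraps a comma-separated list of 1-based line numbers in a `<relevant_lines>` tag — the tag name and payload format here are illustrative assumptions, not the package's documented schema:

```python
import re

def filter_by_tagged_lines(model_output: str, original: str) -> str:
    """Keep only the lines of `original` whose 1-based numbers appear
    inside <relevant_lines>...</relevant_lines> in `model_output`."""
    match = re.search(r"<relevant_lines>(.*?)</relevant_lines>", model_output, re.DOTALL)
    if match is None:
        return ""  # model judged nothing relevant
    wanted = {int(n) for n in re.findall(r"\d+", match.group(1))}
    lines = original.splitlines()
    return "\n".join(line for i, line in enumerate(lines, start=1) if i in wanted)

out = filter_by_tagged_lines(
    "<relevant_lines>1, 3</relevant_lines>",
    "def f():\n    pass\nprint(f())",
)
# → "def f():\nprint(f())"
```

Because the filter only copies lines from the original text, a malformed model reply degrades to an empty result rather than invented output.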
### Configuration

@@ -247,7 +247,7 @@ Also works with other coding agents (Codex CLI, OpenCode, etc.) via their equiva
python scripts/download_data.py
```

This pulls the [tool output extraction dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) (8,241 train + 252 dev + 557 test samples) from HuggingFace.

### 2a. Train generative model (Qwen + LoRA)

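The LoRA defaults noted below (r=16, alpha=32) can be made concrete with a toy NumPy sketch of the low-rank update. The dimensions are arbitrary and the code is illustrative only, not the repo's training code:

```python
import numpy as np

# LoRA reparameterization: rather than updating the full weight W,
# train a low-rank pair (A, B) and add the scaled product to W.
# r and alpha follow the README's stated defaults (r=16, alpha=32).
r, alpha = 16, 32
d_out, d_in = 64, 64                      # toy dimensions, not the real model's
W = np.random.randn(d_out, d_in) * 0.02   # frozen base weight
A = np.random.randn(r, d_in) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

W_eff = W + (alpha / r) * (B @ A)         # effective weight at inference

# With B zero-initialized, the adapter starts as an exact no-op:
assert np.allclose(W_eff, W)
```

Zero-initializing B is the standard LoRA trick: training starts from the base model's behavior, and only the r·(d_in + d_out) adapter parameters are updated.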
@@ -262,8 +262,8 @@ Default: Qwen 3.5 2B with LoRA (r=16, alpha=32). See `configs/default.yaml` for

### 2b. Train encoder model (mmBERT)

```bash
# Prepare encoder-format data from the downloaded splits
python scripts/prepare_encoder_data.py --data-dir data

# Train the encoder
python -m squeez.encoder.train \
@@ -295,51 +295,42 @@ Both produce the same metrics format (span F1, ROUGE-L, compression ratio) for d

Training data: [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)

298-
| | Count |
299-
|---|---|
300-
| Train samples | 7,148 |
301-
| Eval samples | 436 |
302-
| With relevant lines | 3,985 (53%) |
303-
| Empty (not relevant) | 3,599 (47%) |
304-
| Avg compression | 86% |
305-
306-
Built from 2,294 [SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) instances with real tool execution (git grep, git blame, pytest, ruff, etc.) against 12 repos. Teacher distillation by gpt-oss-120b on Groq.
307-
308-
### Tool types
309-
310-
| Tool Type | Count |
311-
|---|---|
312-
| read_file | 4,309 |
313-
| git_log | 840 |
314-
| grep | 575 |
315-
| build_output | 380 |
316-
| ls | 376 |
317-
| test_output | 344 |
318-
| python | 310 |
319-
| git_blame | 201 |
320-
| lint_output | 101 |
321-
| curl | 95 |
322-
| git_diff | 53 |
323-
324-
## How It Works
325-
326-
1. **Source**: SWE-bench test split (2,294 real GitHub issues)
327-
2. **Tool calls**: 3-7 synthetic tool calls per instance
328-
3. **Real execution**: All commands run against bare-cloned repos at the correct commit
329-
4. **Teacher distillation**: gpt-oss-120b selects relevant line ranges via JSON spans
330-
5. **Zero-hallucination extraction**: Teacher spans matched against original output — no generated text
331-
6. **Assembly**: Extracted lines formatted as `{"relevant_lines": [...]}` for SFT training
298+
| | Train | Dev | Test | Total |
299+
|---|---:|---:|---:|---:|
300+
| Samples | 8,241 | 252 | 557 | 9,050 |
301+
302+
Three data sources covering 30 tool types across multiple ecosystems:
303+
304+
- **SWE-bench real data** (5,936) — Real tool output from `git grep`, `pytest`, `pip install`, `mypy`, etc. executed on 2,294 cloned Python repos. Labeled by teacher LLM distillation with grounded line spans.
305+
- **Synthetic multi-ecosystem** (2,039) — LLM-generated tool output for npm, TypeScript, Rust, Go, Java, Docker, Terraform, kubectl, and more.
306+
- **Synthetic SWE-style** (1,075) — LLM-generated versions of Python tool types that had high noise rates in the real data.
307+
308+
Test set is manually curated. See the [dataset card](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) for full details on generation, filtering, and curation.
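"Grounded line spans" means the teacher only proposes (start, end) line ranges, and the label is assembled verbatim from those lines of the original tool output, so no generated text can leak into the training targets. A minimal sketch of that extraction step — the helper name and span format are assumptions, not the pipeline's actual interface:

```python
def ground_spans(tool_output: str, spans: list[tuple[int, int]]) -> str:
    """Extract teacher-selected 1-based, inclusive line ranges verbatim
    from the tool output. Because every kept line is copied from the
    input, the label cannot contain hallucinated text."""
    lines = tool_output.splitlines()
    kept = []
    for start, end in spans:
        kept.extend(lines[start - 1:end])  # inclusive range, 1-based
    return "\n".join(kept)

label = ground_spans("a\nb\nc\nd", [(1, 1), (3, 4)])
# → "a\nc\nd"
```

This is what makes the distillation "zero-hallucination": a bad teacher span selects the wrong lines, but never invents new ones.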

## Data Generation

To regenerate the dataset from scratch:

```bash
# SWE-bench pipeline (requires cloned repos)
python squeez/data/pipeline.py --phase 3 4 5 6 7 \
  --output-dir data/v2 \
  --teacher-base-url http://localhost:8000/v1 \
  --teacher-model openai/gpt-oss-120b

# Filter empty samples
python scripts/filter_distilled.py data/v2/distilled_outputs.jsonl \
  --output data/v2/distilled_filtered.jsonl

# Synthetic multi-ecosystem
python scripts/generate_synthetic_data.py \
  --output data/v2/synthetic_train.jsonl \
  --base-url http://localhost:8000/v1 \
  --model openai/gpt-oss-120b

# Merge and assemble
cat data/v2/distilled_filtered.jsonl data/v2/synthetic_train.jsonl data/v2/synthetic_swe_style.jsonl > data/v2/distilled_outputs.jsonl
python squeez/data/pipeline.py --phase 7 --output-dir data/v2
```
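The `cat` merge step assumes every input file is valid newline-delimited JSON; one malformed line would silently poison the assembled set. A quick validation sketch (a hypothetical helper, not a script in the repo):

```python
import json

def count_valid_jsonl(text: str) -> int:
    """Parse each non-empty line as JSON; raise on the first bad line."""
    count = 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: {exc}") from exc
        count += 1
    return count

n = count_valid_jsonl('{"tool": "grep", "output": "x"}\n{"tool": "pytest"}')
# → 2
```

Running such a check on the concatenated file before the final phase-7 assembly catches truncated writes from any of the three generation paths.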
## Citation
