README.md (34 additions, 43 deletions)
@@ -189,7 +189,7 @@ filtered = extractor.extract(
 print(filtered)  # Only the relevant lines
 ```
 
-Both model types use the same `extract()` API. The generative model returns JSON (`{"relevant_lines": [...]}`), the encoder classifies each line directly. Both return filtered text.
+Both model types use the same `extract()` API. The generative model returns relevant lines in XML tags; the encoder classifies each line directly. Both return filtered text.
 
 ### Configuration
 
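To make the two strategies behind that shared API concrete, here is a self-contained toy sketch, not squeez's actual implementation: the `<relevant_lines>` tag name and the idea of passing the model as a plain callable are assumptions for illustration.

```python
import re
from typing import Callable

def extract_generative(tool_output: str, model: Callable[[str], str]) -> str:
    """Generative path: the model replies with relevant lines in XML tags.

    The <relevant_lines> tag name is an assumption; the README only says
    the response uses XML tags."""
    response = model(tool_output)
    match = re.search(r"<relevant_lines>\s*(.*?)\s*</relevant_lines>", response, re.S)
    return match.group(1) if match else ""

def extract_encoder(tool_output: str, keep_line: Callable[[str], bool]) -> str:
    """Encoder path: a per-line classifier decides keep/drop directly."""
    return "\n".join(l for l in tool_output.splitlines() if keep_line(l))

# Dummy stand-ins so the sketch runs without any model weights:
output = "collected 2 items\nPASSED test_login\nFAILED test_logout - AssertionError"
fake_model = lambda _: "<relevant_lines>\nFAILED test_logout - AssertionError\n</relevant_lines>"
fake_classifier = lambda line: "FAILED" in line

print(extract_generative(output, fake_model))    # the FAILED line only
print(extract_encoder(output, fake_classifier))  # same result via classification
```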
@@ -247,7 +247,7 @@ Also works with other coding agents (Codex CLI, OpenCode, etc.) via their equiva
 python scripts/download_data.py
 ```
 
-This pulls the [SWE-bench tool output dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) (7,148 train + 436 eval samples) from HuggingFace.
+This pulls the [tool output extraction dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) (8,241 train + 252 dev + 557 test) from HuggingFace.
 
 ### 2a. Train generative model (Qwen + LoRA)
 
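If you prefer to pull the data without the helper script, the standard `datasets` loader should work. Note that the split names below are taken from this README; the Hub may expose the dev split under a different name such as `validation`.

```python
from datasets import load_dataset

# Loads every split as a DatasetDict. Split names follow the README's
# train/dev/test wording; the Hub may call the dev split "validation".
ds = load_dataset("KRLabsOrg/tool-output-extraction-swebench")

print({name: len(split) for name, split in ds.items()})
# Expected roughly: {"train": 8241, "dev": 252, "test": 557}
print(ds["train"][0])  # inspect one sample to see the schema
```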
@@ -262,8 +262,8 @@ Default: Qwen 3.5 2B with LoRA (r=16, alpha=32). See `configs/default.yaml` for
 ### 2b. Train encoder model (mmBERT)
 
 ```bash
-# Prepare encoder-format data from the ChatML training data
-python scripts/prepare_encoder_data.py
+# Prepare encoder-format data from the downloaded splits
+python scripts/prepare_encoder_data.py --data-dir data
 
 # Train the encoder
 python -m squeez.encoder.train \
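The diff doesn't show the encoder data layout, so the following is only a guess at what "encoder-format" preparation amounts to: collapsing each sample into per-line binary labels (keep/drop) that a line-level classifier like mmBERT can train on. The field names are hypothetical, not the actual output of `prepare_encoder_data.py`.

```python
import json

# Hypothetical sample shape; the real fields produced by
# prepare_encoder_data.py may differ.
sample = {
    "tool_output": "collected 2 items\nPASSED test_login\nFAILED test_logout",
    "relevant_lines": ["FAILED test_logout"],
}

relevant = set(sample["relevant_lines"])
records = [
    {"text": line, "label": int(line in relevant)}   # 1 = keep, 0 = drop
    for line in sample["tool_output"].splitlines()
]
print(json.dumps(records, indent=2))
```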
@@ -295,51 +295,42 @@ Both produce the same metrics format (span F1, ROUGE-L, compression ratio) for d
 
 Training data: [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench)
 
-|  | Count |
-|---|---|
-| Train samples | 7,148 |
-| Eval samples | 436 |
-| With relevant lines | 3,985 (53%) |
-| Empty (not relevant) | 3,599 (47%) |
-| Avg compression | 86% |
-
-Built from 2,294 [SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) instances with real tool execution (git grep, git blame, pytest, ruff, etc.) against 12 repos. Teacher distillation by gpt-oss-120b on Groq.
-
-### Tool types
-
-| Tool Type | Count |
-|---|---|
-| read_file | 4,309 |
-| git_log | 840 |
-| grep | 575 |
-| build_output | 380 |
-| ls | 376 |
-| test_output | 344 |
-| python | 310 |
-| git_blame | 201 |
-| lint_output | 101 |
-| curl | 95 |
-| git_diff | 53 |
-
-## How It Works
-
-1. **Source**: SWE-bench test split (2,294 real GitHub issues)
-2. **Tool calls**: 3-7 synthetic tool calls per instance
-3. **Real execution**: All commands run against bare-cloned repos at the correct commit
-4. **Teacher distillation**: gpt-oss-120b selects relevant line ranges via JSON spans
-5. **Zero-hallucination extraction**: Teacher spans matched against original output — no generated text
-6. **Assembly**: Extracted lines formatted as `{"relevant_lines": [...]}` for SFT training
+|  | Train | Dev | Test | Total |
+|---|---:|---:|---:|---:|
+| Samples | 8,241 | 252 | 557 | 9,050 |
+
+Three data sources covering 30 tool types across multiple ecosystems:
+
+- **SWE-bench real data** (5,936) — Real tool output from `git grep`, `pytest`, `pip install`, `mypy`, etc. executed on 2,294 cloned Python repos. Labeled by teacher LLM distillation with grounded line spans.
+- **Synthetic multi-ecosystem** (2,039) — LLM-generated tool output for npm, TypeScript, Rust, Go, Java, Docker, Terraform, kubectl, and more.
+- **Synthetic SWE-style** (1,075) — LLM-generated versions of Python tool types that had high noise rates in the real data.
+
+The test set is manually curated. See the [dataset card](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) for full details on generation, filtering, and curation.
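As a closing illustration of what "grounded line spans" buys you: the teacher LLM only names line ranges, and the kept text is sliced verbatim out of the original output, so a label can never contain hallucinated text. The inclusive, 0-indexed `(start, end)` span format below is an assumption for the sketch.

```python
def apply_spans(tool_output: str, spans: list[tuple[int, int]]) -> str:
    """Slice teacher-selected line ranges verbatim from the original output.

    Spans are (start, end) line indices, inclusive and 0-indexed here;
    the dataset's actual span encoding may differ."""
    lines = tool_output.splitlines()
    kept: list[str] = []
    for start, end in spans:
        kept.extend(lines[start : end + 1])
    return "\n".join(kept)

output = "$ pytest\ncollected 3 items\nok 1\nok 2\nFAILED test_edge - IndexError"
print(apply_spans(output, [(4, 4)]))  # -> "FAILED test_edge - IndexError"
```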