This repo is a dataset-engineering workspace for turning a messy set of ~53 external “repo folders” into clean, reproducible corpora for fine-tuning NVIDIA Nemotron 3 Nano into an F# coding assistant.
We have three working scripts that establish the “measure → snapshot → corpus” backbone:
- Inventory
  - `scripts/inventory-fsharp-repos.ps1`
  - Writes:
    - `data/interim/repo_inventory.json`
    - `reports/inventory.md`
- Snapshot export (canonical clean source tree)
  - `scripts/export-fsharp-snapshot.ps1`
  - Writes:
    - `data/raw/fsharp_snapshot/`
    - `data/interim/fsharp_snapshot_manifest.json`
- JSONL extraction (training-friendly corpora)
  - `scripts/extract-fsharp-jsonl.ps1`
  - Writes:
    - `data/interim/corpus_code.jsonl`
    - `data/interim/corpus_docs.jsonl`
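
To sanity-check an extracted corpus before any downstream step, a quick summary of the JSONL output can help. The following is a minimal sketch in Python (not part of the repo's PowerShell scripts); the helper name `summarize_jsonl` is hypothetical, and it deliberately assumes nothing about the record schema beyond "one JSON value per line".

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, Counter of top-level keys) for a JSONL file."""
    count = 0
    keys = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a broken record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
            count += 1
            if isinstance(record, dict):
                keys.update(record.keys())
    return count, keys
```

For example, `summarize_jsonl("data/interim/corpus_code.jsonl")` returns the number of records and how often each top-level key appears, which makes missing or inconsistent fields visible at a glance.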
The “chronicle / handoff” lives in README.md.
- External source-of-truth for raw repos:
  - `A:/Repos/F# Repos`
  - We do not modify this in-place unless explicitly asked.
- Local artifacts:
  - `data/` is for big generated artifacts.
  - `reports/` is for small commit-worthy summaries.
- Always dry-run first for anything that copies/writes a lot:
  - inventory: `-WhatIf`
  - snapshot export: `-DryRun`
  - JSONL extraction: `-DryRun`
- Canonical clean input for downstream steps is the exported snapshot:
  - `data/raw/fsharp_snapshot/`
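
The dry-run convention above can be illustrated with a small sketch. This is Python rather than the repo's PowerShell, and it is not how the existing scripts are implemented; the function name `copy_tree_files` and its `dry_run` parameter are hypothetical. The idea is that the plan is computed the same way in both modes, and only the write step is skipped during a dry run.

```python
import shutil
from pathlib import Path


def copy_tree_files(src_root, dst_root, dry_run=True):
    """Copy every file under src_root to dst_root; with dry_run, only return the plan."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    planned = []
    # Sort so the plan is deterministic and easy to diff between runs.
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        dst = dst_root / src.relative_to(src_root)
        planned.append((src, dst))
        if not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
    return planned
```

Defaulting `dry_run` to `True` means a real copy has to be requested explicitly, which matches the "dry-run first" rule.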
Inventory:

    pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\inventory-fsharp-repos.ps1 `
      -ReposRoot "A:/Repos/F# Repos"

Snapshot export, dry run:

    pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\export-fsharp-snapshot.ps1 `
      -ReposRoot "A:/Repos/F# Repos" -DryRun

Snapshot export, real export:

    pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\export-fsharp-snapshot.ps1 `
      -ReposRoot "A:/Repos/F# Repos" `
      -ExportRoot "data/raw/fsharp_snapshot" `
      -OutManifest "data/interim/fsharp_snapshot_manifest.json"

JSONL extraction, dry run:

    pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\extract-fsharp-jsonl.ps1 `
      -SnapshotRoot "data/raw/fsharp_snapshot" -DryRun

JSONL extraction, real extraction:

    pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\extract-fsharp-jsonl.ps1 `
      -SnapshotRoot "data/raw/fsharp_snapshot" `
      -OutCodeJsonl "data/interim/corpus_code.jsonl" `
      -OutDocsJsonl "data/interim/corpus_docs.jsonl"

Next pipeline steps:

- Normalize: deterministic normalization (LF, encoding, whitespace) + optional Fantomas formatting.
- Deduplicate: exact + near-dup with audit reports.
- Chunk: split long files into training chunks with provenance spans.
- Split by repo: train/valid/test by repo to prevent leakage.
- Verification gates: compile/test/FSI where possible.
- SFT synthesis (later): generator → judge → repair (F#-specific tasks).
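
The normalize, deduplicate, chunk, and split-by-repo steps listed above could be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not an existing script in this repo: the record fields `repo` and `text`, the function names, the line-based chunking, and the 5%/5% validation/test shares are all hypothetical. Near-duplicate detection, Fantomas formatting, and audit reports are left out.

```python
import hashlib


def normalize_text(text):
    """Convert CRLF/CR to LF, strip trailing whitespace per line, end with one newline."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    body = "\n".join(line.rstrip() for line in lines).rstrip("\n")
    return body + "\n" if body else ""


def dedupe_exact(records):
    """Keep the first record for each distinct normalized text (exact duplicates only)."""
    seen = set()
    kept = []
    for record in records:  # each record is assumed to be a dict with a "text" field
        digest = hashlib.sha256(normalize_text(record["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


def chunk_lines(text, max_lines):
    """Split text into chunks of at most max_lines lines, with 1-based inclusive line spans."""
    lines = text.splitlines(keepends=True)
    chunks = []
    for start in range(0, len(lines), max_lines):
        piece = lines[start:start + max_lines]
        chunks.append({
            "start_line": start + 1,
            "end_line": start + len(piece),
            "text": "".join(piece),
        })
    return chunks


def assign_split(repo, valid_pct=5, test_pct=5):
    """Map a repo name to train/valid/test deterministically via a hash bucket in 0..99."""
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"
```

Because `assign_split` depends only on the repo name, every chunk from a given repo lands in the same split, which is what prevents train/test leakage. The line spans returned by `chunk_lines` give each chunk a provenance pointer back into its source file.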
When writing complex features or significant refactors, use an ExecPlan (as described in .agent/PLANS.md) from design to implementation.
Follow these by default:
- Architectural changes:
  - Read `docs/project_notes/decisions.md` first.
  - If you want to contradict an ADR, explain why and propose an ADR revision.
- Bugs/errors:
  - Search `docs/project_notes/bugs.md` first.
  - If new, add an entry once fixed (include Prevention).
- Configuration/constants:
  - Use `docs/project_notes/key_facts.md` as source of truth.
  - Do NOT invent ports/URLs/paths. Look them up.
- Work tracking:
  - Use `docs/project_notes/issues.md` as a lightweight work log with dates + ticket links.