Agent Instructions — DataScience F# Dataset Pipeline

This repo is a dataset-engineering workspace for turning a messy set of ~53 external “repo folders” into clean, reproducible corpora for fine-tuning NVIDIA Nemotron 3 Nano into an F# coding assistant.

Current phase: scaffolding (what exists today)

We have three working scripts that establish the “measure → snapshot → corpus” backbone:

Inventory

scripts/inventory-fsharp-repos.ps1
Writes:
- data/interim/repo_inventory.json
- reports/inventory.md

Snapshot export (canonical clean source tree)

scripts/export-fsharp-snapshot.ps1
Writes:
- data/raw/fsharp_snapshot/
- data/interim/fsharp_snapshot_manifest.json

JSONL extraction (training-friendly corpora)

scripts/extract-fsharp-jsonl.ps1
Writes:
- data/interim/corpus_code.jsonl
- data/interim/corpus_docs.jsonl

The “chronicle / handoff” lives in README.md.

Ground rules (non-negotiables)

External source-of-truth for raw repos: A:/Repos/F# Repos
- We do not modify this in place unless explicitly asked.
Local artifacts:
- data/ is for big generated artifacts.
- reports/ is for small commit-worthy summaries.
Always dry-run first for anything that copies/writes a lot:
- inventory: -WhatIf
- snapshot export: -DryRun
- JSONL extraction: -DryRun
Canonical clean input for downstream steps is the exported snapshot:
- data/raw/fsharp_snapshot/

Standard workflow (PowerShell 7)

1) Inventory (measure first)

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\inventory-fsharp-repos.ps1 `
  -ReposRoot "A:/Repos/F# Repos"

2) Export clean snapshot

Dry run:

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\export-fsharp-snapshot.ps1 `
  -ReposRoot "A:/Repos/F# Repos" -DryRun

Real export:

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\export-fsharp-snapshot.ps1 `
  -ReposRoot "A:/Repos/F# Repos" `
  -ExportRoot "data/raw/fsharp_snapshot" `
  -OutManifest "data/interim/fsharp_snapshot_manifest.json"

3) Extract JSONL corpora

Dry run:

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\extract-fsharp-jsonl.ps1 `
  -SnapshotRoot "data/raw/fsharp_snapshot" -DryRun

Real extraction:

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\extract-fsharp-jsonl.ps1 `
  -SnapshotRoot "data/raw/fsharp_snapshot" `
  -OutCodeJsonl "data/interim/corpus_code.jsonl" `
  -OutDocsJsonl "data/interim/corpus_docs.jsonl"
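
After a real extraction, a quick structural check of the two JSONL files can catch truncated or malformed output before later steps consume it. The following is a minimal sketch in Python, not part of the existing scripts: the function names `check_jsonl_lines` and `check_jsonl_file` are hypothetical, and the check deliberately assumes nothing about the record fields, only that each non-blank line should be a JSON object.

```python
import json
from pathlib import Path


def check_jsonl_lines(lines):
    """Return (record_count, bad_line_numbers) for an iterable of JSONL lines.

    A line counts as bad if it is not valid JSON or is not a JSON object.
    Blank lines are skipped. No field names are assumed.
    """
    count = 0
    bad = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(line_number)
            continue
        if isinstance(record, dict):
            count += 1
        else:
            bad.append(line_number)
    return count, bad


def check_jsonl_file(path):
    """Run check_jsonl_lines over a file; utf-8-sig tolerates an optional BOM."""
    with Path(path).open(encoding="utf-8-sig") as handle:
        return check_jsonl_lines(handle)


if __name__ == "__main__":
    # Paths are the extraction outputs listed in this document.
    for name in ("data/interim/corpus_code.jsonl", "data/interim/corpus_docs.jsonl"):
        if Path(name).exists():
            records, bad_lines = check_jsonl_file(name)
            print(f"{name}: {records} records, {len(bad_lines)} bad lines")
        else:
            print(f"{name}: not found (run the extraction first)")
```

Run from the repo root, this prints one summary line per corpus file. Any non-zero bad-line count is a signal to re-run the extraction rather than patch the JSONL by hand.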

Near-term roadmap (next milestones)

Normalize: deterministic normalization (LF, encoding, whitespace) + optional Fantomas formatting.
Deduplicate: exact and near-duplicate detection, with audit reports.
Chunk: split long files into training chunks with provenance spans.
Split by repo: train/valid/test by repo to prevent leakage.
Verification gates: compile, test, and F# Interactive (FSI) checks where possible.
SFT synthesis (later): generator → judge → repair (F#-specific tasks).
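
To make three of the roadmap items concrete (normalize, exact deduplicate, split by repo), here is a minimal sketch in Python. None of this exists in the repo yet, and everything in it is an assumption made for illustration: the record fields `repo` and `text`, the function names, and the 5%/5% valid/test defaults. It covers only LF and trailing-whitespace normalization on already-decoded text. Encoding detection, Fantomas formatting, and near-duplicate detection are out of scope.

```python
import hashlib


def normalize_text(text):
    """Convert line endings to LF, strip trailing whitespace, end with one newline."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    body = "\n".join(line.rstrip() for line in lines).rstrip("\n")
    return body + "\n" if body else ""


def content_key(text):
    """SHA-256 of the normalized text, used as the exact-duplicate key."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


def dedupe_exact(records):
    """Keep the first record per content key.

    Returns (kept, dropped); each dropped entry is a
    (duplicate, first_occurrence) pair so an audit report can be written.
    """
    seen = {}
    kept, dropped = [], []
    for record in records:
        key = content_key(record["text"])  # "text" is a hypothetical field name
        if key in seen:
            dropped.append((record, seen[key]))
        else:
            seen[key] = record
            kept.append(record)
    return kept, dropped


def assign_split(repo, valid_pct=5, test_pct=5):
    """Map a repo name to train/valid/test via a stable hash bucket (0-99).

    Hashing the repo name (not the file) keeps every file from one repo
    in the same split, and the result does not depend on input order.
    """
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"


if __name__ == "__main__":
    records = [
        {"repo": "RepoA", "text": "let x = 1\r\n"},
        {"repo": "RepoB", "text": "let x = 1   \n"},  # same content after normalization
        {"repo": "RepoB", "text": "let y = 2\n"},
    ]
    kept, dropped = dedupe_exact(records)
    for record in kept:
        print(record["repo"], assign_split(record["repo"]))
    print(f"dropped {len(dropped)} exact duplicate(s)")
```

Deduplicating before splitting means a file copied across two repos is kept only once, so the copy cannot end up in both train and test. With only ~53 repos, hash buckets will not hit the target percentages exactly, so the real implementation may prefer an explicit, reviewed repo-to-split assignment.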

Agent Operating Rules

ExecPlans

When writing complex features or significant refactors, use an ExecPlan (as described in .agent/PLANS.md) from design to implementation.

Project Memory System

Memory-Aware Protocols

Follow these by default:

Architectural changes:

Read docs/project_notes/decisions.md first.
If you want to contradict an ADR, explain why and propose an ADR revision.

Bugs/errors:

Search docs/project_notes/bugs.md first.
If new, add an entry once fixed (include Prevention).

Configuration/constants:

Use docs/project_notes/key_facts.md as source of truth.
Do NOT invent ports/URLs/paths. Look them up.

Work tracking:

Use docs/project_notes/issues.md as a lightweight work log with dates + ticket links.
