Agent Instructions — DataScience F# Dataset Pipeline

This repo is a dataset-engineering workspace for turning a messy set of ~53 external “repo folders” into clean, reproducible corpora for fine-tuning NVIDIA Nemotron 3 Nano into an F# coding assistant.

Current phase: scaffolding (what exists today)

We have three working scripts that establish the “measure → snapshot → corpus” backbone:

  1. Inventory
     • scripts/inventory-fsharp-repos.ps1
     • Writes:
       • data/interim/repo_inventory.json
       • reports/inventory.md
  2. Snapshot export (canonical clean source tree)
     • scripts/export-fsharp-snapshot.ps1
     • Writes:
       • data/raw/fsharp_snapshot/
       • data/interim/fsharp_snapshot_manifest.json
  3. JSONL extraction (training-friendly corpora)
     • scripts/extract-fsharp-jsonl.ps1
     • Writes:
       • data/interim/corpus_code.jsonl
       • data/interim/corpus_docs.jsonl

The “chronicle / handoff” lives in README.md.

Ground rules (non-negotiables)

  • External source-of-truth for raw repos: A:/Repos/F# Repos
    • We do not modify this in-place unless explicitly asked.
  • Local artifacts:
    • data/ is for big generated artifacts.
    • reports/ is for small commit-worthy summaries.
  • Always dry-run first for anything that copies/writes a lot:
    • inventory: -WhatIf
    • snapshot export: -DryRun
    • JSONL extraction: -DryRun
  • Canonical clean input for downstream steps is the exported snapshot:
    • data/raw/fsharp_snapshot/

Standard workflow (PowerShell 7)

1) Inventory (measure first)

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\inventory-fsharp-repos.ps1 `
  -ReposRoot "A:/Repos/F# Repos"

2) Export clean snapshot

Dry run:

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\export-fsharp-snapshot.ps1 `
  -ReposRoot "A:/Repos/F# Repos" -DryRun

Real export:

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\export-fsharp-snapshot.ps1 `
  -ReposRoot "A:/Repos/F# Repos" `
  -ExportRoot "data/raw/fsharp_snapshot" `
  -OutManifest "data/interim/fsharp_snapshot_manifest.json"

3) Extract JSONL corpora

Dry run:

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\extract-fsharp-jsonl.ps1 `
  -SnapshotRoot "data/raw/fsharp_snapshot" -DryRun

Real extraction:

pwsh -NoProfile -ExecutionPolicy Bypass -File .\scripts\extract-fsharp-jsonl.ps1 `
  -SnapshotRoot "data/raw/fsharp_snapshot" `
  -OutCodeJsonl "data/interim/corpus_code.jsonl" `
  -OutDocsJsonl "data/interim/corpus_docs.jsonl"
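
After a real extraction it can help to confirm that each output file is well-formed JSONL before anything downstream reads it. The sketch below is not part of the repo's scripts. It is a minimal Python check that assumes only that every line should be a single JSON object, and it makes no assumption about which fields the extraction script emits. The function name `check_jsonl` is hypothetical.

```python
import json


def check_jsonl(lines):
    """Validate an iterable of JSONL lines.

    Returns (ok_count, errors), where errors is a list of
    (line_number, message) tuples for lines that are blank,
    not valid JSON, or not a JSON object.
    """
    ok, errors = 0, []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            errors.append((number, "blank line"))
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(value, dict):
            errors.append((number, "record is not a JSON object"))
            continue
        ok += 1
    return ok, errors
```

To use it, open data/interim/corpus_code.jsonl (or corpus_docs.jsonl) with encoding="utf-8" and pass the file handle to check_jsonl. An empty error list means every line parsed as an object.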

Near-term roadmap (next milestones)

  • Normalize: deterministic normalization (LF, encoding, whitespace) + optional Fantomas formatting.
  • Deduplicate: exact + near-dup with audit reports.
  • Chunk: split long files into training chunks with provenance spans.
  • Split by repo: train/valid/test by repo to prevent leakage.
  • Verification gates: compile/test/FSI where possible.
  • SFT synthesis (later): generator → judge → repair (F#-specific tasks).
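
As a rough illustration of the first few roadmap items, here is a minimal Python sketch of deterministic normalization, exact deduplication with an audit trail, and a repo-level split. None of this exists in the repo yet. The record fields (`repo`, `path`, `text`), the function names, and the split percentages are all assumptions made for the example. Near-duplicate detection and Fantomas formatting are out of scope here.

```python
import hashlib


def normalize_text(text: str) -> str:
    """Convert line endings to LF, strip trailing whitespace per line,
    and end non-empty content with exactly one newline."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    body = "\n".join(line.rstrip() for line in lines).rstrip("\n")
    return body + "\n" if body else ""


def content_key(text: str) -> str:
    """SHA-256 of the normalized text, used as an exact-duplicate key."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


def dedupe_exact(records):
    """Keep the first record per distinct normalized content.

    Returns (kept, dropped), where dropped lists
    (duplicate_path, original_path) pairs for an audit report.
    Field names 'path' and 'text' are hypothetical.
    """
    seen = {}
    kept, dropped = [], []
    for rec in records:
        key = content_key(rec["text"])
        if key in seen:
            dropped.append((rec["path"], seen[key]))
        else:
            seen[key] = rec["path"]
            kept.append(rec)
    return kept, dropped


def assign_split(repo: str, valid_pct: int = 5, test_pct: int = 5) -> str:
    """Map a repo name to train/valid/test with a stable hash, so every
    file from one repo lands in the same split (no cross-split leakage)."""
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"
```

Hashing the repo name rather than sampling files at random keeps the assignment reproducible across runs and machines. With only ~53 repos, though, the realized split sizes can drift well away from the nominal percentages.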

Agent Operating Rules

ExecPlans

When writing complex features or significant refactors, use an ExecPlan (as described in .agent/PLANS.md) from design to implementation.

Project Memory System

Memory-Aware Protocols

Follow these by default:

  1. Architectural changes:
     • Read docs/project_notes/decisions.md first.
     • If you want to contradict an ADR, explain why and propose an ADR revision.
  2. Bugs/errors:
     • Search docs/project_notes/bugs.md first.
     • If new, add an entry once fixed (include Prevention).
  3. Configuration/constants:
     • Use docs/project_notes/key_facts.md as source of truth.
     • Do NOT invent ports/URLs/paths. Look them up.
  4. Work tracking:
     • Use docs/project_notes/issues.md as a lightweight work log with dates + ticket links.