🤖 Standalone data-ingestion task.
Description
Source and prepare GH Archive event payloads as pretraining data. These records cover real GitHub-shaped JSON: event metadata, actor/repo fields, issue and PR payloads, timestamps, URLs, IDs, and nested tool/API objects.
Primary source:
The dataset should go through license/governance review before inclusion. Ingestion should also preserve a date-based held-out split for eval, and no GH Archive eval slice may end up in the training data.
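A minimal sketch of how a date-based held-out split could be assigned per event. It assumes each record carries a `created_at` timestamp in the form `2024-01-15T12:34:56Z`; the cutoff dates and the `assign_split` helper are hypothetical placeholders, not values decided for this task.

```python
from datetime import date, datetime

# Hypothetical cutoffs for illustration only; the real boundaries
# should be fixed and documented before any ingest happens.
DEV_START = date(2024, 1, 1)
TEST_START = date(2024, 2, 1)


def assign_split(created_at: str) -> str:
    """Map an event timestamp to 'train', 'dev', or 'test' by calendar date.

    Assumes timestamps look like '2024-01-15T12:34:56Z' (UTC).
    """
    day = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").date()
    if day >= TEST_START:
        return "test"
    if day >= DEV_START:
        return "dev"
    return "train"
```

Splitting on the event date rather than on a random hash keeps the eval slice strictly later than the training data, which makes it easier to verify that no eval records were trained on.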
Definition of Done
- Document license/governance status and any redistribution constraints.
- Build a downloader/normalizer for selected date ranges and event types.
- Define train/dev/test date splits before ingest.
- Mask or bucket fields that would dominate the token count without providing useful learning signal, such as long hashes, where appropriate.
- Propose mixture weights and byte/token estimates.
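The normalizer and masking items above could be sketched roughly as follows. This is an illustration under assumptions, not a settled design: it assumes the input is one JSON event per line with a top-level `type` field, treats any 32 to 64 character hex run as a "long hash", and uses a hypothetical `<HASH>` placeholder. The `mask_hashes` and `normalize_line` names are invented for this example.

```python
import json
import re
from typing import Any, Optional, Set

# Assumption: "long hashes" means hex runs of 32-64 characters
# (covers MD5, SHA-1, and SHA-256 lengths).
HEX_RUN = re.compile(r"\b[0-9a-fA-F]{32,64}\b")
HASH_PLACEHOLDER = "<HASH>"  # hypothetical placeholder token


def mask_hashes(value: Any) -> Any:
    """Recursively replace long hex runs in string values; keys are left alone."""
    if isinstance(value, str):
        return HEX_RUN.sub(HASH_PLACEHOLDER, value)
    if isinstance(value, list):
        return [mask_hashes(item) for item in value]
    if isinstance(value, dict):
        return {key: mask_hashes(item) for key, item in value.items()}
    return value


def normalize_line(line: str, keep_types: Set[str]) -> Optional[str]:
    """Parse one JSON event line, filter by event type, and mask hashes.

    Returns compact JSON with sorted keys, or None if the event type
    is not in keep_types.
    """
    event = json.loads(line)
    if event.get("type") not in keep_types:
        return None
    masked = mask_hashes(event)
    return json.dumps(masked, sort_keys=True, separators=(",", ":"))
```

A short usage example:

```python
raw = json.dumps({"type": "PushEvent", "payload": {"head": "a" * 40}})
print(normalize_line(raw, {"PushEvent"}))
```

Whether to mask, bucket, or keep hashes verbatim is still an open choice; masking is shown here only because it is the simplest of the three to demonstrate.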