[data] Add GH Archive JSON events to pretraining data #5099

@dlwh

🤖 Standalone data-ingestion task.

Description

Source and prepare GH Archive event payloads as pretraining data. These records cover real GitHub-shaped JSON: event metadata, actor/repo fields, issue and PR payloads, timestamps, URLs, IDs, and nested tool/API objects.

Primary source:

This source needs a license/governance review before inclusion. The ingest should also preserve a date-based held-out split for eval and avoid training on any GH Archive eval slice.
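A date-based held-out split could be sketched as below. This is only an illustration: the boundary dates `DEV_START` and `TEST_START` and the helper `split_for_event` are hypothetical and not specified in this issue. It assumes each event carries a `created_at` timestamp in the ISO 8601 UTC form GitHub uses (e.g. `2015-01-01T15:00:00Z`).

```python
from datetime import date, datetime

# Hypothetical boundaries; the real ones should be fixed before ingest.
DEV_START = date(2024, 1, 1)
TEST_START = date(2024, 7, 1)


def split_for_event(event: dict) -> str:
    """Assign an event to train/dev/test purely from its creation date."""
    # Assumes a timestamp such as "2015-01-01T15:00:00Z".
    created = datetime.strptime(event["created_at"], "%Y-%m-%dT%H:%M:%SZ").date()
    if created >= TEST_START:
        return "test"
    if created >= DEV_START:
        return "dev"
    return "train"
```

Because the split depends only on the date, the same event always lands in the same split regardless of download order, which makes it easier to keep eval slices out of training data.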

Definition of Done

  • Document license/governance status and any redistribution constraints.
  • Build a downloader/normalizer for selected date ranges and event types.
  • Define train/dev/test date splits before ingest.
  • Mask or bucket fields that would dominate the data without contributing useful learning signal, such as long hashes, where appropriate.
  • Propose mixture weights and byte/token estimates.
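The normalizer and masking items above might look roughly like the following sketch. Everything named here is an assumption for illustration: the selected event types in `KEEP_TYPES`, the `<sha1>` placeholder, and the choice to mask only 40-character lowercase hex strings. It takes newline-delimited JSON events as input and leaves downloading out of scope.

```python
import json
import re
from typing import Any, Iterable, Iterator

# Assumed selection of event types; the issue does not fix this list.
KEEP_TYPES = {"IssuesEvent", "PullRequestEvent"}

# Matches 40-character lowercase hex strings (SHA-1-style commit hashes).
SHA_RE = re.compile(r"\b[0-9a-f]{40}\b")
SHA_PLACEHOLDER = "<sha1>"  # hypothetical placeholder token


def mask_hashes(value: Any) -> Any:
    """Recursively replace hash-like substrings in every string value."""
    if isinstance(value, str):
        return SHA_RE.sub(SHA_PLACEHOLDER, value)
    if isinstance(value, list):
        return [mask_hashes(v) for v in value]
    if isinstance(value, dict):
        return {k: mask_hashes(v) for k, v in value.items()}
    return value


def normalize(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact, key-sorted JSON for the selected event types only."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") not in KEEP_TYPES:
            continue
        yield json.dumps(mask_hashes(event), sort_keys=True, separators=(",", ":"))
```

Feeding it the lines of a decompressed archive file would yield one normalized JSON string per kept event. Only string values are masked, so dictionary keys and numeric IDs pass through unchanged.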
