Commit 0c38907

ravwojdyla and claude committed
testbed: target 1T tokens for the by-provenance sample
Raises ``RAW_TARGET_TOTAL_TOKENS_B`` (and the baseline / variants per-run ``TARGET_TOTAL_TOKENS_B`` constants that feed ``build_testbed_steps``) from 10 B to 1000 B, the RFC-canonical 1 T testbed.

Drops the stale ``# TODO(rav): update this to 1T`` comment off the settings default.

``scripts/datakit/run_source_sampling.py`` stays at 10 B because it's an explicit smoke tool, not a production entrypoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent cfe0c63 commit 0c38907

3 files changed

Lines changed: 3 additions & 4 deletions

experiments/datakit_testbed/baseline.py

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@
 logger = logging.getLogger(__name__)

 STAGING_PREFIX = "gs://marin-us-central1"
-TARGET_TOTAL_TOKENS_B = 10.0
+TARGET_TOTAL_TOKENS_B = 1000.0

 _SAMPLE_STEP_PREFIX = "data/datakit/"

experiments/datakit_testbed/settings.py

Lines changed: 1 addition & 2 deletions
@@ -23,8 +23,7 @@
 source in the registry must either be pre-staged there or downloadable into it.
 """

-# TODO(rav): update this to 1T
-RAW_TARGET_TOTAL_TOKENS_B: float = 10.0
+RAW_TARGET_TOTAL_TOKENS_B: float = 1000.0
 """Target size (billions of tokens) for the pre-normalize by-provenance sample.

 Drives per-source sampling fractions via
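
The docstring above says the target drives per-source sampling fractions, but the diff cuts off before showing how. As a rough illustration only, the sketch below assumes a uniform proportional scheme: every source is sampled at the same rate, chosen so the sample totals roughly the target. The function `sampling_fractions` and the source sizes are hypothetical and are not taken from the repository, whose actual logic may differ.

```python
# Hypothetical sketch: NOT the repository's implementation.
# Assumes every source is sampled at one shared rate, capped at 1.0.


def sampling_fractions(
    source_tokens_b: dict[str, float], target_total_b: float
) -> dict[str, float]:
    """Return a per-source sampling fraction for a target sample size.

    Both the source sizes and the target are in billions of tokens.
    """
    total_b = sum(source_tokens_b.values())
    if total_b <= 0:
        raise ValueError("sources must contain a positive number of tokens")
    # If the target exceeds what is available, keep everything (fraction 1.0).
    fraction = min(1.0, target_total_b / total_b)
    return {name: fraction for name in source_tokens_b}


# Made-up source sizes (billions of tokens), for illustration only.
sources_b = {"web": 3000.0, "code": 800.0, "papers": 200.0}

smoke = sampling_fractions(sources_b, target_total_b=10.0)      # old 10 B target
testbed = sampling_fractions(sources_b, target_total_b=1000.0)  # new 1 T target

# Expected sampled tokens per source under the new target.
sampled_b = {name: sources_b[name] * testbed[name] for name in sources_b}
```

Under these made-up sizes, raising the target from 10 B to 1000 B raises each source's fraction by the same factor of 100, which is why a single constant change can rescale the whole sample.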

experiments/datakit_testbed/variants.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -39,7 +39,7 @@
3939
logger = logging.getLogger(__name__)
4040

4141
STAGING_PREFIX = "gs://marin-us-central1"
42-
TARGET_TOTAL_TOKENS_B = 10.0
42+
TARGET_TOTAL_TOKENS_B = 1000.0
4343

4444
_SAMPLE_STEP_PREFIX = "data/datakit/"
4545
_FUZZY_DUPS_MAX_PARALLELISM = 128
