Add starcoder2data-extras download and tokenization #4599
Conversation
Adds download/reshard helpers for the bigcode/starcoder2data-extras subsets (ir_cpp, ir_python, ir_rust, ir_low_resource, documentation, kaggle) and an experiment script that tokenizes each subset with the marin tokenizer. ir_low_resource is resharded to even out its parquet files and given 80g worker RAM.
Claude finished @Helw150's task in 3m 21s — Review of PR #4599
Summary

Nice addition for the starcoder2data-extras subsets. The download helper follows existing patterns well. I found two bugs that will prevent this from working at runtime, plus a few smaller issues. See inline comments for details and suggested fixes.

Bugs (must fix)

Minor issues
```python
default_tokenize(
    name=f"starcoder2_extras/{subset}",
    dataset=download.as_executor_step(),
    tokenizer=tokenizer,
    format=TextLmDatasetFormat(text_key="content"),
    worker_resources=ResourceConfig(ram=ram, disk="10g"),
)
```
🤖 Bug: default_tokenize does not accept worker_resources
default_tokenize in experiments/defaults.py:201 has this signature:
```python
def default_tokenize(
    name: str,
    dataset: InputName | ExecutorStep | str | HfDatasetSpec,
    tokenizer: str,
    format: LmDatasetFormatBase = TextLmDatasetFormat(),
    *,
    sample_count: int | VersionedValue[int] | None = None,
    is_validation: bool = False,
) -> ExecutorStep:
```

There is no `worker_resources` parameter, so this will raise `TypeError` at runtime.
To set custom worker resources, you'll need to use TokenizeConfig + ExecutorStep directly (the pattern used in experiments/pretraining_datasets/dolmino.py and nemotron_v2.py), or pass worker_resources on the TokenizeConfig object:
```python
from marin.processing.tokenize import TokenizeConfig, tokenize
from marin.execution.executor import ExecutorStep, this_output_path, versioned

step = ExecutorStep(
    name=f"tokenized/starcoder2_extras/{subset}",
    fn=tokenize,
    config=TokenizeConfig(
        train_paths=[download.as_executor_step()],
        validation_paths=versioned([]),
        cache_path=this_output_path(),
        tokenizer=versioned(tokenizer),
        format=TextLmDatasetFormat(text_key="content"),
        worker_resources=ResourceConfig(ram=ram, disk="10g"),
    ),
)
```

```python
meta = pq.read_metadata(file_path)
if meta.serialized_size <= target_bytes:
```
🤖 Bug: serialized_size is the Parquet metadata size, not the file/data size
`pq.read_metadata(file_path).serialized_size` returns the serialized size of the Parquet metadata footer (typically a few KB), not the total data size. This means:

- `meta.serialized_size <= target_bytes` will almost always be `True`, so files are never split
- The `rows_per_shard` calculation on line 60 divides by a tiny number, producing huge shards
To get the actual data size, sum the row group sizes:
```diff
  meta = pq.read_metadata(file_path)
- if meta.serialized_size <= target_bytes:
+ data_size = sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
+ if data_size <= target_bytes:
```
And update line 60 similarly:
```python
rows_per_shard = max(1, (table.num_rows * target_bytes) // data_size)
```

```python
logger = logging.getLogger(__name__)
input_path = raw_output_path
fs, _ = url_to_fs(input_path)
files = sorted(f"gs://{f}" for f in fs.glob(f"{input_path}/**/*.parquet") if not f.endswith("/.parquet"))
```
🤖 Nit: hardcoded gs:// prefix
fs.glob() returns paths without the protocol prefix, and this re-adds gs://. This couples the code to GCS. Consider deriving the prefix from input_path instead, e.g.:
```python
from urllib.parse import urlparse

scheme = urlparse(input_path).scheme
prefix = f"{scheme}://" if scheme else ""
files = sorted(f"{prefix}{f}" for f in fs.glob(...) if not f.endswith("/.parquet"))
```

```python
return StepSpec(
    name=f"resharded/starcoder2_extras/{subset}",
    fn=_run,
    deps=[raw],
)
```
🤖 Missing hash_attrs for target_shard_mb
The StepSpec doesn't include target_shard_mb in hash_attrs. If someone changes the shard size parameter, the cached output won't be invalidated.
```diff
  return StepSpec(
      name=f"resharded/starcoder2_extras/{subset}",
      fn=_run,
      deps=[raw],
+     hash_attrs={"target_shard_mb": target_shard_mb},
  )
```
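Since `StepSpec` and `hash_attrs` are project-internal, here is a generic stdlib sketch of the caching principle behind this suggestion: any parameter left out of the hashed attributes cannot influence the cache key, so changing it never invalidates cached output. The `cache_key` helper below is hypothetical, not the project's implementation.

```python
import hashlib
import json

def cache_key(name: str, hash_attrs: dict) -> str:
    # Derive a deterministic key from the step name plus the attributes
    # that should invalidate the cache when they change.
    payload = json.dumps({"name": name, "attrs": hash_attrs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Without target_shard_mb in the hashed attributes, two configs that differ
# only in shard size collapse to the same key (stale cache hit):
k1 = cache_key("resharded/starcoder2_extras/ir_low_resource", {})
k2 = cache_key("resharded/starcoder2_extras/ir_low_resource", {})
assert k1 == k2

# With it included, changing the parameter changes the key:
k3 = cache_key("resharded/starcoder2_extras/ir_low_resource", {"target_shard_mb": 256})
k4 = cache_key("resharded/starcoder2_extras/ir_low_resource", {"target_shard_mb": 512})
assert k3 != k4
```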
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f048673486
```python
    dataset=download.as_executor_step(),
    tokenizer=tokenizer,
    format=TextLmDatasetFormat(text_key="content"),
    worker_resources=ResourceConfig(ram=ram, disk="10g"),
```
Remove unsupported `worker_resources` kwarg
default_tokenize does not define a worker_resources parameter, so this call path will fail immediately with TypeError: default_tokenize() got an unexpected keyword argument 'worker_resources' when tokenize_starcoder2_extras() is executed. That means the new StarCoder2 extras experiment cannot run at all.
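A minimal reproduction of this failure mode, using a stub that mirrors the keyword-only signature quoted in the earlier review comment (not the real marin function):

```python
# Hypothetical stub with default_tokenize's assumed signature;
# it reproduces the failure mode of the reviewed call site.
def default_tokenize(name, dataset, tokenizer, format=None, *, sample_count=None, is_validation=False):
    return {"name": name, "dataset": dataset}

try:
    default_tokenize(
        "starcoder2_extras/kaggle",
        dataset="<dataset>",
        tokenizer="<tokenizer>",
        worker_resources={"ram": "80g"},  # unsupported kwarg, as in the PR
    )
except TypeError as err:
    message = str(err)

# message names the rejected kwarg, e.g.
# "default_tokenize() got an unexpected keyword argument 'worker_resources'"
```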
```python
meta = pq.read_metadata(file_path)
if meta.serialized_size <= target_bytes:
```
Use actual parquet file size for shard splitting
This logic uses meta.serialized_size to decide whether to split a parquet file, but that field is the serialized metadata/footer size, not the data file size. As a result, large files are typically treated as "small" and copied without splitting, so the reshard step does not actually rebalance oversized shards as intended.
```python
# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

"""StarCoder2 data extras: download and tokenize ir_cpp, ir_python, ir_rust, ir_low_resource, documentation."""
```
🤖 Nit: docstring missing "kaggle"
The docstring lists subsets but omits "kaggle", which is in SUBSETS.
| """StarCoder2 data extras: download and tokenize ir_cpp, ir_python, ir_rust, ir_low_resource, documentation.""" | |
| """StarCoder2 data extras: download and tokenize ir_cpp, ir_python, ir_rust, ir_low_resource, documentation, kaggle.""" |
|
I'm stealing this PR --> #4626