
Add starcoder2data-extras download and tokenization#4599

Draft
Helw150 wants to merge 1 commit into main from add-starcoder2-extras

Conversation

@Helw150 Helw150 (Member) commented Apr 9, 2026

Adds download/reshard helpers for the bigcode/starcoder2data-extras subsets (ir_cpp, ir_python, ir_rust, ir_low_resource, documentation, kaggle) and an experiment script that tokenizes each subset with the marin tokenizer. ir_low_resource is resharded to even out its parquet files and given 80g worker RAM.

@Helw150 Helw150 added the agent-generated Created by automation/agent label Apr 9, 2026
claude bot (Contributor) commented Apr 9, 2026

Claude finished @Helw150's task in 3m 21s


Review of PR #4599

  • Read changed files and gather context
  • Review against existing codebase patterns
  • Provide review feedback

Summary

Nice addition for the starcoder2data-extras subsets. The download helper follows existing patterns well. I found two bugs that will prevent this from working at runtime, plus a few smaller issues. See inline comments for details and suggested fixes.

Bugs (must fix)

  1. default_tokenize doesn't accept worker_resources (inline) — The experiment script passes worker_resources to default_tokenize(), but that function doesn't have this parameter, so the call will raise a TypeError at runtime. You'll need to construct TokenizeConfig + ExecutorStep directly (the pattern used by dolmino.py and nemotron_v2.py), which also lets you set worker_resources on the config.

  2. meta.serialized_size is metadata size, not data size (inline) — pq.read_metadata().serialized_size returns the Parquet footer metadata size (a few KB), not the actual file data size. This makes the resharding effectively a no-op — every file looks "small". Use sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups)) instead.

Minor issues

  1. Missing hash_attrs for target_shard_mb (inline) — Changing the shard size won't invalidate the cache.

  2. Hardcoded gs:// prefix (inline) — The glob result reconstruction assumes GCS.

  3. Docstring missing "kaggle" (inline) — The experiment docstring lists subsets but omits "kaggle".

@Helw150 Helw150 marked this pull request as draft April 9, 2026 22:56
Comment on lines +33 to +39
default_tokenize(
    name=f"starcoder2_extras/{subset}",
    dataset=download.as_executor_step(),
    tokenizer=tokenizer,
    format=TextLmDatasetFormat(text_key="content"),
    worker_resources=ResourceConfig(ram=ram, disk="10g"),
)

🤖 Bug: default_tokenize does not accept worker_resources

default_tokenize in experiments/defaults.py:201 has this signature:

def default_tokenize(
    name: str,
    dataset: InputName | ExecutorStep | str | HfDatasetSpec,
    tokenizer: str,
    format: LmDatasetFormatBase = TextLmDatasetFormat(),
    *,
    sample_count: int | VersionedValue[int] | None = None,
    is_validation: bool = False,
) -> ExecutorStep:

There is no worker_resources parameter — this will raise TypeError at runtime.

To set custom worker resources, construct TokenizeConfig + ExecutorStep directly (the pattern used in experiments/pretraining_datasets/dolmino.py and nemotron_v2.py) and set worker_resources on the TokenizeConfig object:

from marin.processing.tokenize import TokenizeConfig, tokenize
from marin.execution.executor import ExecutorStep, this_output_path, versioned

step = ExecutorStep(
    name=f"tokenized/starcoder2_extras/{subset}",
    fn=tokenize,
    config=TokenizeConfig(
        train_paths=[download.as_executor_step()],
        validation_paths=versioned([]),
        cache_path=this_output_path(),
        tokenizer=versioned(tokenizer),
        format=TextLmDatasetFormat(text_key="content"),
        worker_resources=ResourceConfig(ram=ram, disk="10g"),
    ),
)

Comment on lines +49 to +50
meta = pq.read_metadata(file_path)
if meta.serialized_size <= target_bytes:

🤖 Bug: serialized_size is the Parquet metadata size, not the file/data size

pq.read_metadata(file_path).serialized_size returns the serialized size of the Parquet metadata footer (typically a few KB), not the total data size. This means:

  • meta.serialized_size <= target_bytes will almost always be True, so files are never split
  • The rows_per_shard calculation on line 60 divides by a tiny number, producing huge shards

To get the actual data size, sum the row group sizes:

Suggested change
meta = pq.read_metadata(file_path)
data_size = sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
if data_size <= target_bytes:

And update line 60 similarly:

rows_per_shard = max(1, (table.num_rows * target_bytes) // data_size)
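The corrected arithmetic can be exercised in isolation. A minimal sketch of the split logic, assuming a hypothetical helper name (plan_shards is not part of the PR; num_rows/data_size stand in for the parquet metadata values):

```python
# Hypothetical sketch of the resharding arithmetic (names are illustrative,
# not the PR's actual helpers): given a file's true data size and row count,
# split it into row ranges of roughly target_bytes each.

def plan_shards(num_rows: int, data_size: int, target_bytes: int) -> list[tuple[int, int]]:
    """Return (start_row, end_row) slices covering all rows."""
    if data_size <= target_bytes:
        return [(0, num_rows)]  # small file: copy through unchanged
    # Rows per shard, scaled so each shard holds ~target_bytes of data.
    rows_per_shard = max(1, (num_rows * target_bytes) // data_size)
    return [
        (start, min(start + rows_per_shard, num_rows))
        for start in range(0, num_rows, rows_per_shard)
    ]
```

With the buggy serialized_size (a few KB), the first branch is almost always taken and nothing is ever split; with the real data size, a 1000-byte file and a 250-byte target yield four even row ranges.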

logger = logging.getLogger(__name__)
input_path = raw_output_path
fs, _ = url_to_fs(input_path)
files = sorted(f"gs://{f}" for f in fs.glob(f"{input_path}/**/*.parquet") if not f.endswith("/.parquet"))

🤖 Nit: hardcoded gs:// prefix

fs.glob() returns paths without the protocol prefix, and this re-adds gs://. This couples the code to GCS. Consider deriving the prefix from input_path instead, e.g.:

from urllib.parse import urlparse
scheme = urlparse(input_path).scheme
prefix = f"{scheme}://" if scheme else ""
files = sorted(f"{prefix}{f}" for f in fs.glob(...) if not f.endswith("/.parquet"))
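Wrapped as a small helper, the scheme-preserving reconstruction looks like this (with_protocol is an illustrative name, not code from the PR):

```python
# Sketch of deriving the protocol prefix from the input path instead of
# hardcoding "gs://", so the same code works for GCS, S3, and local paths.
from urllib.parse import urlparse

def with_protocol(input_path: str, bare_path: str) -> str:
    """Re-attach input_path's scheme to a scheme-less path from fs.glob()."""
    scheme = urlparse(input_path).scheme
    prefix = f"{scheme}://" if scheme else ""
    return f"{prefix}{bare_path}"
```

Local filesystem paths have no scheme, so they pass through unchanged, while gs:// or s3:// inputs get their original prefix back.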

Comment on lines +75 to +79
return StepSpec(
    name=f"resharded/starcoder2_extras/{subset}",
    fn=_run,
    deps=[raw],
)

🤖 Missing hash_attrs for target_shard_mb

The StepSpec doesn't include target_shard_mb in hash_attrs. If someone changes the shard size parameter, the cached output won't be invalidated.

Suggested change
return StepSpec(
    name=f"resharded/starcoder2_extras/{subset}",
    fn=_run,
    deps=[raw],
    hash_attrs={"target_shard_mb": target_shard_mb},
)

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f048673486


    dataset=download.as_executor_step(),
    tokenizer=tokenizer,
    format=TextLmDatasetFormat(text_key="content"),
    worker_resources=ResourceConfig(ram=ram, disk="10g"),

P1: Remove unsupported worker_resources kwarg

default_tokenize does not define a worker_resources parameter, so this call path will fail immediately with TypeError: default_tokenize() got an unexpected keyword argument 'worker_resources' when tokenize_starcoder2_extras() is executed. That means the new StarCoder2 extras experiment cannot run at all.


Comment on lines +49 to +50
meta = pq.read_metadata(file_path)
if meta.serialized_size <= target_bytes:

P1: Use actual parquet file size for shard splitting

This logic uses meta.serialized_size to decide whether to split a parquet file, but that field is the serialized metadata/footer size, not the data file size. As a result, large files are typically treated as "small" and copied without splitting, so the reshard step does not actually rebalance oversized shards as intended.


# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

"""StarCoder2 data extras: download and tokenize ir_cpp, ir_python, ir_rust, ir_low_resource, documentation."""

🤖 Nit: docstring missing "kaggle"

The docstring lists subsets but omits "kaggle", which is in SUBSETS.

Suggested change
"""StarCoder2 data extras: download and tokenize ir_cpp, ir_python, ir_rust, ir_low_resource, documentation, kaggle."""

@ravwojdyla (Contributor)

I'm stealing this PR --> #4626


