Add starcoder2data-extras download and tokenization #4599
Conversation
Adds download/reshard helpers for the bigcode/starcoder2data-extras subsets (ir_cpp, ir_python, ir_rust, ir_low_resource, documentation, kaggle) and an experiment script that tokenizes each subset with the marin tokenizer. ir_low_resource is resharded to even out its parquet files and given 80g worker RAM.
Claude finished @Helw150's task in 3m 21s — Review of PR #4599
Summary

Nice addition for the starcoder2data-extras subsets. The download helper follows existing patterns well. I found two bugs that will prevent this from working at runtime, plus a few smaller issues. See inline comments for details and suggested fixes.

Bugs (must fix)

Minor issues
```python
default_tokenize(
    name=f"starcoder2_extras/{subset}",
    dataset=download.as_executor_step(),
    tokenizer=tokenizer,
    format=TextLmDatasetFormat(text_key="content"),
    worker_resources=ResourceConfig(ram=ram, disk="10g"),
)
```
🤖 Bug: default_tokenize does not accept worker_resources
default_tokenize in experiments/defaults.py:201 has this signature:
```python
def default_tokenize(
    name: str,
    dataset: InputName | ExecutorStep | str | HfDatasetSpec,
    tokenizer: str,
    format: LmDatasetFormatBase = TextLmDatasetFormat(),
    *,
    sample_count: int | VersionedValue[int] | None = None,
    is_validation: bool = False,
) -> ExecutorStep:
```

There is no `worker_resources` parameter, so this will raise `TypeError` at runtime.
To set custom worker resources, you'll need to use TokenizeConfig + ExecutorStep directly (the pattern used in experiments/pretraining_datasets/dolmino.py and nemotron_v2.py), or pass worker_resources on the TokenizeConfig object:
```python
from marin.processing.tokenize import TokenizeConfig, tokenize
from marin.execution.executor import ExecutorStep, this_output_path, versioned

step = ExecutorStep(
    name=f"tokenized/starcoder2_extras/{subset}",
    fn=tokenize,
    config=TokenizeConfig(
        train_paths=[download.as_executor_step()],
        validation_paths=versioned([]),
        cache_path=this_output_path(),
        tokenizer=versioned(tokenizer),
        format=TextLmDatasetFormat(text_key="content"),
        worker_resources=ResourceConfig(ram=ram, disk="10g"),
    ),
)
```

```python
meta = pq.read_metadata(file_path)
if meta.serialized_size <= target_bytes:
```
🤖 Bug: serialized_size is the Parquet metadata size, not the file/data size
`pq.read_metadata(file_path).serialized_size` returns the serialized size of the Parquet metadata footer (typically a few KB), not the total data size. This means:

- `meta.serialized_size <= target_bytes` will almost always be `True`, so files are never split
- The `rows_per_shard` calculation on line 60 divides by a tiny number, producing huge shards
To get the actual data size, sum the row group sizes:
```diff
  meta = pq.read_metadata(file_path)
- if meta.serialized_size <= target_bytes:
+ data_size = sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
+ if data_size <= target_bytes:
```
And update line 60 similarly:
```python
rows_per_shard = max(1, (table.num_rows * target_bytes) // data_size)
```

```python
logger = logging.getLogger(__name__)
input_path = raw_output_path
fs, _ = url_to_fs(input_path)
files = sorted(f"gs://{f}" for f in fs.glob(f"{input_path}/**/*.parquet") if not f.endswith("/.parquet"))
```
🤖 Nit: hardcoded gs:// prefix
fs.glob() returns paths without the protocol prefix, and this re-adds gs://. This couples the code to GCS. Consider deriving the prefix from input_path instead, e.g.:
```python
from urllib.parse import urlparse

scheme = urlparse(input_path).scheme
prefix = f"{scheme}://" if scheme else ""
files = sorted(f"{prefix}{f}" for f in fs.glob(...) if not f.endswith("/.parquet"))
```

```python
return StepSpec(
    name=f"resharded/starcoder2_extras/{subset}",
    fn=_run,
    deps=[raw],
)
```
🤖 Missing hash_attrs for target_shard_mb
The StepSpec doesn't include target_shard_mb in hash_attrs. If someone changes the shard size parameter, the cached output won't be invalidated.
```diff
  return StepSpec(
      name=f"resharded/starcoder2_extras/{subset}",
      fn=_run,
      deps=[raw],
+     hash_attrs={"target_shard_mb": target_shard_mb},
  )
```
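Since `StepSpec` and `hash_attrs` are project-internal, here is a generic stdlib sketch of the caching principle behind this suggestion: any parameter left out of the hashed attributes cannot influence the cache key, so changing it never invalidates cached output. The `cache_key` helper below is hypothetical, not the project's implementation.

```python
import hashlib
import json

def cache_key(name: str, hash_attrs: dict) -> str:
    # Derive a deterministic key from the step name plus the attributes
    # that should invalidate the cache when they change.
    payload = json.dumps({"name": name, "attrs": hash_attrs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Without target_shard_mb in the hashed attributes, two configs that differ
# only in shard size collapse to the same key (stale cache hit):
k1 = cache_key("resharded/starcoder2_extras/ir_low_resource", {})
k2 = cache_key("resharded/starcoder2_extras/ir_low_resource", {})
assert k1 == k2

# With it included, changing the parameter changes the key:
k3 = cache_key("resharded/starcoder2_extras/ir_low_resource", {"target_shard_mb": 256})
k4 = cache_key("resharded/starcoder2_extras/ir_low_resource", {"target_shard_mb": 512})
assert k3 != k4
```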
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f048673486
```python
    dataset=download.as_executor_step(),
    tokenizer=tokenizer,
    format=TextLmDatasetFormat(text_key="content"),
    worker_resources=ResourceConfig(ram=ram, disk="10g"),
```
Remove unsupported `worker_resources` kwarg
default_tokenize does not define a worker_resources parameter, so this call path will fail immediately with TypeError: default_tokenize() got an unexpected keyword argument 'worker_resources' when tokenize_starcoder2_extras() is executed. That means the new StarCoder2 extras experiment cannot run at all.
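A minimal reproduction of this failure mode, using a stub that mirrors the keyword-only signature quoted in the earlier review comment (not the real marin function):

```python
# Hypothetical stub with default_tokenize's assumed signature;
# it reproduces the failure mode of the reviewed call site.
def default_tokenize(name, dataset, tokenizer, format=None, *, sample_count=None, is_validation=False):
    return {"name": name, "dataset": dataset}

try:
    default_tokenize(
        "starcoder2_extras/kaggle",
        dataset="<dataset>",
        tokenizer="<tokenizer>",
        worker_resources={"ram": "80g"},  # unsupported kwarg, as in the PR
    )
except TypeError as err:
    message = str(err)

# message names the rejected kwarg, e.g.
# "default_tokenize() got an unexpected keyword argument 'worker_resources'"
```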
```python
meta = pq.read_metadata(file_path)
if meta.serialized_size <= target_bytes:
```
Use actual parquet file size for shard splitting
This logic uses meta.serialized_size to decide whether to split a parquet file, but that field is the serialized metadata/footer size, not the data file size. As a result, large files are typically treated as "small" and copied without splitting, so the reshard step does not actually rebalance oversized shards as intended.
```python
# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

"""StarCoder2 data extras: download and tokenize ir_cpp, ir_python, ir_rust, ir_low_resource, documentation."""
```
🤖 Nit: docstring missing "kaggle"
The docstring lists subsets but omits "kaggle", which is in SUBSETS.
| """StarCoder2 data extras: download and tokenize ir_cpp, ir_python, ir_rust, ir_low_resource, documentation.""" | |
| """StarCoder2 data extras: download and tokenize ir_cpp, ir_python, ir_rust, ir_low_resource, documentation, kaggle.""" |
|
I'm stealing this PR --> #4626