Skip to content

Fasttext classifier code doesn't work in distributed/S3 environments #4183

@rjpower

Description

@rjpower

Problem

The fasttext classifier training and inference code (marin.processing.classification) has several issues that prevent it from running in distributed environments with S3-compatible storage (e.g. R2 on CoreWeave):

1. fs.makedirs("/tmp/...") on S3FileSystem tries CreateBucket

classifier.py:84 calls fs.makedirs(f"/tmp/{model_descriptor}") where fs is an S3FileSystem (from url_to_fs(model_name) when the model is on S3). This interprets /tmp as an S3 bucket name and calls CreateBucket, which fails with AccessDenied.

2. read_dataset_streaming uses HuggingFace datasets which injects incompatible kwargs

dataset_utils.py used datasets.load_dataset("json", ...) for reading .jsonl.gz files. The datasets library's CompressionFilesystem injects requote_redirect_url=False (an aiohttp parameter) into client_kwargs, which gets forwarded to aiobotocore.session.create_client() and raises TypeError: AioSession._create_client() got an unexpected keyword argument 'requote_redirect_url'.

3. Single-node assumptions

  • FileLock for coordinating model downloads only works on a single node
  • atexit.register(lambda: os.unlink(local_filepath)) deletes the model on process exit, forcing re-download for every Ray actor

Fixes applied in PR #4174

  • Replaced datasets.load_dataset with direct fsspec reads (fix 2)
  • Changed fs.makedirs to os.makedirs for local /tmp paths (fix 1)
  • Removed classifier steps from integration test since they can't reliably run on a cluster

TODO

  • Properly fix the classifier to work in distributed environments
  • Re-add classifier steps to the integration test once fixed

🤖 Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions