Commit 995f167

Remove uses of FUSE (#2308)
## Description
Remove all uses of gcsfuse, replacing them with either direct reads from GCS or, in one case, a download from GCS to a temporary directory.
1 parent 76595f5 commit 995f167

40 files changed

Lines changed: 918 additions & 618 deletions

AGENTS.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@
 - Put all imports at the top of the file. Avoid local imports unless technically necessary (for example, to break circular dependencies or guard optional dependencies).
 - Prefer top-level functions when code does not mutate shared state; use classes to encapsulate data when that improves clarity.
 - Prefer top-level Python tests and fixtures.
+- Disprefer internal mutation of function arguments, especially config dataclasses; prefer returning a modified copy (e.g., via `dataclasses.replace`) so call sites remain predictable and side effects are explicit.
 - Use early returns (`if not x: return None`) when they reduce nesting.
 - Do not introduce ad-hoc compatibility hacks like `hasattr(m, "old_attr")`; update the code consistently instead.
 - Do not use `from __future__ import ...` statements.
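
The new guideline on argument mutation can be illustrated with a minimal sketch. `TrainConfig` and `with_output_path` are hypothetical names invented for this example and do not come from the repository; only `dataclasses.replace` is the standard-library call the guideline mentions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical config, used only for this illustration."""

    output_path: str
    num_steps: int = 1000


def with_output_path(config: TrainConfig, output_path: str) -> TrainConfig:
    """Return a copy of `config` with a new output path; the input is left untouched."""
    return replace(config, output_path=output_path)


base = TrainConfig(output_path="gs://example-bucket/run-a")
updated = with_output_path(base, "gs://example-bucket/run-b")
```

Marking the dataclass `frozen=True` is optional, but it turns accidental in-place mutation into an error, so the caller's `base` object is guaranteed to stay as it was.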

experiments/exp1342_gemstones_scaling_law.py

Lines changed: 3 additions & 4 deletions
@@ -21,16 +21,15 @@
 Usage:
 1. Import the model step you want to use
 2. Run executor_main([model_step]) to download
-3. Use get_model_local_path(model_step) to get the local path
+3. Use the model step's output path for downstream jobs
 
 Example:
 ```
 from gemstones import gemstone_768x45
-from marin.execution.executor import executor_main
-from experiments.models import get_model_local_path
+from marin.execution.executor import executor_main, output_path_of
 
 executor_main([gemstone_768x45])
-local_path = get_model_local_path(gemstone_768x45)
+model_path = output_path_of(gemstone_768x45)
 ```
 """
 

experiments/models.py

Lines changed: 3 additions & 17 deletions
@@ -17,20 +17,16 @@
 Usage:
 1. If you have a model you want to download from huggingface, add the repo name and config in MODEL_NAME_TO_CONFIG.
 2. Run download_model_step(MODEL_NAME_TO_CONFIG[model_name]) to download the model.
-3. Use get_model_local_path(model_name) to get the local path of the model.
 
 Example:
 ```
 model_name = "meta-llama/Llama-3.1-8B-Instruct"
 model_config = MODEL_NAME_TO_CONFIG[model_name]
 download_step = download_model_step(model_config)
 executor_main([download_step])
-
-local_path = get_model_local_path(model_name)
 ```
 """
 
-import os
 from dataclasses import dataclass
 
 from marin.download.huggingface.download_hf import DownloadConfig, download_hf
@@ -44,19 +40,14 @@ class ModelConfig:
     hf_revision: str
 
 
-# We utilize GCSFuse because our disk space is limited on TPUs.
-# This means that for certain large models (e.g. Llama 70B), we will not be able
-# to fit the models on local disk. We use GCSFuse to mount the GCS bucket to the local filesystem
-# to be able to download and use these large models.
-LOCAL_PREFIX = "/opt"
-GCS_FUSE_MOUNT_PATH = "gcsfuse_mount/models"
+MODEL_OUTPUT_SUBDIR = "models"
 
 
 def download_model_step(model_config: ModelConfig) -> ExecutorStep:
     model_name = get_directory_friendly_name(model_config.hf_repo_id)
     model_revision = get_directory_friendly_name(model_config.hf_revision)
     download_step = ExecutorStep(
-        name=f"{GCS_FUSE_MOUNT_PATH}/{model_name}--{model_revision}",
+        name=f"{MODEL_OUTPUT_SUBDIR}/{model_name}--{model_revision}",
         fn=download_hf,
         config=DownloadConfig(
             hf_dataset_id=model_config.hf_repo_id,
@@ -67,17 +58,12 @@ def download_model_step(model_config: ModelConfig) -> ExecutorStep:
         ),
         # must override because it because if we don't then it will end in a hash
         # if it ends in a hash, then we cannot determine the local path
-        override_output_path=f"{GCS_FUSE_MOUNT_PATH}/{model_name}--{model_revision}",
+        override_output_path=f"{MODEL_OUTPUT_SUBDIR}/{model_name}--{model_revision}",
     )
 
     return download_step
 
 
-def get_model_local_path(step: ExecutorStep) -> str:
-    model_repo_name = step.name[len(GCS_FUSE_MOUNT_PATH) + 1 :]
-    return os.path.join(LOCAL_PREFIX, GCS_FUSE_MOUNT_PATH, model_repo_name)
-
-
 smollm2_1_7b_instruct = download_model_step(
     ModelConfig(
         hf_repo_id="HuggingFaceTB/SmolLM2-1.7B-Instruct",
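
With the FUSE mount gone, a model's location is a hash-free path under the `models` subdirectory rather than a path under `/opt`. The sketch below shows how such a path might be composed. It rests on two assumptions that the diff does not confirm: that the relative `override_output_path` is resolved against `MARIN_PREFIX`, and that directory-friendly names replace `/` with `--`. `friendly_name` and `model_output_path` are hypothetical stand-ins, not the repository's helpers.

```python
import os
import posixpath
from typing import Optional

MODEL_OUTPUT_SUBDIR = "models"  # same constant name and value as in the diff


def friendly_name(value: str) -> str:
    # Hypothetical stand-in for get_directory_friendly_name; the real helper may differ.
    return value.replace("/", "--")


def model_output_path(hf_repo_id: str, hf_revision: str, prefix: Optional[str] = None) -> str:
    # Assumption: the relative output path is joined onto MARIN_PREFIX.
    # The "/tmp/marin" fallback is invented for this sketch.
    prefix = prefix or os.environ.get("MARIN_PREFIX", "/tmp/marin")
    relative = f"{MODEL_OUTPUT_SUBDIR}/{friendly_name(hf_repo_id)}--{friendly_name(hf_revision)}"
    return posixpath.join(prefix, relative)


example_path = model_output_path("org/model", "abc123", prefix="gs://example-bucket")
```

Because the path contains no hash, downstream jobs can derive it from the repo id and revision alone, which is the property the `override_output_path` comment in the diff is protecting.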

infra/marin-big-run.yaml

Lines changed: 15 additions & 7 deletions
@@ -37,6 +37,13 @@ docker:
 - --shm-size=100gb
 - -v
 - "/tmp:/tmp"
+- -e MARIN_PREFIX=gs://marin-us-central2
+- -e BUCKET=marin-us-central2
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 # this lets the worker run docker commands and have them run as sibling containers
 - -v "/var/run/docker.sock:/var/run/docker.sock"
 
@@ -48,6 +55,13 @@ docker:
 - -v "/tmp:/tmp"
 - --ulimit nofile=1048576:1048576
 - -e RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS=64
+- -e MARIN_PREFIX=gs://marin-us-central2
+- -e BUCKET=marin-us-central2
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 
 - -e RAY_AUTH_MODE=token
 - -e RAY_AUTH_TOKEN_PATH=/home/ray/.ray/auth_token
@@ -79,15 +93,9 @@ setup_commands:
 - gcloud secrets versions access latest --secret=RAY_AUTH_TOKEN > $HOME/.ray/auth_token
 - chmod 600 $HOME/.ray/auth_token
 
-- echo 'export MARIN_PREFIX="gs://marin-us-central2"' >> $HOME/.bashrc
-- echo 'export BUCKET="marin-us-central2"' >> $HOME/.bashrc
+- mkdir -p /tmp/marin-cache
 # cf https://github.com/ray-project/ray/blob/0bc6ec86ffd0fc0d4e43fb339ffe0ac03ee5531b/python/ray/autoscaler/_private/constants.py#L66
 # this is set to 30s by default, which is much too short for our use case
-- echo 'export AUTOSCALER_HEARTBEAT_TIMEOUT_S=600' >> $HOME/.bashrc
-- echo 'export TPU_MIN_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_STDERR_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_LOG_DIR=disabled' >> $HOME/.bashrc
-- gcsfuse --implicit-dirs --client-protocol grpc --cache-dir /dev/shm --file-cache-max-size-mb 160000 --only-dir gcsfuse_mount $BUCKET /opt/gcsfuse_mount || true
 
 worker_setup_commands:
 # delete any old ray session data
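
The cluster configs now pass `MARIN_LOCAL_CACHE_DIR` into the containers and create `/tmp/marin-cache` at setup time. How the repository's code consumes that variable is not shown in this part of the diff; the following is only a sketch of one way a job could resolve the directory. `local_cache_dir` is a hypothetical helper and the fallback location is an assumption.

```python
import os
import tempfile
from pathlib import Path


def local_cache_dir(subdir: str = "") -> Path:
    """Resolve and create a local cache directory (hypothetical helper)."""
    # MARIN_LOCAL_CACHE_DIR is the variable set in the cluster configs;
    # the fallback under the system temp dir is an assumption for this sketch.
    root = os.environ.get("MARIN_LOCAL_CACHE_DIR") or os.path.join(
        tempfile.gettempdir(), "marin-cache"
    )
    path = Path(root) / subdir
    path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    return path
```

Creating the directory on demand keeps the helper usable outside the cluster, where the `mkdir -p /tmp/marin-cache` setup command has not run.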

infra/marin-cluster-template.yaml

Lines changed: 15 additions & 7 deletions
@@ -33,6 +33,13 @@ docker:
 - --shm-size=100gb
 - -v
 - "/tmp:/tmp"
+- -e MARIN_PREFIX=gs://{{BUCKET}}
+- -e BUCKET={{BUCKET}}
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 # this lets the worker run docker commands and have them run as sibling containers
 - -v "/var/run/docker.sock:/var/run/docker.sock"
 {% if RAY_AUTH_MODE == "token" %}
@@ -44,6 +51,13 @@ docker:
 - -v "/tmp:/tmp"
 - --ulimit nofile=1048576:1048576
 - -e RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS=64
+- -e MARIN_PREFIX=gs://{{BUCKET}}
+- -e BUCKET={{BUCKET}}
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 {% if RAY_AUTH_MODE == "token" %}
 - -e RAY_AUTH_MODE=token
 - -e RAY_AUTH_TOKEN_PATH=/home/ray/.ray/auth_token
@@ -75,15 +89,9 @@ setup_commands:
 - gcloud secrets versions access latest --secret={{ RAY_AUTH_SECRET }} > $HOME/.ray/auth_token
 - chmod 600 $HOME/.ray/auth_token
 {% endif %}
-- echo 'export MARIN_PREFIX="gs://{{BUCKET}}"' >> $HOME/.bashrc
-- echo 'export BUCKET="{{BUCKET}}"' >> $HOME/.bashrc
+- mkdir -p /tmp/marin-cache
 # cf https://github.com/ray-project/ray/blob/0bc6ec86ffd0fc0d4e43fb339ffe0ac03ee5531b/python/ray/autoscaler/_private/constants.py#L66
 # this is set to 30s by default, which is much too short for our use case
-- echo 'export AUTOSCALER_HEARTBEAT_TIMEOUT_S=600' >> $HOME/.bashrc
-- echo 'export TPU_MIN_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_STDERR_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_LOG_DIR=disabled' >> $HOME/.bashrc
-- gcsfuse --implicit-dirs --client-protocol grpc --cache-dir /dev/shm --file-cache-max-size-mb 160000 --only-dir gcsfuse_mount $BUCKET /opt/gcsfuse_mount || true
 
 worker_setup_commands:
 # delete any old ray session data

infra/marin-eu-west4-a.yaml

Lines changed: 15 additions & 7 deletions
@@ -37,6 +37,13 @@ docker:
 - --shm-size=100gb
 - -v
 - "/tmp:/tmp"
+- -e MARIN_PREFIX=gs://marin-eu-west4
+- -e BUCKET=marin-eu-west4
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 # this lets the worker run docker commands and have them run as sibling containers
 - -v "/var/run/docker.sock:/var/run/docker.sock"
 
@@ -48,6 +55,13 @@ docker:
 - -v "/tmp:/tmp"
 - --ulimit nofile=1048576:1048576
 - -e RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS=64
+- -e MARIN_PREFIX=gs://marin-eu-west4
+- -e BUCKET=marin-eu-west4
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 
 - -e RAY_AUTH_MODE=token
 - -e RAY_AUTH_TOKEN_PATH=/home/ray/.ray/auth_token
@@ -79,15 +93,9 @@ setup_commands:
 - gcloud secrets versions access latest --secret=RAY_AUTH_TOKEN > $HOME/.ray/auth_token
 - chmod 600 $HOME/.ray/auth_token
 
-- echo 'export MARIN_PREFIX="gs://marin-eu-west4"' >> $HOME/.bashrc
-- echo 'export BUCKET="marin-eu-west4"' >> $HOME/.bashrc
+- mkdir -p /tmp/marin-cache
 # cf https://github.com/ray-project/ray/blob/0bc6ec86ffd0fc0d4e43fb339ffe0ac03ee5531b/python/ray/autoscaler/_private/constants.py#L66
 # this is set to 30s by default, which is much too short for our use case
-- echo 'export AUTOSCALER_HEARTBEAT_TIMEOUT_S=600' >> $HOME/.bashrc
-- echo 'export TPU_MIN_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_STDERR_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_LOG_DIR=disabled' >> $HOME/.bashrc
-- gcsfuse --implicit-dirs --client-protocol grpc --cache-dir /dev/shm --file-cache-max-size-mb 160000 --only-dir gcsfuse_mount $BUCKET /opt/gcsfuse_mount || true
 
 worker_setup_commands:
 # delete any old ray session data

infra/marin-eu-west4-vllm.yaml

Lines changed: 15 additions & 7 deletions
@@ -34,6 +34,13 @@ docker:
 - --shm-size=200gb
 - -v
 - "/tmp:/tmp"
+- -e MARIN_PREFIX=gs://marin-eu-west4
+- -e BUCKET=marin-eu-west4
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 # this lets the worker run docker commands and have them run as sibling containers
 - -v "/var/run/docker.sock:/var/run/docker.sock"
 
@@ -45,6 +52,13 @@ docker:
 - -v "/tmp:/tmp"
 - --ulimit nofile=1048576:1048576
 - -e RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS=64
+- -e MARIN_PREFIX=gs://marin-eu-west4
+- -e BUCKET=marin-eu-west4
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 
 - -e RAY_AUTH_MODE=token
 - -e RAY_AUTH_TOKEN_PATH=/home/ray/.ray/auth_token
@@ -76,15 +90,9 @@ setup_commands:
 - gcloud secrets versions access latest --secret=RAY_AUTH_TOKEN > $HOME/.ray/auth_token
 - chmod 600 $HOME/.ray/auth_token
 
-- echo 'export MARIN_PREFIX="gs://marin-eu-west4"' >> $HOME/.bashrc
-- echo 'export BUCKET="marin-eu-west4"' >> $HOME/.bashrc
+- mkdir -p /tmp/marin-cache
 # cf https://github.com/ray-project/ray/blob/0bc6ec86ffd0fc0d4e43fb339ffe0ac03ee5531b/python/ray/autoscaler/_private/constants.py#L66
 # this is set to 30s by default, which is much too short for our use case
-- echo 'export AUTOSCALER_HEARTBEAT_TIMEOUT_S=600' >> $HOME/.bashrc
-- echo 'export TPU_MIN_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_STDERR_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_LOG_DIR=disabled' >> $HOME/.bashrc
-- gcsfuse --implicit-dirs --client-protocol grpc --cache-dir /dev/shm --file-cache-max-size-mb 160000 --only-dir gcsfuse_mount $BUCKET /opt/gcsfuse_mount || true
 
 worker_setup_commands:
 # delete any old ray session data

infra/marin-eu-west4.yaml

Lines changed: 15 additions & 7 deletions
@@ -37,6 +37,13 @@ docker:
 - --shm-size=100gb
 - -v
 - "/tmp:/tmp"
+- -e MARIN_PREFIX=gs://marin-eu-west4
+- -e BUCKET=marin-eu-west4
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 # this lets the worker run docker commands and have them run as sibling containers
 - -v "/var/run/docker.sock:/var/run/docker.sock"
 
@@ -48,6 +55,13 @@ docker:
 - -v "/tmp:/tmp"
 - --ulimit nofile=1048576:1048576
 - -e RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS=64
+- -e MARIN_PREFIX=gs://marin-eu-west4
+- -e BUCKET=marin-eu-west4
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 
 - -e RAY_AUTH_MODE=token
 - -e RAY_AUTH_TOKEN_PATH=/home/ray/.ray/auth_token
@@ -79,15 +93,9 @@ setup_commands:
 - gcloud secrets versions access latest --secret=RAY_AUTH_TOKEN > $HOME/.ray/auth_token
 - chmod 600 $HOME/.ray/auth_token
 
-- echo 'export MARIN_PREFIX="gs://marin-eu-west4"' >> $HOME/.bashrc
-- echo 'export BUCKET="marin-eu-west4"' >> $HOME/.bashrc
+- mkdir -p /tmp/marin-cache
 # cf https://github.com/ray-project/ray/blob/0bc6ec86ffd0fc0d4e43fb339ffe0ac03ee5531b/python/ray/autoscaler/_private/constants.py#L66
 # this is set to 30s by default, which is much too short for our use case
-- echo 'export AUTOSCALER_HEARTBEAT_TIMEOUT_S=600' >> $HOME/.bashrc
-- echo 'export TPU_MIN_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_STDERR_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_LOG_DIR=disabled' >> $HOME/.bashrc
-- gcsfuse --implicit-dirs --client-protocol grpc --cache-dir /dev/shm --file-cache-max-size-mb 160000 --only-dir gcsfuse_mount $BUCKET /opt/gcsfuse_mount || true
 
 worker_setup_commands:
 # delete any old ray session data

infra/marin-us-central1-vllm.yaml

Lines changed: 15 additions & 7 deletions
@@ -34,6 +34,13 @@ docker:
 - --shm-size=200gb
 - -v
 - "/tmp:/tmp"
+- -e MARIN_PREFIX=gs://marin-us-central1
+- -e BUCKET=marin-us-central1
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 # this lets the worker run docker commands and have them run as sibling containers
 - -v "/var/run/docker.sock:/var/run/docker.sock"
 
@@ -45,6 +52,13 @@ docker:
 - -v "/tmp:/tmp"
 - --ulimit nofile=1048576:1048576
 - -e RAY_TPU_MAX_CONCURRENT_ACTIVE_CONNECTIONS=64
+- -e MARIN_PREFIX=gs://marin-us-central1
+- -e BUCKET=marin-us-central1
+- -e MARIN_LOCAL_CACHE_DIR=/tmp/marin-cache
+- -e AUTOSCALER_HEARTBEAT_TIMEOUT_S=600
+- -e TPU_MIN_LOG_LEVEL=3
+- -e TPU_STDERR_LOG_LEVEL=3
+- -e TPU_LOG_DIR=disabled
 
 - -e RAY_AUTH_MODE=token
 - -e RAY_AUTH_TOKEN_PATH=/home/ray/.ray/auth_token
@@ -76,15 +90,9 @@ setup_commands:
 - gcloud secrets versions access latest --secret=RAY_AUTH_TOKEN > $HOME/.ray/auth_token
 - chmod 600 $HOME/.ray/auth_token
 
-- echo 'export MARIN_PREFIX="gs://marin-us-central1"' >> $HOME/.bashrc
-- echo 'export BUCKET="marin-us-central1"' >> $HOME/.bashrc
+- mkdir -p /tmp/marin-cache
 # cf https://github.com/ray-project/ray/blob/0bc6ec86ffd0fc0d4e43fb339ffe0ac03ee5531b/python/ray/autoscaler/_private/constants.py#L66
 # this is set to 30s by default, which is much too short for our use case
-- echo 'export AUTOSCALER_HEARTBEAT_TIMEOUT_S=600' >> $HOME/.bashrc
-- echo 'export TPU_MIN_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_STDERR_LOG_LEVEL=3' >> $HOME/.bashrc
-- echo 'export TPU_LOG_DIR=disabled' >> $HOME/.bashrc
-- gcsfuse --implicit-dirs --client-protocol grpc --cache-dir /dev/shm --file-cache-max-size-mb 160000 --only-dir gcsfuse_mount $BUCKET /opt/gcsfuse_mount || true
 
 worker_setup_commands:
 # delete any old ray session data
