Reaching server limits in olmo-data.org? #886

@spaidataiga

Description

🐛 Describe the bug

When trying to train olmo2-1B from a checkpoint, I have been seeing a very poor and inconsistent connection to olmo-data.org over the last few weeks.

I wasn't sure whether this was a rate limit imposed by my own cluster, but I continue to get these network errors even after reducing num_workers. The errors do not occur consistently at specific data paths. Often, when I see an error, I curl the same path from a different computer and also get an error for a few seconds, after which the link suddenly becomes reachable again.
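To put numbers on how long these windows last, the manual curl check described above could be automated with a small probe. This is only a sketch using the standard library: `head_ok`, `probe`, and `longest_outage` are hypothetical helper names, not part of OLMo, and the URL, attempt count, and interval are placeholders to be replaced with a real data path.

```python
import time
import urllib.request
from typing import Callable, List, Tuple


def head_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` completes without an error."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except OSError:
        # URLError, HTTPError, timeouts and "Network is unreachable"
        # are all subclasses of OSError.
        return False


def probe(
    url: str,
    attempts: int,
    interval: float,
    check: Callable[[str], bool] = head_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[float, bool]]:
    """Check `url` `attempts` times, waiting `interval` seconds between checks."""
    results = []
    for i in range(attempts):
        results.append((time.time(), check(url)))
        if i < attempts - 1:
            sleep(interval)
    return results


def longest_outage(results: List[Tuple[float, bool]]) -> int:
    """Length of the longest run of consecutive failed checks."""
    longest = current = 0
    for _, ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest
```

Running `probe(url, attempts=60, interval=1.0)` from two different networks at the same time and comparing `longest_outage` for each would help tell a server-side problem apart from a local routing one.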

This issue might be related to the slow dataloader loading speed reported in #869 and #864.

Has the server been facing a lot of requests recently? Or do you have another idea as to what might be causing this issue?

CRITICAL Uncaught ConnectionError: Caught ConnectionError in DataLoader worker process 0. Original Traceback (most recent call last):
  File "<python_lib_path>/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "<python_lib_path>/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "<python_lib_path>/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<python_lib_path>/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
  File "<python_lib_path>/urllib3/connectionpool.py", line 488, in _make_request
    raise new_e
  File "<python_lib_path>/urllib3/connectionpool.py", line 464, in _make_request
    self._validate_conn(conn)
  File "<python_lib_path>/urllib3/connectionpool.py", line 1093, in _validate_conn
    conn.connect()
  File "<python_lib_path>/urllib3/connection.py", line 753, in connect
    self.sock = sock = self._new_conn()
  File "<python_lib_path>/urllib3/connection.py", line 213, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<python_lib_path>/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "<python_lib_path>/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
  File "<python_lib_path>/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='olmo-data.org', port=443): Max retries exceeded with url: /preprocessed/dclm/.../part-075-00001.npy
(Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
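As a client-side stopgap, transient failures like the one in this traceback could be retried with exponential backoff instead of crashing the DataLoader worker. The following is a minimal sketch under stated assumptions, not how OLMo handles this: `retry_with_backoff` is a hypothetical helper, the delays are arbitrary, and it relies on the fact that `requests.exceptions.ConnectionError` derives from `OSError`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on OSError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except OSError:
            # Give up and re-raise once the attempt budget is used up.
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Any zero-argument callable that performs the fetch can be wrapped this way. Since the outages I observe last only a few seconds, a handful of attempts would probably be enough to ride them out, though this only masks the symptom.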

Versions

Python 3.10.12
ai2-olmo==0.6.0
ai2-olmo-core==0.1.0
aiohappyeyeballs==2.6.1
aiohttp==3.12.15
aiosignal==1.4.0
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
apt-clone==0.2.1
asttokens==3.0.0
async-timeout==5.0.1
attrs==21.2.0
Automat==20.2.0
Babel==2.8.0
bcrypt==3.2.0
blinker==1.4
boto3==1.40.3
botocore==1.40.3
cached_path==1.7.3
cachetools==5.5.2
certifi==2020.6.20
chardet==4.0.0
charset-normalizer==3.4.2
click==8.0.3
cloud-init==25.1.2
colorama==0.4.4
comm==0.2.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
contourpy==1.3.2
cryptography==3.4.8
cycler==0.12.1
datasets==4.0.0
dbus-python==1.2.18
debugpy==1.8.15
decorator==5.2.1
diceware==0.9.6
dill==0.3.8
distro==1.7.0
distro-info==1.1+ubuntu0.2
dnspython==2.1.0
einops==0.8.1
exceptiongroup==1.3.0
executing==2.2.0
filelock==3.18.0
flash_attn==2.8.2
fonttools==4.59.2
frozenlist==1.7.0
fsspec==2025.3.0
gitdb==4.0.12
GitPython==3.1.45
google-api-core==2.25.1
google-auth==2.40.3
google-cloud-core==2.4.3
google-cloud-storage==2.19.0
google-crc32c==1.7.1
google-resumable-media==2.7.2
googleapis-common-protos==1.70.0
gpg==1.16.0
greenlet==1.1.2
gyp==0.1
hf-xet==1.1.7
httplib2==0.20.2
huggingface-hub==0.34.3
hyperlink==21.0.0
idna==3.3
importlib-metadata==4.6.4
importlib_resources==6.5.2
incremental==21.3.0
ipykernel==6.30.1
ipython==8.37.0
jedi==0.19.2
jeepney==0.7.1
Jinja2==3.0.3
jmespath==1.0.1
joblib==1.5.1
jsonpatch==1.32
jsonpointer==2.0
jsonschema==3.2.0
jupyter_client==8.6.3
jupyter_core==5.8.1
keyring==23.5.0
kiwisolver==1.4.9
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lightning-utilities==0.15.1
Markdown==3.3.6
markdown-it-py==3.0.0
MarkupSafe==2.0.1
matplotlib==3.10.6
matplotlib-inline==0.1.7
mdurl==0.1.2
mercurial==6.1.1
more-itertools==8.10.0
mpmath==1.3.0
msgpack==1.0.3
multidict==6.6.3
multiprocess==0.70.16
nest-asyncio==1.6.0
netifaces==0.11.0
networkx==3.4.2
numpy==2.2.6
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvrtc-cu12==12.6.77
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.5.1.17
nvidia-cufft-cu12==11.3.0.4
nvidia-cufile-cu12==1.11.1.6
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.6.85
nvidia-nvtx-cu12==12.6.77
oauthlib==3.2.0
omegaconf==2.3.0
packaging==25.0
pandas==2.3.1
parso==0.8.4
pexpect==4.8.0
pillow==11.3.0
platformdirs==4.3.8
prompt_toolkit==3.0.51
propcache==0.3.2
proto-plus==1.26.1
protobuf==6.31.1
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==21.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.1
pydantic==2.11.7
pydantic_core==2.33.2
Pygments==2.19.2
PyGObject==3.42.1
PyHamcrest==2.0.2
PyJWT==2.3.0
pynvim==0.4.2
pyOpenSSL==21.0.0
pyparsing==2.4.7
pyrsistent==0.18.1
pyserial==3.5
python-apt==2.4.0+ubuntu4
python-dateutil==2.9.0.post0
python-debian==0.1.43+ubuntu1.1
python-magic==0.4.24
pytz==2022.1
PyYAML==5.4.1
pyzmq==27.0.1
regex==2025.7.34
requests==2.32.4
requests-toolbelt==0.9.1
rich==13.9.4
rsa==4.9.1
s3transfer==0.13.1
safetensors==0.6.1
scikit-learn==1.7.1
scipy==1.15.3
screen-resolution-extra==0.0.0
SecretStorage==3.3.1
sentry-sdk==2.34.1
service-identity==18.1.0
six==1.16.0
smmap==5.0.2
sos==4.8.2
ssh-import-id==5.11
stack-data==0.6.3
sympy==1.14.0
systemd-python==234
threadpoolctl==3.6.0
tokenizers==0.21.4
torch==2.7.1
torchaudio==2.7.1
torchmetrics==1.8.0
torchvision==0.22.1
tornado==6.5.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.55.0
triton==3.3.1
Twisted==22.1.0
typing-inspection==0.4.1
typing_extensions==4.14.1
tzdata==2025.2
ubuntu-drivers-common==0.0.0
ubuntu-pro-client==8001
ufw==0.36.1
unattended-upgrades==0.1
urllib3==2.5.0
vboxapi==1.0
wadllib==1.3.6
wandb==0.21.0
wcwidth==0.2.13
xkit==0.0.0
xxhash==3.5.0
yarl==1.20.1
zipp==1.0.0
zope.interface==5.4.0

Labels

type/bug: An issue about a bug
