
BUG: validate_data.py ModuleNotFoundError (finetune & tensorflow) #98

@CorentinWicht

Description


Python Version

Python 3.10.12

Pip Freeze

absl-py==2.1.0
annotated-types==0.7.0
astunparse==1.6.3
attrs==24.2.0
beautifulsoup4==4.12.3
blis==0.7.11
bs4==0.0.2
catalogue==2.0.10
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.19.0
confection==0.1.5
cramjam==2.8.3
cymem==2.0.8
docstring_parser==0.16
fastparquet==2024.5.0
filelock==3.15.4
finetune==0.10.0
fire==0.6.0
flatbuffers==24.3.25
fsspec==2024.9.0
ftfy==6.2.3
gast==0.6.0
google-pasta==0.2.0
grpcio==1.66.1
h5py==3.11.0
huggingface-hub==0.24.6
idna==3.8
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
keras==3.5.0
langcodes==3.4.0
language_data==1.2.0
libclang==18.1.1
lxml==5.3.0
marisa-trie==1.2.0
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mistral_common==1.3.4
ml-dtypes==0.4.0
mpmath==1.3.0
murmurhash==1.0.10
namex==0.0.8
networkx==3.3
nltk==3.9.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
opt-einsum==3.3.0
optree==0.12.1
packaging==24.1
pandas==2.2.2
preshed==3.0.9
protobuf==4.25.4
psutil==5.7.0
pyarrow==17.0.0
pydantic==2.9.0
pydantic_core==2.23.2
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
rich==13.8.0
rpds-py==0.20.0
safetensors==0.4.5
scikit-learn==1.5.1
scipy==1.14.1
sentencepiece==0.2.0
shellingham==1.5.4
simple-parsing==0.1.6
six==1.16.0
smart-open==7.0.4
soupsieve==2.6
spacy==3.7.6
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
sympy==1.13.2
tabulate==0.8.10
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorflow==2.17.0
tensorflow-addons==0.16.1
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.37.1
termcolor==2.4.0
thinc==8.2.5
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.13.3
torch==2.2.0
tqdl==0.0.4
tqdm==4.66.5
transformers==4.25.1
triton==2.2.0
typeguard==4.3.0
typer==0.12.5
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wasabi==1.1.3
wcwidth==0.2.13
weasel==0.4.1
Werkzeug==3.0.4
wrapt==1.16.0
xformers==0.0.24

Reproduction Steps

  1. Follow the installation instructions from the README
  2. Run the validation script from a Python virtual environment with python ./mistral-finetune/utils/validate_data.py --train_yaml ./mistral-finetune/example/7B.yaml (the setup is sketched just below)
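
For reference, step 1 on my side amounts to roughly the following (a virtual environment plus a clone of the repository and its requirements.txt, as far as I understand the README; the exact paths and clone URL are from my own setup):

  python3.10 -m venv .venv
  source .venv/bin/activate
  git clone https://github.com/mistralai/mistral-finetune.git
  pip install -r ./mistral-finetune/requirements.txt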

Expected Behavior

According to the README, it should return "a summary of the data input and training parameters", such as:

Train States
 --------------------
{
   "expected": {
       "eta": "00:52:44",
       "data_tokens": 25169147,
       "train_tokens": 131072000,
       "epochs": "5.21",
       "max_steps": 500,
       "data_tokens_per_dataset": {
           "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "25169147.0"
       },
       "train_tokens_per_dataset": {
           "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "131072000.0"
       },
       "epochs_per_dataset": {
           "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "5.2"
       }
   },
}

Additional Context

The script returns the following error:

Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/./mistral-finetune/utils/validate_data.py", line 16, in <module>
    from finetune.args import TrainArgs
ModuleNotFoundError: No module named 'finetune'

When installing the latest 'finetune' release (0.10.0), the script fails with a second error, also about a missing package:

Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/./mistral-finetune/utils/validate_data.py", line 16, in <module>
    from finetune.args import TrainArgs
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/finetune/__init__.py", line 12, in <module>
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
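
Note that the path in this traceback (.venv/.../site-packages/finetune/__init__.py) shows the import is now resolving to the PyPI 'finetune' package rather than anything in this repository; this can be double-checked with something like:

  python -c "import finetune; print(finetune.__file__)"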

Suggested Solutions

Installing the second missing package, tensorflow 2.17.0, should in principle fix the problem, but it triggers a pip dependency conflict:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
finetune 0.10.0 requires numpy<1.24.0,>=1.18.4, but you have numpy 1.26.4 which is incompatible.

Since finetune 0.10.0 requires numpy<1.24.0 while tensorflow 2.17.0 needs numpy>=1.23.5 (ending up with numpy 1.26.4 installed), I really don't see how I could make your script work.
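
The only workaround I can think of, though I have not been able to verify it, is that the 'finetune' module in the traceback might actually be this repository's own finetune/ package rather than anything from PyPI. If that is the case, running the script from the repository root, or pointing PYTHONPATH at the clone, might let the import resolve without installing the PyPI 'finetune' at all, e.g.:

  cd ./mistral-finetune
  PYTHONPATH=. python ./utils/validate_data.py --train_yaml ./example/7B.yaml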

Any idea?

Best,

C.

Follow-up:

The command torchrun --nproc-per-node 8 --master_port $RANDOM -m train example/7B.yaml seems to fail as well due to a missing package:

[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING]
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] *****************************************
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] *****************************************
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
[2024-09-06 16:02:22,013] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 422322) of binary: /cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python
Traceback (most recent call last):
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train FAILED
------------------------------------------------------------

And when trying to install 'train-0.0.5', I got another pip dependency conflict involving the same packages as above:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.2.5 requires numpy<2.0.0,>=1.19.0; python_version >= "3.9", but you have numpy 2.1.1 which is incompatible.
tensorflow 2.17.0 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.1.1 which is incompatible.
finetune 0.10.0 requires numpy<1.24.0,>=1.18.4, but you have numpy 2.1.1 which is incompatible.
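
Similarly, my suspicion (again unverified) is that 'train' here refers to train.py at the root of the mistral-finetune repository rather than the PyPI 'train' package, so launching torchrun from inside the clone might let -m train resolve without installing anything extra:

  cd ./mistral-finetune
  torchrun --nproc-per-node 8 --master_port $RANDOM -m train example/7B.yaml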
