
Test new tf on axlearn for EFA#2116

Open
Steboss wants to merge 27 commits into main from sbosisio/test-axlearn-new-tf

Conversation

@Steboss

@Steboss Steboss commented May 20, 2026

Contributor

No description provided.

@copy-pr-bot

copy-pr-bot Bot commented May 20, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Steboss

Steboss commented May 20, 2026

Contributor Author

/ok to test 7665ae4

@Steboss

Steboss commented May 20, 2026

Contributor Author

/ok to test 7665ae4

@Steboss

Steboss commented May 20, 2026

Contributor Author

/ok to test 7665ae4

@Steboss

Steboss commented May 20, 2026

Contributor Author

/ok to test e6dab45

@Steboss

Steboss commented May 20, 2026

Contributor Author

/ok to test bac0bf6

@Steboss Steboss requested review from aybchan and olupton May 20, 2026 13:33
olupton previously approved these changes May 20, 2026

@olupton olupton left a comment

Collaborator

LGTM if the test job passes

@Steboss

Steboss commented May 20, 2026

Contributor Author

/ok to test 9523b6d

Comment thread .github/workflows/_ci.yaml Outdated
Comment on lines 549 to 550

@aybchan aybchan May 20, 2026

Member

Suggested change

Remove for MaxText on EKS as well

@Steboss

Steboss commented May 20, 2026

Contributor Author

/ok to test c2a0937

@Steboss Steboss requested a review from aybchan May 21, 2026 09:33
fsdp: 2
tensor-parallel: 2
envs: |-
  OFI_NCCL_PROTOCOL=SENDRECV

Member

I observed a significant regression for MaxText, from > 260 TFLOP/s/device to < 20 TFLOP/s/device, without the SENDRECV protocol. Is this expected?

Contributor Author

Ah wait, I forgot to make sure MaxText can use the latest TF. Working on it.
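
For reference, the setting under discussion can be checked with a small sketch like the one below. It assumes the `envs` block is a plain list of `KEY=VALUE` lines, as in the snippet above; `parse_envs` is a hypothetical helper, not part of the workflow. It also works out the lower bound on the reported slowdown from the two figures quoted in this thread.

```python
def parse_envs(block: str) -> dict[str, str]:
    """Parse KEY=VALUE lines (hypothetical helper for the `envs` block)."""
    env: dict[str, str] = {}
    for raw in block.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            env[key.strip()] = value.strip()
    return env


with_sendrecv = parse_envs("OFI_NCCL_PROTOCOL=SENDRECV")
without_sendrecv = parse_envs("")

print(with_sendrecv.get("OFI_NCCL_PROTOCOL", "<unset>"))
print(without_sendrecv.get("OFI_NCCL_PROTOCOL", "<unset>"))

# Lower bound on the reported slowdown: > 260 down to < 20 TFLOP/s/device.
min_slowdown = 260 / 20
print(f"slowdown is more than {min_slowdown:.0f}x")
```

With the quoted numbers, removing the variable costs more than an order of magnitude in throughput, which is why the setting is worth confirming per workload before dropping it.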

@Steboss

Steboss commented May 29, 2026

Contributor Author

/ok to test 7b0d42b

@Steboss

Steboss commented May 29, 2026

Contributor Author

/ok to test c56948a

@Steboss Steboss marked this pull request as draft May 29, 2026 12:40
@Steboss

Steboss commented Jun 9, 2026

Contributor Author

/ok to test 473c130

@Steboss

Steboss commented Jun 9, 2026

Contributor Author

/ok to test 0e07e2f

@Steboss

Steboss commented Jun 10, 2026

Contributor Author

/ok to test e376b7a

@Steboss

Steboss commented Jun 10, 2026

Contributor Author

/ok to test 71cb8bc

@Steboss

Steboss commented Jun 10, 2026

Contributor Author

/ok to test 71cb8bc

@Steboss

Steboss commented Jun 10, 2026

Contributor Author

/ok to test 16ab71d

@Steboss

Steboss commented Jun 10, 2026

Contributor Author

/ok to test 9579fca

@Steboss

Steboss commented Jun 10, 2026

Contributor Author

/ok to test b3a0666

@Steboss

Steboss commented Jun 10, 2026

Contributor Author

/ok to test 7e90ea5

@Steboss

Steboss commented Jun 10, 2026

Contributor Author

/ok to test b3595d1

@Steboss

Steboss commented Jun 11, 2026

Contributor Author

/ok to test 968a62d

@Steboss

Steboss commented Jun 11, 2026

Contributor Author

/ok to test 0663fa9

@Steboss

Steboss commented Jun 11, 2026

Contributor Author

/ok to test bcc45cd

@Steboss

Steboss commented Jun 12, 2026

Contributor Author

/ok to test f9c9dc0

@Steboss

Steboss commented Jun 15, 2026

Contributor Author

It looks like updating to tf-nightly is quite hard:

  • With the current setup, tensorflow-text raises this error when running AXLearn fuji models:
Traceback (most recent call last):
  File "/usr/local/bin/fuji-train-perf.py", line 9, in <module>
    from axlearn.experiments.text.gpt import c4_trainer
  File "/opt/axlearn/axlearn/experiments/text/gpt/__init__.py", line 5, in <module>
    from axlearn.experiments.text.gpt import (  # pytype: disable=pyi-error
  File "/opt/axlearn/axlearn/experiments/text/gpt/c4_trainer.py", line 49, in <module>
    from axlearn.common.input_lm import lm_text_preprocessor
  File "/opt/axlearn/axlearn/common/input_lm.py", line 12, in <module>
    import seqio
  File "/usr/local/lib/python3.12/dist-packages/seqio/__init__.py", line 19, in <module>
    from seqio.dataset_providers import *
  File "/usr/local/lib/python3.12/dist-packages/seqio/dataset_providers.py", line 40, in <module>
    from seqio import metrics as metrics_lib
  File "/usr/local/lib/python3.12/dist-packages/seqio/metrics.py", line 27, in <module>
    from seqio import utils
  File "/usr/local/lib/python3.12/dist-packages/seqio/utils.py", line 31, in <module>
    from seqio.vocabularies import Vocabulary
  File "/usr/local/lib/python3.12/dist-packages/seqio/vocabularies.py", line 26, in <module>
    import tensorflow_text as tf_text
  File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/__init__.py", line 21, in <module>
    from tensorflow_text.python import keras
  File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/__init__.py", line 21, in <module>
    from tensorflow_text.python.keras.layers import *
  File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/layers/__init__.py", line 22, in <module>
    from tensorflow_text.python.keras.layers.tokenization_layers import *
  File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/keras/layers/tokenization_layers.py", line 24, in <module>
    from tensorflow_text.python.ops import unicode_script_tokenizer
  File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/__init__.py", line 25, in <module>
    from tensorflow_text.python.ops.bert_tokenizer import BertTokenizer
  File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/bert_tokenizer.py", line 28, in <module>
    from tensorflow_text.python.ops import regex_split_ops
  File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/regex_split_ops.py", line 23, in <module>
    gen_regex_split_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_regex_split_ops.so'))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.12/dist-packages/tensorflow_text/python/ops/_regex_split_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder10SetShapeFnESt8functionIFN4absl12lts_202501276StatusEPNS_15shape_inference16InferenceContextEEE
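
The missing symbol demangles to `tensorflow::OpDefBuilder::SetShapeFn(std::function<absl::lts_20250127::Status(tensorflow::shape_inference::InferenceContext*)>)`, so `_regex_split_ops.so` expects a TensorFlow built against the Abseil LTS namespace `lts_20250127`. A likely reading, though not confirmed here, is that the installed tf-nightly exports this function under a different Abseil namespace. The sketch below only pulls the LTS tag out of a mangled name; `absl_lts_tags` is a hypothetical helper.

```python
import re

SYMBOL = (
    "_ZN10tensorflow12OpDefBuilder10SetShapeFnESt8functionIFN4absl"
    "12lts_202501276StatusEPNS_15shape_inference16InferenceContextEEE"
)


def absl_lts_tags(mangled: str) -> list[str]:
    """Return the Abseil LTS date tags embedded in an Itanium-mangled symbol."""
    # Itanium mangling prefixes each name with its length, so the inline
    # namespace "lts_20250127" (12 characters) appears as "12lts_20250127".
    return sorted(set(re.findall(r"12lts_(\d{8})", mangled)))


print(absl_lts_tags(SYMBOL))
```

Running the same helper over the symbols exported by the installed TensorFlow shared library (for example, the output of `nm -D`) would show whether the two sides agree on the tag.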

Here is a snapshot of the tensorflow and tensorflow-related packages:

Name: tensorflow-datasets
Version: 4.9.10
Summary: tensorflow/datasets is a library of datasets ready to use with TensorFlow.
Home-page: https://github.com/tensorflow/datasets
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: absl-py, array_record, dm-tree, etils, immutabledict, numpy, promise, protobuf, psutil, pyarrow, requests, simple_parsing, tensorflow-metadata, termcolor, toml, tqdm, wrapt
Required-by: seqio
---
Name: tensorflow-text
Version: 2.20.1
Summary: TF.Text is a TensorFlow library of text related ops, modules, and subgraphs.
Home-page: http://github.com/tensorflow/text
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: tensorflow
Required-by: seqio
---
Name: protobuf
Version: 6.33.6
Summary: 
Home-page: https://developers.google.com/protocol-buffers/
Author: protobuf@googlegroups.com
Author-email: protobuf@googlegroups.com
License: 3-Clause BSD License
Location: /usr/local/lib/python3.12/dist-packages
Requires: 
Required-by: google-api-core, google-cloud-storage-control, googleapis-common-protos, grain, grpc-google-iam-v1, grpcio-status, nsys-jax, orbax-checkpoint, proto-plus, tensorboard, tensorboardX, tensorflow-datasets, tensorflow-metadata, tf_nightly_cpu, xprof
---
Name: tf_nightly_cpu
Version: 2.22.0.dev20260530
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: absl-py, astunparse, flatbuffers, gast, google_pasta, grpcio, h5py, keras-nightly, libclang, ml_dtypes, numpy, opt_einsum, packaging, protobuf, requests, setuptools, six, termcolor, typing_extensions, wrapt
Required-by: 

There is no good prebuilt tensorflow-text match for the current tf-nightly.

  • I tried installing tensorflow_text-nightly, but then we get:
Traceback (most recent call last):
  File "/usr/local/bin/fuji-train-perf.py", line 9, in <module>
    from axlearn.experiments.text.gpt import c4_trainer
  File "/opt/axlearn/axlearn/experiments/text/gpt/__init__.py", line 5, in <module>
    from axlearn.experiments.text.gpt import (  # pytype: disable=pyi-error
  File "/opt/axlearn/axlearn/experiments/text/gpt/c4_trainer.py", line 49, in <module>
    from axlearn.common.input_lm import lm_text_preprocessor
  File "/opt/axlearn/axlearn/common/input_lm.py", line 12, in <module>
    import seqio
  File "/usr/local/lib/python3.12/dist-packages/seqio/__init__.py", line 19, in <module>
    from seqio.dataset_providers import *
  File "/usr/local/lib/python3.12/dist-packages/seqio/dataset_providers.py", line 41, in <module>
    from seqio import metrics as metrics_lib
  File "/usr/local/lib/python3.12/dist-packages/seqio/metrics.py", line 27, in <module>
    from seqio import utils
  File "/usr/local/lib/python3.12/dist-packages/seqio/utils.py", line 29, in <module>
    from seqio.vocabularies import Vocabulary
  File "/usr/local/lib/python3.12/dist-packages/seqio/vocabularies.py", line 26, in <module>
    import tensorflow_text as tf_text
  File "/usr/local/lib/python3.12/dist-packages/tensorflow_text/__init__.py", line 20, in <module>
    from tensorflow_text.core.pybinds import tflite_registrar
ModuleNotFoundError: No module named 'tensorflow_text.core'
  • I tried building tensorflow-text from source, but we hit a Bazel toolchain error:
Starting local Bazel server and connecting to it...
INFO: Reading 'startup' options from /opt/text/.bazelrc: --windows_enable_symlinks
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'run' from /opt/text/.bazelrc:
  Inherited 'common' options: --announce_rc --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility --noenable_bzlmod --noincompatible_enable_cc_toolchain_resolution --noincompatible_enable_android_toolchain_resolution --experimental_repo_remote_exec --java_runtime_version=remotejdk_21
INFO: Reading rc options for 'run' from /opt/text/.bazelrc:
  Inherited 'build' options: --repo_env=ML_WHEEL_TYPE=snapshot --repo_env=ML_WHEEL_BUILD_DATE= --repo_env=ML_WHEEL_VERSION_SUFFIX= --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --repo_env=USE_PYWRAP_RULES=True --copt=-DGRPC_BAZEL_BUILD --host_copt=-DGRPC_BAZEL_BUILD --action_env=GRPC_BAZEL_RUNTIME=1 --repo_env=PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=upb --action_env=PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=upb --repo_env=RULES_PYTHON_ENABLE_PYSTAR=0 --define=grpc_no_ares=true --features=-force_no_whole_archive --host_features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --@rules_python//python/config_settings:precompile=force_disabled
INFO: Found applicable config definition build:short_logs in file /opt/text/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /opt/text/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:linux in file /opt/text/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --define=PREFIX=/usr --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /opt/text/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
Computing main repo mapping: 
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/py/python_repo.bzl:82:14: !!!Using pywrap rules instead of directly creating .so objects!!!
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/py/python_repo.bzl:87:10: 
=============================
Hermetic Python configuration:
Version: "3.12"
Kind: ""
Interpreter: "default" (provided by rules_python)
Requirements_lock label: "@//oss_scripts/pip_package:requirements_lock_3_12.txt"
=====================================
Computing main repo mapping: 
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/repo.bzl:132:14: 
Warning: skipping import of repository 'icu' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/repo.bzl:132:14: 
Warning: skipping import of repository 'build_bazel_apple_support' because it already exists.
DEBUG: /root/.cache/bazel/_bazel_root/b4f968c26af22407e6171bb52049003f/external/org_tensorflow/third_party/repo.bzl:132:14: 
Warning: skipping import of repository 'pybind11' because it already exists.
Computing main repo mapping: 
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/protocolbuffers/protobuf/archive/refs/tags/v5.28.3.zip failed: class java.io.FileNotFoundException GET returned 404 Not Found
Computing main repo mapping: 
Computing main repo mapping: 
ERROR: Error computing the main repository mapping: Label '@@rules_ml_toolchain//cc/deps:cc_toolchain_deps.bzl' is invalid because 'cc/deps' is not a package; perhaps you meant to put the colon here: '@@rules_ml_toolchain//:cc/deps/cc_toolchain_deps.bzl'?
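
In Bazel, the part of a label between `//` and `:` must name a package, that is, a directory containing a `BUILD` or `BUILD.bazel` file. The error therefore says that the fetched `rules_ml_toolchain` repository has no such file under `cc/deps`, which would be consistent with a version of that repository that does not match what the tensorflow-text workspace expects (an assumption, not verified here). The sketch below splits the label and checks a checkout for the package. The helper names and the repository path are hypothetical.

```python
from pathlib import Path


def split_label(label: str) -> tuple[str, str, str]:
    """Split '@@repo//pkg/path:target' into (repo, package, target)."""
    repo, _, rest = label.partition("//")
    package, _, target = rest.partition(":")
    return repo.lstrip("@"), package, target


def is_package(repo_root: Path, package: str) -> bool:
    """A directory is a Bazel package only if it holds BUILD or BUILD.bazel."""
    pkg_dir = repo_root / package
    return any((pkg_dir / name).is_file() for name in ("BUILD", "BUILD.bazel"))


label = "@@rules_ml_toolchain//cc/deps:cc_toolchain_deps.bzl"
repo, package, target = split_label(label)
print(repo, package, target)

# Hypothetical location of the fetched repository; point this at the real
# `external/` directory under the Bazel output base before relying on it.
repo_root = Path("/path/to/bazel/output_base/external") / repo
print(is_package(repo_root, package))
```

If the check returns False on the real checkout, comparing the pinned `rules_ml_toolchain` revision against the one used by the TensorFlow commit being built would be a reasonable next step.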


3 participants