Description
System information
- Custom Code: YES
- OS: SUSE Linux Enterprise High Performance Computing 15 SP5
- TensorFlow installed from: DOCKER (tensorflow/tensorflow:2.16.1-gpu-jupyter)
- TensorFlow version: v2.16.1-0-g5bc9d26649c 2.16.1
- Python version: 3.11
- GPU model and memory: NVIDIA A100-PCIE-40GB
- Code to reproduce: see below
Describe the problem
I have a model consisting almost entirely of LSTM layers. If I load the same weights into two copies of the model, one instantiated to run on the CPU and one on the GPU, the results differ.
This issue disappears (the GPU results change to match the CPU) if I make any of these changes:
- Move from SLES + NVIDIA A100 + Driver Version: 550.54.14 + CUDA Version: 12.4 to Ubuntu 22.04.4 LTS + NVIDIA V100 + Driver Version: 535.161.07 + CUDA Version: 12.2
- Set `keras.backend.set_floatx('float64')`
- Use Keras 3 instead of tf-keras
In all these cases I'm running the same official Docker image; my only modification has been to install tf-keras==2.16.0 and plotly.
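As a quick sanity check of the environment described above, something like the following can be run inside the container; it is a minimal sketch added for illustration (not part of the original repro) and relies only on standard TensorFlow / tf-keras attributes:

```python
# Sanity check: confirm TF / tf-keras versions, GPU visibility, and the CUDA
# version TensorFlow was built against.
import tensorflow as tf
import tf_keras

print("TensorFlow:", tf.__version__)                   # expect 2.16.1
print("tf-keras:", tf_keras.__version__)                # expect 2.16.0
print("GPUs:", tf.config.list_physical_devices("GPU"))
print("Built with CUDA:", tf.sysconfig.get_build_info().get("cuda_version"))
```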
Standalone code to reproduce the issue.
```
!pip install plotly
!pip install tf-keras==2.16.0
```

```python
import os
import tensorflow as tf
import numpy as np

USE_TF_KERAS = True

if USE_TF_KERAS:
    import tf_keras as keras
    from tf_keras import layers
    from tf_keras import initializers
    from tf_keras import backend as K
else:
    import keras
    from keras import layers
    from keras import initializers
    from keras import backend as K

# Setting float64 as default dtype removes the discrepancy between CPU and GPU!
# keras.backend.set_floatx('float64')

from plotly import graph_objects as go

ROOT_DIR = os.getcwd()

n_time_steps = 800
theta = np.linspace(0, 2 * np.pi, n_time_steps).reshape(1, -1)

np.random.seed(42)
tf.random.set_seed(42)

dummy_input_dict = {
    "input_a": 800
    * np.stack((np.cos(theta), np.sin(theta)), axis=-1).astype(np.float32),
    "input_b": np.random.rand(1, n_time_steps, 5).astype(np.float32),
}


def build_model():
    input_a = layers.Input(shape=(n_time_steps, 2), name="input_a")
    input_b = layers.Input(shape=(n_time_steps, 5), name="input_b")
    x = layers.Concatenate()([input_a, input_b])
    for idx in range(8):
        lstm_layer = layers.LSTM(
            1024,
            kernel_initializer=initializers.RandomNormal(seed=42 + idx),
            recurrent_initializer=initializers.RandomNormal(seed=52 + idx),
            return_sequences=True,
        )
        x = lstm_layer(x)
    y = layers.Dense(1)(x)
    model = keras.Model(inputs=[input_a, input_b], outputs=y)
    return model


def main(device):
    with tf.device(device):
        model = build_model()
        model.load_weights("my_initial_weights.h5")
        features = ["input_a", "input_b"]
        dummy_input = [dummy_input_dict[k] for k in features]
        preds = model.predict(dummy_input)
    return preds


# Save one set of weights, so that we can compare the weights of the two models
with tf.device("/device:CPU:0"):
    model = build_model()
    model.save_weights("my_initial_weights.h5")

tf.config.list_logical_devices()

cpu_preds = main("/device:CPU:0")
gpu_preds = main("/device:GPU:0")

cpu_output = cpu_preds[0, :, 0]
gpu_output = gpu_preds[0, :, 0]

fig = go.Figure()
fig.add_trace(go.Scatter(y=cpu_output, name="CPU"))
fig.add_trace(go.Scatter(y=gpu_output, name="GPU"))
fig.show()
```
Resulting plot (the CPU and GPU traces visibly diverge):
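The divergence can also be quantified numerically rather than just visually; a minimal check added here for illustration, operating on the `cpu_output` / `gpu_output` arrays from the script above:

```python
# Quantify how far the GPU prediction drifts from the CPU prediction.
import numpy as np

abs_diff = np.abs(cpu_output - gpu_output)
print("max abs diff:", abs_diff.max())
print("mean abs diff:", abs_diff.mean())
print("allclose (atol=1e-3):", np.allclose(cpu_output, gpu_output, atol=1e-3))
```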
As mentioned at the beginning, any of the following works around the issue, and the GPU prediction then matches the CPU prediction:
- changing the host to my V100 machine
- uncommenting `# keras.backend.set_floatx('float64')`
- setting `USE_TF_KERAS = False`
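For completeness, the float64 workaround can be verified within the same script; a minimal sketch, assuming the repro above has already been executed (per the report, the final check should then print True):

```python
# Sketch of the float64 workaround: switch the Keras default dtype *before*
# rebuilding the models, then rerun the CPU/GPU comparison from the repro above.
keras.backend.set_floatx("float64")

cpu_preds_f64 = main("/device:CPU:0")
gpu_preds_f64 = main("/device:GPU:0")

# According to the report, the discrepancy disappears with float64.
print(np.allclose(cpu_preds_f64, gpu_preds_f64))
```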
I also reiterate that all of this was run in the official tensorflow/tensorflow:2.16.1-gpu-jupyter container, on both hosts.