Description
I am training a Keras/TensorFlow ResNet50 model on my Nvidia RTX 4090 GPU. I am using Python 3.10, TF 2.10 (on Windows), Keras 2.10, CUDA 12.5, cuDNN 8.9, and PyCharm as my IDE. The model usually takes around 2 minutes per epoch. However, I have observed that sometimes, without changing any hyperparameters and with exactly the same inputs, an epoch can take up to an hour. This sometimes happens within a single run: the first epoch takes 15 minutes while the remaining epochs take 2 minutes each, or the model runs at normal speed for three epochs and then the fourth epoch takes over half an hour. At the beginning of the code I call tf.config.experimental.set_memory_growth(device, True) and tf.keras.backend.clear_session(). I have checked the GPU usage, and it is the same in the fast and slow cases. I am new to machine learning and Keras, so is there anything I'm missing?
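To pin down exactly which epochs are slow, the per-epoch times could be recorded with something like the sketch below. It is only an illustration: `EpochTimer` and `flag_slow_epochs` are hypothetical names, the 3x-median threshold is an arbitrary assumption, and the class is plain Python so that it runs without TensorFlow. Its method names mirror the `on_epoch_begin` / `on_epoch_end` hooks of Keras callbacks.

```python
import statistics
import time


class EpochTimer:
    """Records wall-clock seconds per epoch (hypothetical helper).

    Method names mirror the Keras callback hooks; this sketch is plain
    Python so it can be run and tested without TensorFlow installed.
    """

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable clock, useful for testing
        self._start = None
        self.durations = []      # one entry per finished epoch, in seconds

    def on_epoch_begin(self, epoch, logs=None):
        self._start = self._clock()

    def on_epoch_end(self, epoch, logs=None):
        self.durations.append(self._clock() - self._start)


def flag_slow_epochs(durations, factor=3.0):
    """Return indices of epochs slower than `factor` times the median."""
    if not durations:
        return []
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > factor * median]


# Example with the kind of timings described above (seconds per epoch):
# the first epoch takes 15 minutes, the rest take 2 minutes.
print(flag_slow_epochs([900, 120, 120, 120]))
```

In an actual training run, the class would presumably subclass `tf.keras.callbacks.Callback` and be passed to `model.fit(..., callbacks=[timer])`. Logging a timestamp for each flagged epoch would then make it possible to compare the slow epochs against whatever else the machine was doing at that moment.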