Hello!
I'm encountering an error when running the code, consistently on both the MNIST and CIFAR-10 datasets. Regardless of the configuration I use (including the config files in the train_configs directory), the run prints "Layer xx is NaN!" for every layer. I also receive the warning "WARNING:tensorboardX.x2num:NaN or Inf found in input tensor."
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
round= 12 test_accuracy= 0.1046875 adv_success= 0 test_loss= nan duration= 2.9140660762786865
DEBUG:root:Memory info: 6584934400
Here is my mnist_setup.yml file for the MNIST dataset:
---
client:
  benign_training:
    batch_size: 64
    learning_rate: 0.02
    num_epochs: 2
    optimizer: SGD
    step_decay: true
  debug_client_training: false
  optimized_training: true
  # clip:
  #   type: l2
  #   value: 10
  model_name: lenet5_mnist
  # quantization:
  #   type: probabilistic
  #   bits: 8
  #   frac: 7
dataset:
  # augment_data: false
  data_distribution: IID
  dataset: mnist
environment:
  experiment_name: lenet5_mnist
  # load_model: ../models/resnet18.h5
  num_clients: 48
  num_malicious_clients: 0
  num_selected_clients: 6
  use_config_dir: true
  print_every: 1
job:
  cpu_cores: 20
  cpu_mem_per_core: 4096
  gpu_memory_min: 10240
  minutes: 10
  use_gpu: 1
server:
  aggregator:
    name: FedAvg
  global_learning_rate: 1
  num_rounds: 35
  num_test_batches: 20
...
And this is my mnist_setup.yml file for the CIFAR-10 dataset:
---
client:
  benign_training:
    batch_size: 64
    learning_rate: 0.02
    num_epochs: 2
    optimizer: SGD
    step_decay: true
  debug_client_training: false
  optimized_training: true
  # clip:
  #   type: l2
  #   value: 10
  model_name: lenet5_cifar
  # quantization:
  #   type: probabilistic
  #   bits: 8
  #   frac: 7
dataset:
  # augment_data: false
  data_distribution: IID
  dataset: cifar10
environment:
  experiment_name: lenet5_cifar
  # load_model: /home/hujia/fl-analysis/models/resnet18.h5
  num_clients: 48
  num_malicious_clients: 0
  num_selected_clients: 6
  use_config_dir: true
  print_every: 1
job:
  cpu_cores: 20
  cpu_mem_per_core: 4096
  gpu_memory_min: 10240
  minutes: 10
  use_gpu: 1
server:
  aggregator:
    name: FedAvg
  global_learning_rate: 1
  num_rounds: 35
  num_test_batches: 20
...
I suspect that the issue might stem from an incorrect version of a package in my environment configuration, but what confuses me is that the code runs correctly with the Shakespeare dataset.
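To make the version question easier to check, the installed package versions can be listed alongside the report. This is a rough sketch using only the standard library (`importlib.metadata`, Python 3.8+); `installed_versions` is a hypothetical helper, and the package names in the example call are assumptions about what the environment contains, apart from tensorboardX, which appears in the warning above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package list; adjust it to match the project's requirements file.
    for name, version in installed_versions(["tensorflow", "numpy", "tensorboardX"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing this output against the versions pinned by the repository would confirm or rule out a mismatch.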