Layer xx is NaN!

Hello! 
I'm encountering an error when running the code, consistently across both the MNIST and CIFAR-10 datasets. Regardless of the configuration I use (including the config files in the `train_configs` directory), it reports "Layer xx is NaN!" for every layer. I also receive the warning "WARNING:tensorboardX.x2num:NaN or Inf found in input tensor."
```
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
round= 12       test_accuracy= 0.1046875        adv_success= 0  test_loss= nan  duration= 2.9140660762786865
DEBUG:root:Memory info: 6584934400
```
Here is my `mnist_setup.yml` file for the MNIST dataset:
```
---
client:
    benign_training:
        batch_size: 64
        learning_rate: 0.02
        num_epochs: 2
        optimizer: SGD
        step_decay: true
    debug_client_training: false

    optimized_training: true
    # clip:
    #    type: l2
    #    value: 10
    model_name: lenet5_mnist
#    quantization:
#        type: probabilistic
#        bits: 8
#        frac: 7
dataset:
    # augment_data: false
    data_distribution: IID
    dataset: mnist
environment:
    experiment_name: lenet5_mnist
    # load_model: ../models/resnet18.h5
    num_clients: 48
    num_malicious_clients: 0
    num_selected_clients: 6
    use_config_dir: true
    print_every: 1
job:
    cpu_cores: 20
    cpu_mem_per_core: 4096
    gpu_memory_min: 10240
    minutes: 10
    use_gpu: 1
server:
    aggregator:
        name: FedAvg
    global_learning_rate: 1
    num_rounds: 35
    num_test_batches: 20
...
```
And this is my `mnist_setup.yml` file for the CIFAR-10 dataset:
```
---
client:
    benign_training:
        batch_size: 64
        learning_rate: 0.02
        num_epochs: 2
        optimizer: SGD
        step_decay: true
    debug_client_training: false

    optimized_training: true
    # clip:
    #    type: l2
    #    value: 10
    model_name: lenet5_cifar
#    quantization:
#        type: probabilistic
#        bits: 8
#        frac: 7
dataset:
    # augment_data: false
    data_distribution: IID
    dataset: cifar10
environment:
    experiment_name: lenet5_cifar
    # load_model: /home/hujia/fl-analysis/models/resnet18.h5
    num_clients: 48
    num_malicious_clients: 0
    num_selected_clients: 6
    use_config_dir: true
    print_every: 1
job:
    cpu_cores: 20
    cpu_mem_per_core: 4096
    gpu_memory_min: 10240
    minutes: 10
    use_gpu: 1
server:
    aggregator:
        name: FedAvg
    global_learning_rate: 1
    num_rounds: 35
    num_test_batches: 20
...
```
I suspect that the issue might stem from an incorrect version of a package in my environment configuration, but what confuses me is that the code runs correctly with the Shakespeare dataset.
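To narrow this down, a check along the following lines could show which weight arrays first become non-finite. This is only a minimal sketch: `find_nonfinite_layers` is a hypothetical helper, not part of the repository, and it assumes the per-layer weights are available as a list of NumPy arrays (for example from Keras `model.get_weights()`). The arrays below are synthetic stand-ins.

```python
import numpy as np


def find_nonfinite_layers(weights):
    """Return the indices of arrays that contain NaN or Inf.

    `weights` is assumed to be a list of NumPy arrays, one per layer
    (hypothetical input; not taken from the repository code).
    """
    return [i for i, w in enumerate(weights) if not np.all(np.isfinite(w))]


# Synthetic arrays standing in for per-layer weights.
weights = [
    np.ones((2, 2)),           # finite
    np.array([0.5, np.nan]),   # contains NaN
    np.array([np.inf]),        # contains Inf
]

bad_layers = find_nonfinite_layers(weights)
print("Non-finite layers:", bad_layers)
```

Running such a check on the global model before and after the first aggregation round would indicate whether the NaN values are already present at initialization or only appear after training.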
