Layer xx is NaN! #4

@Jiahu235

Hello!
I'm encountering an error when running the code, and it occurs consistently on both the MNIST and CIFAR-10 datasets. Regardless of the configuration I use (including the config files in the train_configs directory), the run reports "Layer xx is NaN!" for every layer. I also receive the warning "WARNING:tensorboardX.x2num:NaN or Inf found in input tensor."

Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
Layer 0 is NaN!
Layer 1 is NaN!
Layer 2 is NaN!
Layer 3 is NaN!
Layer 4 is NaN!
Layer 5 is NaN!
Layer 6 is NaN!
Layer 7 is NaN!
Layer 8 is NaN!
Layer 9 is NaN!
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
round= 12       test_accuracy= 0.1046875        adv_success= 0  test_loss= nan  duration= 2.9140660762786865
DEBUG:root:Memory info: 6584934400
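To make output like the above easier to compare across runs, the log can be condensed into the set of affected layers and the per-round metrics. This is only a minimal sketch: `summarize_log` is a hypothetical helper, not part of the repository, and it assumes the log lines have exactly the shape shown above.

```python
import re

# Matches lines such as "Layer 3 is NaN!" (format taken from the log above).
LAYER_RE = re.compile(r"^Layer (\d+) is NaN!$")
# Matches "key= value" pairs in the per-round summary line.
FIELD_RE = re.compile(r"(\w+)=\s*(\S+)")


def summarize_log(text):
    """Return (sorted NaN layer indices, list of per-round metric dicts)."""
    nan_layers = set()
    rounds = []
    for raw in text.splitlines():
        line = raw.strip()
        match = LAYER_RE.match(line)
        if match:
            nan_layers.add(int(match.group(1)))
        elif line.startswith("round="):
            # float("nan") parses the literal "nan" printed for test_loss.
            rounds.append({k: float(v) for k, v in FIELD_RE.findall(line)})
    return sorted(nan_layers), rounds


sample = """Layer 0 is NaN!
Layer 1 is NaN!
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
round= 12       test_accuracy= 0.1046875        adv_success= 0  test_loss= nan
"""
layers, rounds = summarize_log(sample)
```

Running this over the full log of a run would show the first round in which `test_loss` becomes NaN, which may help tell whether training diverges gradually or is broken from the first round.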

Here is my mnist_setup.yml file for the MNIST dataset:

---
client:
    benign_training:
        batch_size: 64
        learning_rate: 0.02
        num_epochs: 2
        optimizer: SGD
        step_decay: true
    debug_client_training: false

    optimized_training: true
    # clip:
    #    type: l2
    #    value: 10
    model_name: lenet5_mnist
#    quantization:
#        type: probabilistic
#        bits: 8
#        frac: 7
dataset:
    # augment_data: false
    data_distribution: IID
    dataset: mnist
environment:
    experiment_name: lenet5_mnist
    # load_model: ../models/resnet18.h5
    num_clients: 48
    num_malicious_clients: 0
    num_selected_clients: 6
    use_config_dir: true
    print_every: 1
job:
    cpu_cores: 20
    cpu_mem_per_core: 4096
    gpu_memory_min: 10240
    minutes: 10
    use_gpu: 1
server:
    aggregator:
        name: FedAvg
    global_learning_rate: 1
    num_rounds: 35
    num_test_batches: 20
...

And this is my mnist_setup.yml file for the CIFAR-10 dataset:

---
client:
    benign_training:
        batch_size: 64
        learning_rate: 0.02
        num_epochs: 2
        optimizer: SGD
        step_decay: true
    debug_client_training: false

    optimized_training: true
    # clip:
    #    type: l2
    #    value: 10
    model_name: lenet5_cifar
#    quantization:
#        type: probabilistic
#        bits: 8
#        frac: 7
dataset:
    # augment_data: false
    data_distribution: IID
    dataset: cifar10
environment:
    experiment_name: lenet5_cifar
    # load_model: /home/hujia/fl-analysis/models/resnet18.h5
    num_clients: 48
    num_malicious_clients: 0
    num_selected_clients: 6
    use_config_dir: true
    print_every: 1
job:
    cpu_cores: 20
    cpu_mem_per_core: 4096
    gpu_memory_min: 10240
    minutes: 10
    use_gpu: 1
server:
    aggregator:
        name: FedAvg
    global_learning_rate: 1
    num_rounds: 35
    num_test_batches: 20
...
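Both configs leave the `clip` block (`type: l2`, `value: 10`) commented out. For reference, the sketch below shows what L2-norm clipping of an update generally does. It is an illustration under assumptions: `clip_l2` is a hypothetical function, the update is treated as a flat list of floats, and I have not checked how the framework itself implements the `clip` option.

```python
import math


def clip_l2(update, max_norm):
    """Scale `update` down so its L2 norm is at most `max_norm`.

    Hypothetical helper for illustration only; not the repository's code.
    """
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm:
        # Already within the bound: return an unchanged copy.
        return list(update)
    # Note: if the update already contains NaN, the norm is NaN and the
    # result stays NaN; clipping cannot repair values that are already NaN.
    scale = max_norm / norm
    return [x * scale for x in update]


small = clip_l2([3.0, 4.0], 10.0)    # norm 5, below the bound
large = clip_l2([30.0, 40.0], 10.0)  # norm 50, scaled down to norm 10
```

Clipping of this kind only bounds the size of an update, so it could limit divergence caused by very large updates, but it would not help if the NaNs come from somewhere else, such as the input data or a package incompatibility.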

I suspect the issue stems from an incorrect package version in my environment. What confuses me, though, is that the code runs correctly with the Shakespeare dataset.
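To help rule the environment in or out, the installed package versions can be listed and compared against the versions the repository expects. The snippet below is a minimal sketch using only the standard library (Python 3.8+). `installed_versions` is a hypothetical helper, and the package names other than tensorboardX (which appears in the warning above) are assumptions about what the project depends on.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # "tensorflow" and "numpy" are assumed dependencies, listed for illustration.
    packages = ["tensorflow", "numpy", "tensorboardX"]
    for pkg, ver in installed_versions(packages).items():
        print(f"{pkg}: {ver or 'not installed'}")
```

The printed versions can then be checked against the project's requirements file, if it provides one.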
