
model.fit() and eager_tf generate different training results #384


Description

@silvaurus

Hello!

I didn't change the code; I trained the network with both model.fit() and eager_tf.

With model.fit(), the average validation loss drops below 50 in the very first epoch, and the training loss also falls below 50 at the beginning of the second epoch.

With eager_tf, the validation loss is still at ~200 after 10 epochs, while the training loss decreases much more slowly and only reaches ~50 in the 10th epoch, which looks like overfitting.

This is the training result for model.fit():
Epoch 1:
1/358

  • loss: 9787.6289 - yolo_output_0_loss: 508.0005 - yolo_output_1_loss: 1342.9556 - yolo_output_2_loss: 7925.9561

...

357/358

  • loss: 378.2877 - yolo_output_0_loss: 22.6362 - yolo_output_1_loss: 49.9713 - yolo_output_2_loss: 294.6154

358/358

  • loss: 378.0025 - yolo_output_0_loss: 22.6236 - yolo_output_1_loss: 49.9357 - yolo_output_2_loss: 294.3785

val_loss: 51.9096 - val_yolo_output_0_loss: 8.8620 - val_yolo_output_1_loss: 7.8781 - val_yolo_output_2_loss: 24.0912

Epoch 2:
1/358

  • loss: 43.6244 - yolo_output_0_loss: 6.2404 - yolo_output_1_loss: 8.0534 - yolo_output_2_loss: 18.2523

Notice the sudden drop of the training loss from 378 to 43: this is because model.fit() reports a running average over all iterations seen so far in the current epoch, and that average resets at each epoch boundary.
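The running-average behaviour can be illustrated with a small framework-free sketch (the batch losses below are made-up numbers, not taken from this run): the Keras progress bar shows the mean of all per-batch losses seen so far in the epoch, and the accumulator resets at the start of each new epoch, which produces exactly this kind of apparent jump.

```python
def progress_bar_loss(batch_losses):
    """Mimic model.fit() reporting: at step i, show the mean of the
    per-batch losses seen so far in the current epoch."""
    total = 0.0
    shown = []
    for i, loss in enumerate(batch_losses, start=1):
        total += loss
        shown.append(total / i)
    return shown

# Made-up, steeply decreasing batch losses for "epoch 1".
epoch1 = [9000.0, 400.0, 100.0, 60.0, 45.0]
print(progress_bar_loss(epoch1)[-1])  # 1921.0 — the huge first batch dominates the epoch mean

# Epoch 2 resets the accumulator, so the first displayed value
# is simply the first batch loss of epoch 2.
epoch2 = [43.0, 42.0]
print(progress_bar_loss(epoch2)[0])   # 43.0
```

So the displayed loss can fall from ~1921 to 43 between two consecutive steps without any real jump in the per-batch loss.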

This is the training result for eager_tf:
1_train_0, 155262.8125, [5675.242, 34116.484, 115460.375]
...

1_train_356, 523.5953369140625, [124.26721, 100.35405, 287.8407]
1_train_357, 125.0768814086914, [25.127472, 11.3394575, 77.47637]
1_val_0, 565.5044555664062, [86.86941, 158.40671, 309.0946]
...
1_val_363, 694.1661987304688, [114.45209, 213.89682, 354.6836]

(Average) 1, train: 5050.33447265625, val: 590.8134155273438

2_train_0, 788.0953369140625, [132.88559, 241.86014, 402.21585]
2_train_1, 493.3677978515625, [86.920746, 157.22601, 238.08711]

Notice that these are per-iteration losses and are not averaged.
From the very first iteration, the loss values are much larger than with model.fit(), and at the end of epoch 1 the loss is still >100, which is much worse than the <50 reached with model.fit().
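To compare the two modes on equal footing, the eager loop's per-batch losses can be averaged over the epoch the same way model.fit() does. A minimal pure-Python sketch (`RunningMean` is a hypothetical helper, standing in for something like `tf.keras.metrics.Mean`):

```python
class RunningMean:
    """Hypothetical stand-in for a Keras Mean metric: accumulate
    per-batch losses and report their average for the epoch."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count

    def reset(self):
        self.total, self.count = 0.0, 0

avg = RunningMean()
for batch_loss in [523.6, 125.1]:  # made-up per-batch values
    avg.update(batch_loss)
print(avg.result())  # 324.35 — epoch average, comparable to model.fit()'s display
avg.reset()          # reset at each epoch boundary, as model.fit() does
```

With both loops averaged the same way, any remaining gap would point at a genuine training difference rather than a reporting difference.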

I strictly followed the training tutorial and used the datasets / darknet model downloaded directly from the links provided.

I suspect this is related to how the loss is computed or reported differently in the two modes.
Do you happen to know why?
