
model.fit() and eager_tf generate different training results #384


Description

@silvaurus

Hello!

I didn't change the code; I trained the network with both model.fit() and eager_tf.

With model.fit(), the average validation loss drops below 50 in the very first epoch, and the training loss also falls below 50 at the beginning of the second epoch.

With eager_tf, the validation loss is still at ~200 after 10 epochs, while the training loss decreases much more slowly and only reaches ~50 in the 10th epoch, which looks like overfitting.

This is the training result for model.fit():
Epoch 1:
1/358

  • loss: 9787.6289 - yolo_output_0_loss: 508.0005 - yolo_output_1_loss: 1342.9556 - yolo_output_2_loss: 7925.9561

...

357/358

  • loss: 378.2877 - yolo_output_0_loss: 22.6362 - yolo_output_1_loss: 49.9713 - yolo_output_2_loss: 294.6154

358/358

  • loss: 378.0025 - yolo_output_0_loss: 22.6236 - yolo_output_1_loss: 49.9357 - yolo_output_2_loss: 294.3785

val_loss: 51.9096 - val_yolo_output_0_loss: 8.8620 - val_yolo_output_1_loss: 7.8781 - val_yolo_output_2_loss: 24.0912

Epoch 2:
1/358

  • loss: 43.6244 - yolo_output_0_loss: 6.2404 - yolo_output_1_loss: 8.0534 - yolo_output_2_loss: 18.2523

Notice the sudden drop of the training loss from 378 to 43: this is because model.fit() reports a running average over all iterations seen so far in the current epoch, and that average resets at each epoch boundary.
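The running-average behaviour can be illustrated with a small framework-free sketch (the batch losses below are made-up numbers, not taken from this run): the Keras progress bar shows the mean of all per-batch losses seen so far in the epoch, and the accumulator resets at the start of each new epoch, which produces exactly this kind of apparent jump.

```python
def progress_bar_loss(batch_losses):
    """Mimic model.fit() reporting: at step i, show the mean of the
    per-batch losses seen so far in the current epoch."""
    total = 0.0
    shown = []
    for i, loss in enumerate(batch_losses, start=1):
        total += loss
        shown.append(total / i)
    return shown

# Made-up, steeply decreasing batch losses for "epoch 1".
epoch1 = [9000.0, 400.0, 100.0, 60.0, 45.0]
print(progress_bar_loss(epoch1)[-1])  # 1921.0 — the huge first batch dominates the epoch mean

# Epoch 2 resets the accumulator, so the first displayed value
# is simply the first batch loss of epoch 2.
epoch2 = [43.0, 42.0]
print(progress_bar_loss(epoch2)[0])   # 43.0
```

So the displayed loss can fall from ~1921 to 43 between two consecutive steps without any real jump in the per-batch loss.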

This is the training result for eager_tf:
1_train_0, 155262.8125, [5675.242, 34116.484, 115460.375]
...

1_train_356, 523.5953369140625, [124.26721, 100.35405, 287.8407]
1_train_357, 125.0768814086914, [25.127472, 11.3394575, 77.47637]
1_val_0, 565.5044555664062, [86.86941, 158.40671, 309.0946]
...
1_val_363, 694.1661987304688, [114.45209, 213.89682, 354.6836]

(Average) 1, train: 5050.33447265625, val: 590.8134155273438

2_train_0, 788.0953369140625, [132.88559, 241.86014, 402.21585]
2_train_1, 493.3677978515625, [86.920746, 157.22601, 238.08711]

Notice that these are per-iteration losses and are not averaged.
From the very first iteration, the loss values are much larger than with model.fit(), and at the end of epoch 1 the loss is still >100, which is much worse than the <50 reached with model.fit().
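To compare the two modes on equal footing, the eager loop's per-batch losses can be averaged over the epoch the same way model.fit() does. A minimal pure-Python sketch (`RunningMean` is a hypothetical helper, standing in for something like `tf.keras.metrics.Mean`):

```python
class RunningMean:
    """Hypothetical stand-in for a Keras Mean metric: accumulate
    per-batch losses and report their average for the epoch."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count

    def reset(self):
        self.total, self.count = 0.0, 0

avg = RunningMean()
for batch_loss in [523.6, 125.1]:  # made-up per-batch values
    avg.update(batch_loss)
print(avg.result())  # 324.35 — epoch average, comparable to model.fit()'s display
avg.reset()          # reset at each epoch boundary, as model.fit() does
```

With both loops averaged the same way, any remaining gap would point at a genuine training difference rather than a reporting difference.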

I strictly followed the training tutorial and used the datasets / darknet model downloaded directly from the links provided.

I suspect this is related to how the loss is computed or reported differently in the two modes.
Do you happen to know why?
