Can't reproduce experimental results

I tried to run the probing tasks for different Atari environments using the following command:

`python -m scripts.run_probe --method infonce-stdim --env-name {env_name}`

I did not change any code; I only tried different games, including `PongNoFrameskip-v4`, `BowlingNoFrameskip-v4`, `BreakoutNoFrameskip-v4`, and `HeroNoFrameskip-v4`.
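
For reference, the four runs can be scripted roughly as follows. This is only a minimal sketch, assuming the command above is the whole invocation and that it is launched from the repository root; `build_command` and `run_all` are hypothetical helper names, not part of the repository. By default it only prints the commands (dry run).

```python
import shlex
import subprocess

# Environments listed in this issue.
ENV_NAMES = [
    "PongNoFrameskip-v4",
    "BowlingNoFrameskip-v4",
    "BreakoutNoFrameskip-v4",
    "HeroNoFrameskip-v4",
]


def build_command(env_name, method="infonce-stdim"):
    """Return the probe command quoted in this issue as an argument list."""
    return [
        "python", "-m", "scripts.run_probe",
        "--method", method,
        "--env-name", env_name,
    ]


def run_all(env_names, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = [build_command(name) for name in env_names]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # Assumption: the working directory is the repository root,
            # so that `scripts.run_probe` can be found by `python -m`.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: prints the four commands without starting any training.
commands = run_all(ENV_NAMES, dry_run=True)
```

Calling `run_all(ENV_NAMES, dry_run=False)` would run the environments one after another with unchanged default settings.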

However, only the F1 score for `pong` matches the score reported in the paper. The F1 scores for the other three games are far worse than those reported in the paper (for `bowling`, I got 0.22).

I checked the training loss logged in wandb, and it seems that training has not converged at all. See the figure below.

![training loss](https://tva1.sinaimg.cn/large/008eGmZEgy1gn16hi89crj328m0u043u.jpg)

How can I get the F1 scores reported in the paper? Am I missing something?

