
Add termination condition based on percentage of visited tiles for Car Racing #1323


Open
wants to merge 3 commits into main

Conversation

VincenzoPalma

Description

The environment will now end with terminated = True when the lap is completed after reaching the specified percentage of visited tiles.

Fixes #1269
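
For reference, here is a minimal sketch of the intended check (the attribute names are assumptions for illustration, not the exact diff):

# Inside CarRacing.step(), after the visited-tile count has been updated (sketch, names assumed)
coverage = self.tile_visited_count / len(self.track)
if self.new_lap and coverage >= self.lap_complete_percent:
    # Lap crossed with enough of the track visited, so end the episode
    terminated = True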

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • This change requires a documentation update

Checklist:

  • I have made corresponding changes to the documentation

@pseudo-rnd-thoughts
Member

@VincenzoPalma To clarify, what was the previous behaviour when the agent had crossed the lap completion percentage and got to the end?

Could you train an agent using PPO from SB3 and share the training graphs of the old and new versions?

@VincenzoPalma
Author

@pseudo-rnd-thoughts The previous behavior in that scenario was that the environment would not terminate upon completing the lap but would instead continue until reaching the time limit, at which point it would end with truncated = True.
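
A quick way to see the difference is to roll out random actions until the episode ends and inspect the two flags (a minimal sketch; exact outcomes depend on the agent's behaviour):

import gymnasium as gym

env = gym.make("CarRacing-v3", lap_complete_percent=0.25)
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
# Old behaviour: an agent that completes the lap still runs until the time limit (truncated=True);
# new behaviour: terminated=True as soon as the lap is completed with enough tiles visited.
print(terminated, truncated)
env.close()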

I will try and train an agent as you suggest as soon as I can.

@pseudo-rnd-thoughts
Member

@VincenzoPalma Have you had any time to train agents for the different Car Racing versions?

@VincenzoPalma
Author

I've only had a few days to work on this so far, and since it's my first time using SB3, it's taking a bit longer. I've obtained some training graphs for 25, 50, 75, and 90 percent of track covered, but something seems off, so I'm conducting more in-depth testing.

@pseudo-rnd-thoughts
Member

Thanks for doing that @VincenzoPalma, keep me updated here or on Discord if you are uncertain how to get something working.

@VincenzoPalma
Author

VincenzoPalma commented Mar 19, 2025

Can I share TensorBoard log files to show you the training graphs?

@pseudo-rnd-thoughts
Member

You can either share images on GitHub or message me on Discord so that I can look at the files.

@VincenzoPalma
Author

VincenzoPalma commented Mar 19, 2025

I'll share some images as soon as I obtain the graphs with 75% as the minimum percentage of visited tiles. I'll also share the code to see if it's correct for the task and to receive feedback. If it's good, I'll get the training graphs for the other percentages.

@VincenzoPalma
Author

So, here's the code that I used:

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from stable_baselines3.common.logger import configure

import gymnasium as gym

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n_envs = 16

def make_env():
    # Each worker gets its own CarRacing instance, wrapped in Monitor for episode statistics
    env = gym.make("CarRacing-v3", lap_complete_percent=0.25)
    return Monitor(env)

env = DummyVecEnv([make_env for _ in range(n_envs)])

# Log training metrics to TensorBoard
log_dir = "./logs/"
new_logger = configure(log_dir, ["tensorboard"])

new_model = PPO("CnnPolicy", env, verbose=1, tensorboard_log=log_dir, device=device)
new_model.set_logger(new_logger)

# Periodically evaluate the policy and checkpoint the model
eval_callback = EvalCallback(env, best_model_save_path='./logs/best_model',
                             log_path='./logs/results', eval_freq=10000,
                             deterministic=True, render=False)

checkpoint_callback = CheckpointCallback(save_freq=10000, save_path='./logs/',
                                         name_prefix='ppo_model')

# Pass the callbacks to learn() so they actually run during training
new_model.learn(total_timesteps=500_000, callback=[eval_callback, checkpoint_callback])

new_model_name = "ppo_car_gray_new252"
new_model.save(new_model_name)
print(f"New model saved as {new_model_name}")

So far, I've trained two agents: one before the change and one after, both with a 75% lap complete percentage.
Here are some graphs of the agent before the change:
[training graphs, old version]

Graphs of the agent after the change:
[training graphs, new version]

What stands out the most to me is the difference in the mean reward per episode. It looks better in the new version of the game, probably because the game now ends correctly before the agent can take any action that would give it a negative reward.

I'll wait for your feedback on the code and the data.

@pseudo-rnd-thoughts
Member

Thanks for the graphs @VincenzoPalma, overall I'm surprised that the episode reward is always negative. This might be a feature of the environment, but I would have expected that the episode reward could be positive.
Looking at the SB3 benchmarks (https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/benchmark.md), for CarRacing-v0 they get roughly 800 and 150; however, this is for v0. Do you know why that could be?

@VincenzoPalma
Author

My initial assumption is that 500k steps might not be sufficient to achieve a positive reward. I initially set 500k steps mainly to compare the agents before and after the change, but I could try increasing it to a value more in line with the numbers used in the benchmarks you shared to see if it yields better results.

@pseudo-rnd-thoughts
Member

Here are some tuned hyperparameters for CarRacing-v2 that you could test.

https://github.com/DLR-RM/rl-baselines3-zoo/blob/e00c5c83447e81ab4936b80a61a31a2109485498/hyperparams/ppo.yml#L350

However, the SB3 logs don't have a v2, only v0 (https://huggingface.co/sb3/ppo-CarRacing-v0).
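
As a rough sketch, those values can be passed directly to the PPO constructor; the numbers below are placeholders for illustration only, so use the actual entries from the linked ppo.yml (this reuses the env and device from the training script above):

from stable_baselines3 import PPO

model = PPO(
    "CnnPolicy",
    env,
    n_steps=512,          # placeholder, take the real value from the yaml
    batch_size=128,       # placeholder
    n_epochs=10,          # placeholder
    learning_rate=1e-4,   # placeholder
    gamma=0.99,           # placeholder
    gae_lambda=0.95,      # placeholder
    ent_coef=0.0,         # placeholder
    verbose=1,
    device=device,
)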

@VincenzoPalma
Author

I trained an agent (75% and the new version of the game) using those hyperparameters but keeping 500k steps, and I got the expected results:
[training graph]
Now, should I train agents for both versions and all percentages using these hyperparameters? Also, can I keep using 500k steps for time purposes?

@pseudo-rnd-thoughts
Member

Ohh, those are way better results. Yeah, if you can do that, use the same hyperparameters and 3 different percentages with both the old and new environments to show the differences.
We want to demonstrate that the environment changes make sense and don't break something unexpectedly.

@VincenzoPalma
Author

VincenzoPalma commented Mar 23, 2025

Here are the training graphs for the 25, 50, and 75 percentages. It looks like the old version achieves better rewards at 25 and 50 and similar rewards at 75. Could this be because the old version ends the episode less frequently, giving the car more opportunities to discover unexplored tiles?
[training graphs for 25%, 50%, and 75%, old and new versions]

@AUnicyclingProgrammer

AUnicyclingProgrammer commented Mar 25, 2025

I've been following your progress and wanted to pass along something that I noticed when I was training my agent.

I think that v3 of the environment provides a better training environment than our current test version.

When I was training my agent, I considered modifying the environment such that every time the agent crossed the start/finish line the environment checked to see if the agent had covered enough tiles to consider the lap complete, but didn't terminate the race unless the agent had covered enough tiles. My thinking was that this would help the agent learn to stay on the road because the agent could continue learning for the entirety of the time limit. If you started training with a generous time limit (say 5000 steps) an agent could still "complete the race" even if it took 2 or 3 laps to do so.
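
A minimal sketch of that idea, assuming the time limit can simply be raised when making the environment:

import gymnasium as gym

# Generous step budget so the agent can reach the required tile coverage even if
# it takes 2-3 laps; termination would then only trigger when the start/finish
# line is crossed with enough tiles visited.
env = gym.make("CarRacing-v3", lap_complete_percent=0.95, max_episode_steps=5000)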

This would require modifying the truncation and termination conditions we defined in #1269, but I think it would also speed up training and improve agent performance. My concern is that these changes may be significant enough to create a fundamental difference between how the current version of the environment operates and how future versions would operate were this change implemented.

Development

Successfully merging this pull request may close these issues.

[Bug Report] "CarRacing-v3" Appears to Reset before Completing a Lap causing it to Ignore lap_complete_percent