
Add termination condition based on percentage of visited tiles for Car Racing #1323


Open
wants to merge 3 commits into main

Conversation

VincenzoPalma

Description

The environment will now end with terminated = True when the lap is completed after reaching the specified percentage of visited tiles.

Fixes #1269
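
For reference, here is a minimal sketch of the intended check (the attribute names are assumptions for illustration, not the exact diff):

# Inside CarRacing.step(), after the visited-tile count has been updated (sketch, names assumed)
coverage = self.tile_visited_count / len(self.track)
if self.new_lap and coverage >= self.lap_complete_percent:
    # Lap crossed with enough of the track visited, so end the episode
    terminated = True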

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • This change requires a documentation update

Checklist:

  • I have made corresponding changes to the documentation

@pseudo-rnd-thoughts
Member

@VincenzoPalma To clarify, what was the previous behaviour when the agent had crossed the lap completion percentage and got to the end?

Could you train an agent using PPO from SB3 and share the training graphs of the old and new versions?

@VincenzoPalma
Author

@pseudo-rnd-thoughts The previous behavior in that scenario was that the environment would not terminate upon completing the lap but would instead continue until reaching the time limit, at which point it would end with truncated = True.
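
A quick way to see the difference is to roll out random actions until the episode ends and inspect the two flags (a minimal sketch; exact outcomes depend on the agent's behaviour):

import gymnasium as gym

env = gym.make("CarRacing-v3", lap_complete_percent=0.25)
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
# Old behaviour: an agent that completes the lap still runs until the time limit (truncated=True);
# new behaviour: terminated=True as soon as the lap is completed with enough tiles visited.
print(terminated, truncated)
env.close()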

I will try and train an agent as you suggest as soon as I can.

@pseudo-rnd-thoughts
Member

@VincenzoPalma Have you had any time to train agents for the different Car Racing versions?

@VincenzoPalma
Author

I've only had a few days to work on this so far, and since it's my first time using SB3, it's taking a bit longer. I've obtained some training graphs for 25, 50, 75, and 90 percent of track covered, but something seems off, so I'm conducting more in-depth testing.

@pseudo-rnd-thoughts
Member

Thanks for doing that @VincenzoPalma, keep me updated here or on Discord if you are uncertain how to get something working.

@VincenzoPalma
Author

VincenzoPalma commented Mar 19, 2025

Can I share TensorBoard log files to show you the training graphs?

@pseudo-rnd-thoughts
Member

You can either share images on GitHub or message me on Discord so that I can look at the files.

@VincenzoPalma
Author

VincenzoPalma commented Mar 19, 2025

I'll share some images as soon as I obtain the graphs with 75% as the minimum percentage of visited tiles. I'll also share the code to see if it's correct for the task and to receive feedback. If it's good, I'll get the training graphs for the other percentages.

@VincenzoPalma
Author

So, here's the code that I used:

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from stable_baselines3.common.logger import configure

import gymnasium as gym

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n_envs = 16

def make_env():
    # Each worker gets its own CarRacing instance, wrapped in Monitor for episode statistics
    env = gym.make("CarRacing-v3", lap_complete_percent=0.25)
    return Monitor(env)

env = DummyVecEnv([make_env for _ in range(n_envs)])

# Log training metrics to TensorBoard
log_dir = "./logs/"
new_logger = configure(log_dir, ["tensorboard"])

new_model = PPO("CnnPolicy", env, verbose=1, tensorboard_log=log_dir, device=device)
new_model.set_logger(new_logger)

# Periodically evaluate the policy and checkpoint the model
eval_callback = EvalCallback(env, best_model_save_path='./logs/best_model',
                             log_path='./logs/results', eval_freq=10000,
                             deterministic=True, render=False)

checkpoint_callback = CheckpointCallback(save_freq=10000, save_path='./logs/',
                                         name_prefix='ppo_model')

# Pass the callbacks to learn() so they actually run during training
new_model.learn(total_timesteps=500_000, callback=[eval_callback, checkpoint_callback])

new_model_name = "ppo_car_gray_new252"
new_model.save(new_model_name)
print(f"New model saved as {new_model_name}")

So far, I've trained two agents: one before the change and one after, both with a 75% lap complete percentage.
Here are some graphs of the agent before the change:
[training graphs, old version]

Graphs of the agent after the change:
[training graphs, new version]

What stands out the most to me is the difference in the mean reward per episode. It looks better in the new version of the game, probably because the game now ends correctly before the agent can take any action that would give it a negative reward.

I'll wait for your feedback on the code and the data.

@pseudo-rnd-thoughts
Member

Thanks for the graphs @VincenzoPalma, overall I'm surprised that the episode reward is always negative. This might be a feature of the environment, but I would have expected that the episode reward could be positive.
Looking at the SB3 benchmarks (https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/benchmark.md), for CarRacing-v0 they get roughly 800 and 150; however, this is for v0. Do you know why that could be?

@VincenzoPalma
Author

My initial assumption is that 500k steps might not be sufficient to achieve a positive reward. I initially set 500k steps mainly to compare the agents before and after the change, but I could try increasing it to a value more in line with the numbers used in the benchmarks you shared to see if it yields better results.

@pseudo-rnd-thoughts
Member

Here are some tuned hyperparameters for CarRacing-v2 that you could test.

https://github.com/DLR-RM/rl-baselines3-zoo/blob/e00c5c83447e81ab4936b80a61a31a2109485498/hyperparams/ppo.yml#L350

However, the SB3 logs don't have a v2, only v0 (https://huggingface.co/sb3/ppo-CarRacing-v0).
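
As a rough sketch, those values can be passed directly to the PPO constructor; the numbers below are placeholders for illustration only, so use the actual entries from the linked ppo.yml (this reuses the env and device from the training script above):

from stable_baselines3 import PPO

model = PPO(
    "CnnPolicy",
    env,
    n_steps=512,          # placeholder, take the real value from the yaml
    batch_size=128,       # placeholder
    n_epochs=10,          # placeholder
    learning_rate=1e-4,   # placeholder
    gamma=0.99,           # placeholder
    gae_lambda=0.95,      # placeholder
    ent_coef=0.0,         # placeholder
    verbose=1,
    device=device,
)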

@VincenzoPalma
Author

I trained an agent (75% and the new version of the game) using those hyperparameters but keeping 500k steps, and I got the expected results:
[training graph]
Now, should I train agents for both versions and all percentages using these hyperparameters? Also, can I keep using 500k steps for time purposes?

@pseudo-rnd-thoughts
Member

Ohh, those are way better results. Yeah, if you can do that, use the same hyperparameters and 3 different percentages with both the old and new environments to show the differences.
We want to demonstrate that the environment changes make sense and don't break something unexpectedly.

@VincenzoPalma
Author

VincenzoPalma commented Mar 23, 2025

Here are the training graphs for the 25, 50, and 75 percentages. It looks like the old version achieves better rewards at 25 and 50 and similar rewards at 75. Could this be because the old version ends the episode less frequently, giving the car more opportunities to discover unexplored tiles?
[training graphs for 25%, 50%, and 75%, old and new versions]

@AUnicyclingProgrammer

AUnicyclingProgrammer commented Mar 25, 2025

I've been following your progress and wanted to pass along something that I noticed when I was training my agent.

I think that v3 of the environment provides a better training environment than our current test version.

When I was training my agent, I considered modifying the environment such that every time the agent crossed the start/finish line the environment checked to see if the agent had covered enough tiles to consider the lap complete, but didn't terminate the race unless the agent had covered enough tiles. My thinking was that this would help the agent learn to stay on the road because the agent could continue learning for the entirety of the time limit. If you started training with a generous time limit (say 5000 steps) an agent could still "complete the race" even if it took 2 or 3 laps to do so.
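
A minimal sketch of that idea, assuming the time limit can simply be raised when making the environment:

import gymnasium as gym

# Generous step budget so the agent can reach the required tile coverage even if
# it takes 2-3 laps; termination would then only trigger when the start/finish
# line is crossed with enough tiles visited.
env = gym.make("CarRacing-v3", lap_complete_percent=0.95, max_episode_steps=5000)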

This would require modifying the truncation and termination conditions we defined in #1269, but I think it would also speed up training and improve agent performance. My concern is that these changes may be significant enough to create a fundamental difference between how the current version of the environment operates and how future versions would operate were this change implemented.

Development

Successfully merging this pull request may close these issues.

[Bug Report] "CarRacing-v3" Appears to Reset before Completing a Lap causing it to Ignore lap_complete_percent