- Flow, Sumo, RLlib Installation
- Flow Documentation
- Flow tutorials
- More Flow documentation and MAS
- TraCi documentation
- Ray docs
- Ray repo
- SUMO documentation
- SUMO repo
- Change echo for "conda activate dmas" to support "source activate dmas" [X]
- Update README.md to include activating the env and approving the uninstallation of flow
- Update README.md to include how to run the goal simulation (simulation.py)
- Expand README.md to specify an appropriate OS (Ubuntu 18.04) [X]
- Expand README.md to include installation of git and navigation to the folder [X]
- Use the original Flow repo and override files with a shell script
- Install Flow dependencies, Sumo, RLlib
- BEWARE: do not install flow in your Python environment; we have a local copy of the repo in here that needs to be changed.
- Complete tutorials in FLOW, especially (0,1,3,5,6,8,11)
Using grid map from flow tutorial.
Unfortunately OSM maps do not work because there is no way to programmatically generate routes, which are then used to check whether an agent has exited the map. This causes agents to disappear during an episode, leaving arrays of samples with different lengths for every agent (see this).
A workaround was introduced here, but for the one step in which an agent is deleted and reintroduced into the map an observation is missing, which leads to the same error as before.
A possible solution could be to fill the holes in the observation array with a fake observation, but it is unclear how this would influence training.
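A minimal sketch of what such padding could look like, assuming per-agent observation lists of unequal length (the zero-filled fake observation and the dict layout are our own choices, not Flow's API):

```python
import numpy as np

def pad_observations(obs_by_agent, obs_dim):
    """Pad each agent's observation list with zero 'fake' observations so
    every agent ends the episode with the same number of samples.

    obs_by_agent: dict mapping agent id -> list of observation vectors.
    obs_dim: length of a single observation vector.
    """
    max_len = max(len(obs) for obs in obs_by_agent.values())
    padded = {}
    for agent_id, obs in obs_by_agent.items():
        fake = [np.zeros(obs_dim)] * (max_len - len(obs))
        padded[agent_id] = np.array(list(obs) + fake)
    return padded
```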
We can either import a map with OpenStreetMap or create a custom one with Scenario.
- Starting with a custom grid map [X]
- Importing scenario from openmap [X]
- Use the LuST scenario without traffic lights: not possible, switching to OSM networks
- Set inflows from random starting edges in the OSM network [X]
- Add GroningenOSM to the map_utils [x]
We can import a pre-made network as in tutorial 4; here's a list:
This approach has been discarded since there is no easy way to remove traffic lights (network geometry) from these imported scenarios. Using OSM instead.
Check out the environment tutorial for this part. It covers the autonomous agents (AAs) using RL.
Custom environments must be placed in the envs dir, and you should add them to the init file. There is already a custom RL agent which can be modified as you please.
I advise you to use the third tutorial to implement the agent (remember to set the DEBUG option to true in the Parameter file) and, when you are ready, you can train it on the simulation.
For every custom agent the following functions must be implemented:
The action space is used to tell the agent what it can and cannot do. Notice that deceleration and acceleration are treated as a single parameter (see the sketch after this list):
- Deceleration: using a comfortable deceleration rate of 3.5 m/s^2 as stated here
- Acceleration: using 10 m/s^2 (Ferrari-level acceleration); should look into this wiki link, which states that about 20 g (~200 m/s^2) can be tolerated for less than 10 seconds and 10 g for 1 minute
- Lane_changing: todo
- Message: todo; check what happens when AAs are able to send a float (0-1) to neighboring cars
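A sketch of how this single acceleration/deceleration parameter could be exposed as the action space, using the bounds above (the base class and attribute names depend on the Flow version, so treat it as a template rather than drop-in code):

```python
import numpy as np
from gym.spaces import Box
from flow.envs import Env  # import path may differ across Flow versions


class CustomRLEnv(Env):
    @property
    def action_space(self):
        # One continuous command per RL vehicle: negative values brake
        # (down to -3.5 m/s^2), positive values accelerate (up to 10 m/s^2).
        return Box(low=-3.5, high=10.0,
                   shape=(self.initial_vehicles.num_rl_vehicles,),
                   dtype=np.float32)
```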
Define what cars know about each other (turning direction); if you go by neighbors, check out getNeighbors in the TraCI documentation.
Note that each observation should be scaled to [0, 1] (a sketch covering the first entries follows the list):
- agent speed
- difference between leader speed and agent speed
- distance from leader
- difference between agent speed and follower speed
- distance from follower
- number of neighbors (not scaled, obviously)
- average neighbors speed
- average neighbors acceleration
- Messages from neighboring AAs (as described above)
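A sketch covering the first three entries of the list, using Flow's vehicle kernel; MAX_SPEED and MAX_GAP are assumed scaling constants, and the kernel method names may differ slightly between Flow versions:

```python
import numpy as np

MAX_SPEED = 30.0   # m/s, assumed speed limit used for scaling
MAX_GAP = 100.0    # m, assumed maximum headway used for scaling

def normalized_obs(env, rl_id):
    """Return the first three observation entries, scaled to [0, 1]."""
    speed = env.k.vehicle.get_speed(rl_id)
    lead_id = env.k.vehicle.get_leader(rl_id)
    lead_speed = env.k.vehicle.get_speed(lead_id) if lead_id else speed
    headway = env.k.vehicle.get_headway(rl_id) if lead_id else MAX_GAP

    return np.clip([
        speed / MAX_SPEED,                              # agent speed
        0.5 + (lead_speed - speed) / (2 * MAX_SPEED),   # speed difference with leader
        headway / MAX_GAP,                              # distance from leader
    ], 0.0, 1.0)
```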
The current reward is the sum of the following parameters (a sketch follows the list):
- Cooperative: (system_delay + system_standstill_time) * cooperative_weight
- Selfish: agent_specific_delay
- Scaled jerk: the lower the better
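A sketch of how the three terms could be combined; the weights, the argument names, and the sign convention (delays treated as penalties) are placeholders and assumptions, not the names used in our environment:

```python
def compute_reward(system_delay, system_standstill, agent_delay, scaled_jerk,
                   cooperative_weight=0.5, jerk_weight=0.1):
    """Combine the cooperative, selfish and comfort terms described above."""
    cooperative = -(system_delay + system_standstill) * cooperative_weight
    selfish = -agent_delay
    comfort = -jerk_weight * scaled_jerk   # the lower the jerk, the better
    return cooperative + selfish + comfort
```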
If you wish to add more functions to the traci interface, you need to look into the Vehicle file, which can be found inside your conda env; for example, mine is at:
/anaconda3/envs/dmas/lib/python3.6/site-packages/traci/_vehicle.py
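If all you need is an existing TraCI call that Flow does not yet expose (e.g. getNeighbors, mentioned above), a lighter alternative to editing _vehicle.py is wrapping it in Flow's vehicle kernel. A sketch, assuming the kernel keeps the raw TraCI connection in kernel_api and that your SUMO version ships getNeighbors:

```python
# Sketch of a method to add to Flow's TraCI vehicle kernel class
# (flow/core/kernel/vehicle/traci.py); get_neighbors is our own name.
def get_neighbors(self, veh_id, mode=0):
    """Expose TraCI's getNeighbors through the Flow kernel.

    `mode` is TraCI's bitset selecting left/right leaders/followers;
    see the SUMO TraCI documentation for the exact encoding.
    """
    return self.kernel_api.vehicle.getNeighbors(veh_id, mode)
```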
Add differentiation between MADDPG for cooperative AAs and DPG for selfish ones [X]
- Reward function for accelerated learning
- Safe, Efficient, and Comfortable Velocity Control based on Reinforcement Learning for Autonomous Driving
- Use gpu support [x]
- Print debug infos and training progress [x]
The evaluation functions are in train_utils.py; they cover the following (see the sketch after the list):
- Reward: min, max, mean; split for kind of agent [X]
- Delay: mean, max, min, standstill (mean jerk ?) [X]
- Actions:
- Acceleration: mean/max/min
- Same as step
- Env params [x]
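A sketch of the reward statistics split per kind of agent; the dict layout is an assumption about our own logging format, not something train_utils.py already defines:

```python
import numpy as np

def reward_stats(rewards_by_agent, is_cooperative):
    """rewards_by_agent: dict agent_id -> list of episode rewards.
    is_cooperative: dict agent_id -> bool (cooperative vs selfish)."""
    stats = {}
    for kind, coop in (("coop", True), ("selfish", False)):
        vals = [r for aid, rs in rewards_by_agent.items()
                if is_cooperative[aid] == coop for r in rs]
        if vals:
            stats[kind] = {"min": np.min(vals),
                           "max": np.max(vals),
                           "mean": np.mean(vals)}
    return stats
```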
This section is about research topics we will add to the project. For every issue we will give a brief description of how it can be addressed, as well as a reference to the paper itself.
For this kind of issue take a look at Ray's Tune framework, which is the one we're using right now.
We need something with the following characteristics:
- No Q-learning such as DQN; Actor/Critic is better.
- IMPALA cannot be used since it lacks an implementation in ray (here)
Check here. It says to use either MADDPG or MARWIL.
From our paper and more: model-based approaches seem to be the ones that work better in MAS envs.
So we can exclude the following:
- PPO
- DQN
- PG
- A2C
- A3C
- DDPG
- TD3
- APEX
More here. Main points:
- No good when reward is zero -> high baseline for reward (must be small)
More here. Main points:
- Formalizes the constraint as a penalty in the objective function.
- Can use a first-order optimizer.
More here. Main points:
- Makes actions with high rewards more likely, and vice versa.
Naive; the task is too hard for this simplistic approach.
More here. Main points:
- The actor directly maps states to actions (the output of the network is directly the action) instead of outputting a probability distribution over a discrete action space.
- Improves stability in learning
More here. Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence “twin”), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
Trick Two: “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-function. The paper recommends one policy update for every two Q-function updates.
Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
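As a concrete illustration of tricks one and three, the target used in the Bellman error can be sketched as follows (plain numpy, not RLlib's implementation; the target networks are passed in as callables):

```python
import numpy as np

def td3_target(reward, next_state, done, q1_target, q2_target, pi_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute the TD3 Bellman target for a single transition."""
    a_next = pi_target(next_state)
    # Trick three: target policy smoothing with clipped Gaussian noise.
    noise = np.clip(np.random.normal(0.0, noise_std, size=np.shape(a_next)),
                    -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, -act_limit, act_limit)
    # Trick one: clipped double-Q, take the smaller of the two target Q-values.
    q_min = np.minimum(q1_target(next_state, a_next),
                       q2_target(next_state, a_next))
    return reward + gamma * (1.0 - done) * q_min
```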
Avoids noisy gradients and high variance. The “Critic” estimates the value function. This could be the action-value (the Q value) or state-value (the V value). The “Actor” updates the policy distribution in the direction suggested by the Critic (such as with policy gradients).
Alternatively, one can use model-based policy optimization, which can learn optimal policies via back-propagation, but this requires a (differentiable) model of the world dynamics and assumptions about the interactions between agents.
Introduced by DeepMind here. Main points:
- Implements parallel training where multiple workers in parallel environments independently update a global value function.
- Not proven to be better than A2C
Main points:
- A2C is like A3C but without the asynchronous part.
More here. Main points:
- SAC is an off-policy algorithm.
- Used for environments with continuous action spaces.
- Entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy. This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on
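For reference, the entropy-regularized objective SAC maximizes is usually written as:

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[ \sum_{t} \gamma^{t} \big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big) \Big] $$

where the temperature $\alpha$ controls the trade-off between expected return and entropy mentioned above.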
Importance Weighted Actor-Learner Architecture
- different actors and learners that can collaborate to build knowledge across different domains
- completely independent actors and learners. This simple architecture enables the learner(s) to be accelerated using GPUs and actors to be easily distributed across many machines.
- mitigate the lag between when actions are generated by the actors and when the learner estimates the gradient.
- efficiently operate in multi-task environments.
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments.
Here's the paper and here's the implementation.
- leads to learned policies that only use local information (i.e. their own observations) at execution time
- does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents
- is applicable not only to cooperative interaction but to competitive or mixed interaction involving both physical and communicative behavior.
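A rough sketch of how this could be wired into RLlib's multi-agent API; the trainer string, spaces, env name and the agent-id convention are all assumptions (and the contrib MADDPG trainer may need extra per-policy options; check its example script):

```python
from gym.spaces import Box
from ray import tune

# Placeholder spaces just to keep the sketch self-contained.
obs_space = Box(low=0.0, high=1.0, shape=(9,))
act_space = Box(low=-3.5, high=10.0, shape=(1,))

policies = {
    "coop": (None, obs_space, act_space, {}),      # cooperative AAs
    "selfish": (None, obs_space, act_space, {}),   # selfish AAs
}

def policy_mapping_fn(agent_id):
    # Hypothetical naming convention for our vehicles.
    return "coop" if agent_id.startswith("coop") else "selfish"

tune.run(
    "contrib/MADDPG",  # trainer name as registered in ray.rllib.contrib
    config={
        "env": "my_flow_env",  # hypothetical registered Flow env name
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
        },
    },
)
```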
Exponentially Weighted Imitation Learning for Batched Historical Data. More here.
Main points:
- applicable to problems with complex nonlinear function approximation
- works well with a hybrid (discrete and continuous) action space: both acceleration and lane switching
- can be used to learn from data generated by an unknown policy
- works with batched historical trajectories
Check this for model type
The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors.
More here. Variations:
- APEX
- APEX_DDPG [X]
More here. Main points:
- ES resembles simple hill-climbing in a high-dimensional space based only on finite differences along a few random directions at each step.
More here. Main points:
- DQN is an RL technique aimed at choosing the best action for the given circumstances (observation).
We won't use it since it is prone to high variance in the training phase.
The scheduler is used to schedule training procedures; it can parallelize and can take the best out of a population of N. It is used for tuning.
- Currently trying population-based training (see the sketch below)
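A minimal sketch of population-based training with Tune; the trainer string, mutated hyperparameters and env name are placeholders, and the scheduler's argument names vary a bit across Ray versions:

```python
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=10,       # how often members are exploited/explored
    hyperparam_mutations={
        "lr": [1e-3, 5e-4, 1e-4],
        "train_batch_size": [1000, 2000, 4000],
    },
)

tune.run(
    "APEX_DDPG",                    # any RLlib trainer string works here
    scheduler=pbt,
    num_samples=4,                  # size of the population
    config={"env": "my_flow_env", "lr": 1e-3, "train_batch_size": 2000},
)
```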
This subsection is dedicated to the RL agent functions
Action space is continuous since we're using acceleration as output.
Something
Ideas
Pros and cons of the 'clip action' function (see the sketch below)
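For context, 'clipping actions' just means squashing whatever the policy outputs into the Box bounds before the command reaches SUMO (bounds here match the action space above); one commonly cited downside is that the policy receives no signal about how far outside the bounds it was:

```python
import numpy as np

def clip_action(action, low=-3.5, high=10.0):
    """Clip a raw policy output into the valid acceleration range."""
    return np.clip(action, low, high)
```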
#### Common way: Actor-Critic Methods
Link. A natural extension of the idea of reinforcement comparison methods to TD learning and to the full reinforcement learning problem.
Application of the Self-Organising Map to Reinforcement Learning
Section 4 provides information for continuous action spaces.