- Flow, Sumo, RLlib Installation
- Flow Documentation
- Flow tutorials
- More Flow documentation and MAS
- TraCi documentation
- Ray docs
- Ray repo
- SUMO documentation
- SUMO repo
- Change echo for "conda activate dmas" to support "source activate dmas" [X]
- Update README.md to include activating the env and approving the uninstallation of flow
- Update README.md to include how to run the goal simulation (simulation.py)
- Expand README.md to specify an appropriate OS (Ubuntu 18.04) [X]
- Expand README.md to include installation of git and navigation to the folder [X]
- Use the original Flow repo and override files with a shell script
- Install Flow dependencies, Sumo, RLlib
- BEWARE: do not install flow in your Python environment; we have a local copy of the repo in here that needs to be changed.
- Complete tutorials in FLOW, especially (0,1,3,5,6,8,11)
Using grid map from flow tutorial.
Unfortunately OSM maps do not work because there is no way to programmatically generate routes, which are then used to check whether an agent has exited the map. This causes agents to disappear during an episode, leaving arrays of samples with different lengths for every agent (see this).
A workaround was introduced here, but for the one step in which an agent is deleted and reintroduced into the map an observation is missing, which leads to the same error as before.
A possible solution could be to fill the holes in the observation array with a fake observation, but it is unclear how this would influence training.
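A minimal sketch of what such padding could look like, assuming per-agent observation lists of unequal length (the zero-filled fake observation and the dict layout are our own choices, not Flow's API):

```python
import numpy as np

def pad_observations(obs_by_agent, obs_dim):
    """Pad each agent's observation list with zero 'fake' observations so
    every agent ends the episode with the same number of samples.

    obs_by_agent: dict mapping agent id -> list of observation vectors.
    obs_dim: length of a single observation vector.
    """
    max_len = max(len(obs) for obs in obs_by_agent.values())
    padded = {}
    for agent_id, obs in obs_by_agent.items():
        fake = [np.zeros(obs_dim)] * (max_len - len(obs))
        padded[agent_id] = np.array(list(obs) + fake)
    return padded
```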
We can either import a map with OpenStreetMap or create a custom one with Scenario.
- Starting with a custom grid map [X]
- Importing scenario from openmap [X]
- Use the LuST scenario without traffic lights: not possible, switching to OSM networks
- Set inflows from random starting edges in the OSM network [X]
- Add GroningenOSM to the map_utils [x]
We can import a pre-made network as in tutorial 4; here's a list:
This approach has been discarded since there is no easy way to remove traffic lights (network geometry) from these imported scenarios. Using OSM instead.
Check out the environment tutorial for this part. It covers the autonomous agents (AAs) using RL.
Custom environments must be placed in the envs dir, and you should add them to the init file. There is already a custom RL agent which can be modified as you please.
I advise you to use the third tutorial to implement the agent (remember to set the DEBUG option to true in the Parameter file) and, when you are ready, you can train it on the simulation.
For every custom agent the following functions must be implemented:
The action space is used to tell the agent what it can and cannot do. Notice that deceleration and acceleration are treated as a single parameter (see the sketch after this list):
- Deceleration: using a comfortable deceleration rate of 3.5 m/s^2 as stated here
- Acceleration: using 10 m/s^2 (Ferrari-level acceleration); should look into this wiki link, which states that about 20 g (~200 m/s^2) can be tolerated for less than 10 seconds and 10 g for 1 minute
- Lane_changing: todo
- Message: todo; check what happens when AAs are able to send a float (0-1) to neighboring cars
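A sketch of how this single acceleration/deceleration parameter could be exposed as the action space, using the bounds above (the base class and attribute names depend on the Flow version, so treat it as a template rather than drop-in code):

```python
import numpy as np
from gym.spaces import Box
from flow.envs import Env  # import path may differ across Flow versions


class CustomRLEnv(Env):
    @property
    def action_space(self):
        # One continuous command per RL vehicle: negative values brake
        # (down to -3.5 m/s^2), positive values accelerate (up to 10 m/s^2).
        return Box(low=-3.5, high=10.0,
                   shape=(self.initial_vehicles.num_rl_vehicles,),
                   dtype=np.float32)
```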
Define what cars know about each other (turning direction); if you go by neighbors, check out getNeighbors in the TraCI documentation.
Note that each observation should be scaled to [0, 1] (a sketch covering the first entries follows the list):
- agent speed
- difference between leader speed and agent speed
- distance from leader
- difference between agent speed and follower speed
- distance from follower
- number of neighbors (not scaled, obviously)
- average neighbors speed
- average neighbors acceleration
- Messages from neighboring AAs (as described above)
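A sketch covering the first three entries of the list, using Flow's vehicle kernel; MAX_SPEED and MAX_GAP are assumed scaling constants, and the kernel method names may differ slightly between Flow versions:

```python
import numpy as np

MAX_SPEED = 30.0   # m/s, assumed speed limit used for scaling
MAX_GAP = 100.0    # m, assumed maximum headway used for scaling

def normalized_obs(env, rl_id):
    """Return the first three observation entries, scaled to [0, 1]."""
    speed = env.k.vehicle.get_speed(rl_id)
    lead_id = env.k.vehicle.get_leader(rl_id)
    lead_speed = env.k.vehicle.get_speed(lead_id) if lead_id else speed
    headway = env.k.vehicle.get_headway(rl_id) if lead_id else MAX_GAP

    return np.clip([
        speed / MAX_SPEED,                              # agent speed
        0.5 + (lead_speed - speed) / (2 * MAX_SPEED),   # speed difference with leader
        headway / MAX_GAP,                              # distance from leader
    ], 0.0, 1.0)
```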
The current reward is the sum of the following parameters (a sketch follows the list):
- Cooperative: (system_delay + system_standstill_time) * cooperative_weight
- Selfish: agent_specific_delay
- Scaled jerk: the lower the better
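A sketch of how the three terms could be combined; the weights, the argument names, and the sign convention (delays treated as penalties) are placeholders and assumptions, not the names used in our environment:

```python
def compute_reward(system_delay, system_standstill, agent_delay, scaled_jerk,
                   cooperative_weight=0.5, jerk_weight=0.1):
    """Combine the cooperative, selfish and comfort terms described above."""
    cooperative = -(system_delay + system_standstill) * cooperative_weight
    selfish = -agent_delay
    comfort = -jerk_weight * scaled_jerk   # the lower the jerk, the better
    return cooperative + selfish + comfort
```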
If you wish to add more functions to the traci interface, you need to look into the Vehicle file, which can be found inside your conda env; for example, mine is at:
/anaconda3/envs/dmas/lib/python3.6/site-packages/traci/_vehicle.py
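If all you need is an existing TraCI call that Flow does not yet expose (e.g. getNeighbors, mentioned above), a lighter alternative to editing _vehicle.py is wrapping it in Flow's vehicle kernel. A sketch, assuming the kernel keeps the raw TraCI connection in kernel_api and that your SUMO version ships getNeighbors:

```python
# Sketch of a method to add to Flow's TraCI vehicle kernel class
# (flow/core/kernel/vehicle/traci.py); get_neighbors is our own name.
def get_neighbors(self, veh_id, mode=0):
    """Expose TraCI's getNeighbors through the Flow kernel.

    `mode` is TraCI's bitset selecting left/right leaders/followers;
    see the SUMO TraCI documentation for the exact encoding.
    """
    return self.kernel_api.vehicle.getNeighbors(veh_id, mode)
```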
Add differentiation between MADDPG for cooperative AAs and DPG for selfish ones [X]
- Reward function for accelerated learning
- Safe, Efficient, and Comfortable Velocity Control based on Reinforcement Learning for Autonomous Driving
- Use gpu support [x]
- Print debug infos and training progress [x]
The evaluation functions are in train_utils.py; they cover the following (see the sketch after the list):
- Reward: min, max, mean; split for kind of agent [X]
- Delay: mean, max, min, standstill (mean jerk ?) [X]
- Actions:
- Acceleration: mean/max/min
- Same as step
- Env params [x]
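A sketch of the reward statistics split per kind of agent; the dict layout is an assumption about our own logging format, not something train_utils.py already defines:

```python
import numpy as np

def reward_stats(rewards_by_agent, is_cooperative):
    """rewards_by_agent: dict agent_id -> list of episode rewards.
    is_cooperative: dict agent_id -> bool (cooperative vs selfish)."""
    stats = {}
    for kind, coop in (("coop", True), ("selfish", False)):
        vals = [r for aid, rs in rewards_by_agent.items()
                if is_cooperative[aid] == coop for r in rs]
        if vals:
            stats[kind] = {"min": np.min(vals),
                           "max": np.max(vals),
                           "mean": np.mean(vals)}
    return stats
```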
This section is about research topics we will add to the project. For every issue we will give a brief description of how it can be addressed, as well as a reference to the paper itself.
For this kind of issue take a look at Ray's Tune framework, which is the one we're using right now.
We need something with the following characteristics:
- No Q-learning such as DQN; Actor/Critic is better.
- IMPALA cannot be used since it lacks an implementation in ray (here)
Check here. It says to use either MADDPG or MARWIL.
From our paper and more: model-based approaches seem to be the ones that work better in MAS envs.
So we can exclude the following:
- PPO
- DQN
- PG
- A2C
- A3C
- DDPG
- TD3
- APEX
More here. Main points:
- No good when reward is zero -> high baseline for reward (must be small)
More here. Main points:
- Formalizes the constraint as a penalty in the objective function.
- Can use a first-order optimizer.
More here. Main points:
- Makes actions with high rewards more likely, and vice versa.
Naive; the task is too hard for this simplistic approach.
More here. Main points:
- The actor directly maps states to actions (the output of the network is directly the action) instead of outputting a probability distribution over a discrete action space.
- Improves stability in learning
More here. Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence “twin”), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
Trick Two: “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-function. The paper recommends one policy update for every two Q-function updates.
Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
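As a concrete illustration of tricks one and three, the target used in the Bellman error can be sketched as follows (plain numpy, not RLlib's implementation; the target networks are passed in as callables):

```python
import numpy as np

def td3_target(reward, next_state, done, q1_target, q2_target, pi_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute the TD3 Bellman target for a single transition."""
    a_next = pi_target(next_state)
    # Trick three: target policy smoothing with clipped Gaussian noise.
    noise = np.clip(np.random.normal(0.0, noise_std, size=np.shape(a_next)),
                    -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, -act_limit, act_limit)
    # Trick one: clipped double-Q, take the smaller of the two target Q-values.
    q_min = np.minimum(q1_target(next_state, a_next),
                       q2_target(next_state, a_next))
    return reward + gamma * (1.0 - done) * q_min
```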
Avoids noisy gradients and high variance. The “Critic” estimates the value function. This could be the action-value (the Q value) or state-value (the V value). The “Actor” updates the policy distribution in the direction suggested by the Critic (such as with policy gradients).
Alternatively, one can use model-based policy optimization, which can learn optimal policies via back-propagation, but this requires a (differentiable) model of the world dynamics and assumptions about the interactions between agents.
Introduced by DeepMind here. Main points:
- Implements parallel training where multiple workers in parallel environments independently update a global value function.
- Not proven to be better than A2C
Main points:
- A2C is like A3C but without the asynchronous part.
More here. Main points:
- SAC is an off-policy algorithm.
- Used for environments with continuous action spaces.
- Entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy. This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on
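For reference, the entropy-regularized objective SAC maximizes is usually written as:

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[ \sum_{t} \gamma^{t} \big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big) \Big] $$

where the temperature $\alpha$ controls the trade-off between expected return and entropy mentioned above.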
Importance Weighted Actor-Learner Architecture
- different actors and learners that can collaborate to build knowledge across different domains
- completely independent actors and learners. This simple architecture enables the learner(s) to be accelerated using GPUs and actors to be easily distributed across many machines.
- mitigate the lag between when actions are generated by the actors and when the learner estimates the gradient.
- efficiently operate in multi-task environments.
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments.
Here's the paper and here's the implementation.
- leads to learned policies that only use local information (i.e. their own observations) at execution time
- does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents
- is applicable not only to cooperative interaction but to competitive or mixed interaction involving both physical and communicative behavior.
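A rough sketch of how this could be wired into RLlib's multi-agent API; the trainer string, spaces, env name and the agent-id convention are all assumptions (and the contrib MADDPG trainer may need extra per-policy options; check its example script):

```python
from gym.spaces import Box
from ray import tune

# Placeholder spaces just to keep the sketch self-contained.
obs_space = Box(low=0.0, high=1.0, shape=(9,))
act_space = Box(low=-3.5, high=10.0, shape=(1,))

policies = {
    "coop": (None, obs_space, act_space, {}),      # cooperative AAs
    "selfish": (None, obs_space, act_space, {}),   # selfish AAs
}

def policy_mapping_fn(agent_id):
    # Hypothetical naming convention for our vehicles.
    return "coop" if agent_id.startswith("coop") else "selfish"

tune.run(
    "contrib/MADDPG",  # trainer name as registered in ray.rllib.contrib
    config={
        "env": "my_flow_env",  # hypothetical registered Flow env name
        "multiagent": {
            "policies": policies,
            "policy_mapping_fn": policy_mapping_fn,
        },
    },
)
```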
Exponentially Weighted Imitation Learning for Batched Historical Data. More here.
Main points:
- applicable to problems with complex nonlinear function approximation
- works well with a hybrid (discrete and continuous) action space: both acceleration and lane switching
- can be used to learn from data generated by an unknown policy
- works with batched historical trajectories
Check this for model type
The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors.
More here. Variations:
- APEX
- APEX_DDPG [X]
More here. Main points:
- ES resembles simple hill-climbing in a high-dimensional space based only on finite differences along a few random directions at each step.
More here. Main points:
- DQN is an RL technique aimed at choosing the best action for the given circumstances (observation).
We won't use it since it is prone to high variance in the training phase.
The scheduler is used to schedule training procedures; it can parallelize and can take the best out of a population of N. It is used for tuning.
- Currently trying population-based training (see the sketch below)
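A minimal sketch of population-based training with Tune; the trainer string, mutated hyperparameters and env name are placeholders, and the scheduler's argument names vary a bit across Ray versions:

```python
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=10,       # how often members are exploited/explored
    hyperparam_mutations={
        "lr": [1e-3, 5e-4, 1e-4],
        "train_batch_size": [1000, 2000, 4000],
    },
)

tune.run(
    "APEX_DDPG",                    # any RLlib trainer string works here
    scheduler=pbt,
    num_samples=4,                  # size of the population
    config={"env": "my_flow_env", "lr": 1e-3, "train_batch_size": 2000},
)
```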
This subsection is dedicated to the RL agent functions
Action space is continuous since we're using acceleration as output.
Something
Ideas
Pros and cons of the 'clip action' function (see the sketch below)
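For context, 'clipping actions' just means squashing whatever the policy outputs into the Box bounds before the command reaches SUMO (bounds here match the action space above); one commonly cited downside is that the policy receives no signal about how far outside the bounds it was:

```python
import numpy as np

def clip_action(action, low=-3.5, high=10.0):
    """Clip a raw policy output into the valid acceleration range."""
    return np.clip(action, low, high)
```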
#### Common way: Actor-Critic Methods
Link. A natural extension of the idea of reinforcement comparison methods to TD learning and to the full reinforcement learning problem.
Application of the Self-Organising Map to Reinforcement Learning
Section 4 provides information for continuous action spaces.