A deep reinforcement learning system that learns optimal liquidity provision strategies for Uniswap V3-style concentrated liquidity AMMs.
A trained PPO agent dynamically repositions liquidity in response to price movements, balancing fee capture against impermanent loss and gas costs.
This project tackles the challenging optimization problem of concentrated liquidity provision in decentralized finance. Unlike traditional AMMs where liquidity is spread uniformly across all prices, Uniswap V3 introduced concentrated liquidity, allowing liquidity providers to allocate capital within specific price ranges. This dramatically increases capital efficiency but introduces complex strategic decisions that are difficult to optimize manually.
The core challenge lies in three interrelated decisions. First, range selection determines the tradeoff between capital efficiency and risk: narrow ranges earn higher fees per unit of liquidity but risk going out-of-range when prices move. Second, repositioning timing involves deciding when the cost of paying gas to adjust a position is justified by improved fee capture or reduced impermanent loss. Third, market adaptation requires understanding how to respond to changing volatility regimes, where strategies optimal in calm markets may fail during high volatility.
The agent learns these strategies through Proximal Policy Optimization (PPO), a widely used policy gradient algorithm. Trained on simulated market interactions (1M timesteps by default), the agent develops policies that account for market microstructure, with the goal of outperforming the naive baseline strategies it is evaluated against.
Training uses vectorized environments (4-8 parallel simulations via Gymnasium's SyncVectorEnv) to collect diverse experience efficiently. Each environment runs with a unique seed, reducing sample correlation and stabilizing policy gradient updates.
Curriculum learning progressively increases market volatility from 35% to 50% annualized over the first 70% of training. Starting with calmer markets provides clearer reward signals for discovering effective strategies, which then transfer to higher-volatility conditions.
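The curriculum can be sketched as a function of training progress. This is a minimal sketch that assumes a linear ramp (the text only says volatility increases progressively); the name `curriculum_volatility` and its parameters are illustrative and not taken from the codebase.

```python
def curriculum_volatility(step: int, total_steps: int,
                          start_vol: float = 0.35, end_vol: float = 0.50,
                          ramp_fraction: float = 0.70) -> float:
    """Annualized volatility for a training step: linear ramp, then flat.

    Assumption: linear interpolation over the first `ramp_fraction` of training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    ramp_steps = ramp_fraction * total_steps
    # Clamp progress to [0, 1] so volatility stays at end_vol after the ramp.
    progress = min(max(step / ramp_steps, 0.0), 1.0)
    return start_vol + progress * (end_vol - start_vol)


# Volatility at the start, mid-ramp, end of ramp, and end of training.
for step in (0, 350_000, 700_000, 1_000_000):
    print(step, round(curriculum_volatility(step, 1_000_000), 4))
```

Holding volatility flat for the final 30% of training gives the policy time to settle at the target regime before evaluation.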
The environment simulates realistic market dynamics using Geometric Brownian Motion (GBM) for price evolution with configurable drift and volatility parameters. Swap events are generated based on price movements, with token deltas calculated using proper AMM liquidity math. The simulation includes competitor liquidity modeling, where heterogeneous LP positions are distributed around the current price with log-normal range widths, creating a realistic liquidity landscape that affects fee sharing.
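For illustration, a GBM price path can be generated with the exact log-normal update. The sketch below assumes hourly steps and uses only the standard library; `simulate_gbm` is a hypothetical helper, not the project's `price_process.py` API.

```python
import math
import random


def simulate_gbm(p0: float, mu: float, sigma: float, n_steps: int,
                 dt: float = 1.0 / (365 * 24), seed=None) -> list:
    """Return a GBM price path of length n_steps + 1.

    mu and sigma are annualized drift and volatility; dt is the step size in
    years (assumed hourly here).
    """
    rng = random.Random(seed)
    prices = [p0]
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        # Exact discretization: log-returns are normally distributed.
        log_return = (mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z
        prices.append(prices[-1] * math.exp(log_return))
    return prices


path = simulate_gbm(p0=2000.0, mu=0.0, sigma=0.5, n_steps=24, seed=42)
print(len(path), min(path) > 0)
```

Because the update multiplies by an exponential, simulated prices stay strictly positive regardless of step size.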
The cost model reflects Layer 2 deployment reality, with gas costs of $0.05 per transaction (realistic for Base or Arbitrum), DEX fees matching the environment's fee tier, and price impact calculated using the constant-product formula based on swap size relative to available liquidity. Pool calibration targets real ETH/USDC pool characteristics with approximately $9.5M TVL and $70M daily volume.
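A rough sketch of how these three cost components might combine is shown below. It makes several assumptions not stated in the text: the pool is treated as a plain constant-product pool with half of the TVL on each side, the fee tier is 0.3%, and `reposition_cost` is a hypothetical name.

```python
def reposition_cost(swap_usd: float, pool_tvl_usd: float = 9_500_000.0,
                    fee_tier: float = 0.003, gas_usd: float = 0.05) -> float:
    """Estimated USD cost of one reposition that swaps `swap_usd` of one token.

    Assumptions: constant-product reserves equal to half the TVL per side,
    and a 0.3% fee tier. Both are illustrative defaults.
    """
    if swap_usd < 0 or pool_tvl_usd <= 0:
        raise ValueError("swap_usd must be >= 0 and pool_tvl_usd > 0")
    reserve_in = pool_tvl_usd / 2.0
    dex_fee = fee_tier * swap_usd
    # Constant product: output = reserve_out * dx / (reserve_in + dx), so the
    # shortfall relative to the spot-price output is dx / (reserve_in + dx).
    impact_fraction = swap_usd / (reserve_in + swap_usd)
    price_impact = swap_usd * impact_fraction
    return gas_usd + dex_fee + price_impact


print(round(reposition_cost(10_000.0), 4))
```

With Layer 2 gas this low, the DEX fee and price impact dominate for any non-trivial swap size, which is why the size of the rebalancing swap matters more than the number of transactions.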
The state space comprises 25 carefully chosen dimensions that provide the agent with actionable market information. Beyond basic price and position state, the observation includes technical indicators such as RSI, Bollinger Band position, and EMA ratios that capture momentum and mean-reversion signals. Market structure features include competitor liquidity distribution parameters (mean and sigma of liquidity-weighted competitor positions), helping the agent understand competitive dynamics and identify opportunities. Temporal features (hour of day, day of week) enable learning of any cyclical patterns in market behavior.
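As an illustration of how two of the technical indicators could be bounded for use as observations, the sketch below computes an RSI rescaled to [-1, 1] and a clipped Bollinger Band position. The window lengths (14 and 20), the simple-average RSI variant, and the function names are assumptions, not details confirmed by the source.

```python
import statistics


def rsi_scaled(prices: list, period: int = 14) -> float:
    """Simple-average RSI over the last `period` changes, mapped to [-1, 1]."""
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    window = prices[-(period + 1):]
    changes = [b - a for a, b in zip(window, window[1:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if gains + losses == 0:
        return 0.0  # flat prices: treat as neutral
    rsi = 100.0 * gains / (gains + losses)  # equals 100 - 100 / (1 + RS)
    return rsi / 50.0 - 1.0


def bollinger_position(prices: list, period: int = 20,
                       num_std: float = 2.0) -> float:
    """Position of the last price within the bands, clipped to [-1, 1]."""
    window = prices[-period:]
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    if std == 0:
        return 0.0  # degenerate bands: report the midpoint
    pos = (prices[-1] - mean) / (num_std * std)
    return max(-1.0, min(1.0, pos))


rising = [float(i) for i in range(1, 31)]
print(rsi_scaled(rising), bollinger_position(rising))
```

Mapping both indicators onto a fixed interval keeps them on a scale comparable to the other observation features.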
The action space is continuous and four-dimensional. The first dimension controls range width, mapping from 5% to 50% total range around the current price. The second dimension controls asymmetry, allowing the agent to skew positions bearishly or bullishly. The third dimension is a hold signal that determines whether to execute a reposition or maintain the current position. The fourth dimension is the shape parameter that controls liquidity distribution within the range. This explicit hold action, combined with a 12-step minimum hold period, teaches the agent to consider repositioning costs strategically rather than churning positions.
A key innovation is the shaped liquidity feature, which allows non-uniform liquidity allocation within a position's price range. Instead of deploying liquidity uniformly (like standard V3 positions), the agent controls the distribution shape:
- Shape = -1 (U-shaped): More liquidity at the edges of the range, useful for capturing fees from large price movements while maintaining presence at the center
- Shape = 0 (Flat): Uniform distribution matching standard V3 behavior
- Shape = +1 (Bell-shaped): Concentrated liquidity at the center, maximizing fee capture when price stays near the middle of the range
This is implemented via 7 sub-positions with weights determined by the shape parameter. Each sub-position covers a portion of the overall range, with liquidity allocated according to the weight function. Fee calculations properly aggregate across all sub-positions based on which ones are active at each price.
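One way to turn the shape parameter into sub-position weights is sketched below. The source does not specify the weight function, so the Gaussian-style form and the `sharpness` constant are assumptions chosen only because they reproduce the three cases listed above (U-shaped, flat, bell-shaped).

```python
import math


def shape_weights(shape: float, n_sub: int = 7, sharpness: float = 2.0) -> list:
    """Normalized liquidity weights for n_sub equal-width sub-positions.

    Assumed weight function: exp(-shape * sharpness * c^2), where c is the
    sub-position center mapped to [-1, 1]. Hypothetical, for illustration.
    """
    if not -1.0 <= shape <= 1.0:
        raise ValueError("shape must be in [-1, 1]")
    if n_sub == 1:
        return [1.0]
    centers = [-1.0 + 2.0 * i / (n_sub - 1) for i in range(n_sub)]
    raw = [math.exp(-shape * sharpness * c * c) for c in centers]
    total = sum(raw)
    return [w / total for w in raw]


for s in (-1.0, 0.0, 1.0):
    print(s, [round(w, 3) for w in shape_weights(s)])
```

Normalizing the weights to sum to one keeps total deployed capital constant, so the shape parameter only changes where liquidity sits within the range.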
Tick spacing can be configured for real V3 deployment compatibility. When set to 60 (corresponding to the 0.3% fee tier), positions snap to valid tick boundaries, ensuring trained policies can execute on actual Uniswap V3 contracts.
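Snapping a price range to valid ticks follows from the V3 relation price = 1.0001^tick. The sketch below rounds the lower bound down and the upper bound up so the snapped range always contains the requested one. It ignores token decimals and token ordering, which a real deployment must handle, and the helper names are hypothetical.

```python
import math

TICK_BASE = 1.0001  # Uniswap V3: price = 1.0001 ** tick


def tick_to_price(tick: int) -> float:
    return TICK_BASE ** tick


def snap_range(lower_price: float, upper_price: float,
               tick_spacing: int = 60) -> tuple:
    """Return (lower_tick, upper_tick) aligned to tick_spacing.

    Assumption: prices are already expressed in the pool's raw token1/token0
    units (no decimal adjustment is applied here).
    """
    raw_lower = math.log(lower_price) / math.log(TICK_BASE)
    raw_upper = math.log(upper_price) / math.log(TICK_BASE)
    lower_tick = math.floor(raw_lower / tick_spacing) * tick_spacing
    upper_tick = math.ceil(raw_upper / tick_spacing) * tick_spacing
    if upper_tick <= lower_tick:
        # Guarantee a non-empty range of at least one spacing.
        upper_tick = lower_tick + tick_spacing
    return lower_tick, upper_tick


lo, hi = snap_range(1900.0, 2100.0)
print(lo, hi, tick_to_price(lo), tick_to_price(hi))
```

Rounding outward slightly widens the position, which is the conservative choice: the position never goes out of range earlier than the policy intended.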
The PPO implementation includes several features critical for stable training in continuous control. Generalized Advantage Estimation (GAE) with lambda=0.95 balances bias and variance in advantage estimates. Observation normalization using running statistics (Welford's algorithm) ensures inputs to the neural network remain well-scaled throughout training. The actor network uses learnable log-standard-deviation parameters with bounds preventing exploration collapse while allowing the policy to become more deterministic as it improves.
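The GAE recursion can be written in a few lines. This is a generic, framework-free sketch rather than the project's `ppo_buffer.py`; the discount factor `gamma=0.99` is an assumed default (the text only specifies lambda=0.95), and `dones[t]` is taken to mean the episode ended after step `t`.

```python
def compute_gae(rewards: list, values: list, dones: list, last_value: float,
                gamma: float = 0.99, lam: float = 0.95) -> tuple:
    """Return (advantages, returns) for a single rollout.

    values[t] is the critic estimate at step t; last_value bootstraps the
    state that follows the final step.
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        non_terminal = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * non_terminal * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae([1.0, 1.0], [0.0, 0.0], [False, False], 0.0,
                       gamma=1.0, lam=1.0)
print(adv, ret)
```

Setting lambda to 1 recovers Monte Carlo returns (high variance, low bias), while lambda of 0 reduces to the one-step TD error; 0.95 sits between the two.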
Training supports checkpoint saving at configurable intervals and full resumption from saved states, including optimizer state, observation normalizer statistics, and random number generator states. TensorBoard integration provides real-time monitoring of rewards, policy entropy, value loss, and curriculum progression.
src/
├── env/
│ ├── ppo_env.py # Gymnasium environment (state/action/reward)
│ ├── amm_math.py # Uniswap V3 liquidity mathematics
│ ├── price_process.py # Market simulation engine
│ └── data_loader.py # Historical data replay capability
├── ppo_networks.py # Actor-Critic neural architecture
├── ppo.py # PPO algorithm implementation
├── ppo_buffer.py # Rollout buffer with GAE
├── train_ppo.py # Training orchestration
├── ppo_baselines.py # Benchmark strategies
├── ppo_evaluate.py # Evaluation framework
└── ppo_visualization.py # Animated visualization generation
# Install dependencies
uv sync
# Train the agent (default: 1M timesteps with curriculum learning)
uv run python -m src.train_ppo
# Train with custom configuration
uv run python -m src.train_ppo --total-timesteps 500000 --n-envs 8 --seed 42
# Disable curriculum learning (use fixed volatility)
uv run python -m src.train_ppo --no-curriculum --volatility 0.5
# Monitor training progress
uv run tensorboard --logdir experiments/
# Evaluate trained agent against baselines
uv run python -m src.ppo_evaluate --baselines-only
# Generate animated visualization
uv run python -m src.ppo_visualization --animate --checkpoint experiments/run/final_model.pt

See the Usage Guide for detailed training options, evaluation procedures, and visualization commands.
| Category | Dimensions | Features | Description |
|---|---|---|---|
| Price & Market | 2 | norm_price, volatility | Current price normalized, realized volatility |
| Position State | 4 | price_to_lower, price_to_upper, range_width_pct, in_range | Log-distance to bounds, range width, binary in-range |
| Position Status | 1 | unrealized_il | Impermanent loss as percentage |
| Performance | 2 | fees_norm, portfolio_norm | Cumulative fees, portfolio value |
| Reposition Cost | 1 | last_reposition_cost_norm | Cost visibility for strategic decisions |
| Rolling Obs | 4 | position_age_saturated, recent_fee_rate, breakeven_progress, portfolio_momentum | Position age (saturates at 24h), rolling 24h fee rate, breakeven progress vs reposition cost, 24h portfolio momentum (tanh-bounded) |
| Technical | 4 | volatility_ratio, ema_ratio, rsi, bb_position | Short/long vol, EMA12/26, RSI [-1,1], Bollinger |
| Momentum | 3 | momentum_short, momentum_medium, momentum_long | 1h/24h/168h momentum signals |
| Market Structure | 2 | competitor_mean, competitor_sigma | Liquidity-weighted center and spread of competitor positions |
| Temporal | 2 | hour_of_day, day_of_week | Cyclical time features |
| Dimension | Range | Mapping | Description |
|---|---|---|---|
| range_width | [0, 1] | 5% → 50% | Total range around current price |
| asymmetry | [-1, 1] | Bearish → Bullish | -1: all below price, 0: centered, +1: all above |
| hold_signal | [0, 1] | Hold / Reposition | ≤0.5: hold current position, >0.5: execute reposition |
| shape | [-1, 1] | U-shaped → Bell | Liquidity distribution (-1: edges, 0: flat, +1: center) |
The agent always deploys 100% of capital when repositioning via 7 sub-positions with shape-controlled weights. A 6-step (6-hour) minimum hold period prevents excessive churning.
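The mapping from a raw action vector to a concrete price range might look like the sketch below. The linear width mapping and the way asymmetry splits the range above and below the price are assumptions consistent with the action table, and `decode_action` is a hypothetical name. The minimum hold period is passed in as a parameter rather than hard-coded.

```python
def _clip(x: float, low: float, high: float) -> float:
    return max(low, min(high, x))


def decode_action(action, price: float, steps_since_reposition: int,
                  min_hold_steps: int):
    """Decode (range_width, asymmetry, hold_signal, shape) into a target range.

    Returns None when the agent holds, otherwise a dict with price bounds.
    Assumed mapping: width is linear in [5%, 50%]; asymmetry in [-1, 1] sets
    the fraction of the range placed above the current price.
    """
    range_width, asymmetry, hold_signal, shape = action
    range_width = _clip(range_width, 0.0, 1.0)
    asymmetry = _clip(asymmetry, -1.0, 1.0)
    shape = _clip(shape, -1.0, 1.0)

    # Hold if the signal is <= 0.5 or the minimum hold period has not elapsed.
    if hold_signal <= 0.5 or steps_since_reposition < min_hold_steps:
        return None

    width_pct = 0.05 + range_width * (0.50 - 0.05)
    frac_above = (1.0 + asymmetry) / 2.0  # 0: all below, 0.5: centered, 1: all above
    lower = price * (1.0 - width_pct * (1.0 - frac_above))
    upper = price * (1.0 + width_pct * frac_above)
    return {"lower": lower, "upper": upper, "shape": shape}


print(decode_action((0.0, 0.0, 1.0, 0.0), 2000.0, 10, 6))
```

Gating the reposition on both the hold signal and the elapsed steps means the environment, not the policy, enforces the minimum hold period.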
reward = (portfolio_change) - (hold_baseline_change)
The reward uses baseline subtraction to isolate the agent's actual value-add from pure price movements. The hold baseline tracks what the same token mix would be worth if simply held (no LP, no fees). This removes price-driven noise that would otherwise dominate the reward signal.
The adjusted reward captures only:
- Fees earned — the agent's primary value-add
- Impermanent loss — the cost of providing concentrated liquidity vs holding
- Reposition costs — gas, DEX fees, and price impact
Pure price movements are removed, allowing the agent to learn strategies based on fee capture and loss minimization rather than inadvertently learning to predict market direction.
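The baseline subtraction described above can be sketched as follows. It assumes a two-token portfolio valued in the quote token (for example, ETH priced in USDC) and omits any reward scaling the real environment may apply; the function and argument names are illustrative.

```python
def baseline_adjusted_reward(portfolio_prev: float, portfolio_now: float,
                             hold_tokens: tuple, price_prev: float,
                             price_now: float) -> float:
    """reward = portfolio_change - hold_baseline_change.

    hold_tokens = (base_amount, quote_amount) is the token mix the hold
    baseline tracks; prices are quote per base.
    """
    base_amount, quote_amount = hold_tokens
    hold_prev = base_amount * price_prev + quote_amount
    hold_now = base_amount * price_now + quote_amount
    portfolio_change = portfolio_now - portfolio_prev
    hold_change = hold_now - hold_prev
    return portfolio_change - hold_change


# Price rises 2000 -> 2100; the LP portfolio gains 105 while holding gains 100.
print(baseline_adjusted_reward(4000.0, 4105.0, (1.0, 2000.0), 2000.0, 2100.0))
```

In this example the 100 units of price-driven gain cancel out, leaving only the residual attributable to fees, impermanent loss, and costs.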
The agent is evaluated against several baseline strategies to contextualize performance. The Hold baseline simply holds the initial token allocation without providing liquidity, representing the opportunity cost of LP participation. Full-Range LP deploys liquidity across the entire price range in V2 style, earning fees on all trades but with minimal capital efficiency. Fixed Narrow uses a static tight range with reactive repositioning when price exits the range, representing a simple active strategy. Adaptive Narrow adjusts range width based on recent volatility, widening during volatile periods and tightening during calm markets.
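To make the two active baselines concrete, here is a hedged sketch of a reactive out-of-range trigger and a volatility-scaled range width. The scaling rule (covering plus or minus `k` standard deviations of the expected move over a horizon) and every parameter value are assumptions for illustration; the source does not describe how `ppo_baselines.py` computes them.

```python
import math


def needs_reposition(price: float, lower: float, upper: float) -> bool:
    """Reactive rule: reposition only when price has left the range."""
    return not (lower <= price <= upper)


def adaptive_range_width(recent_vol_annual: float, horizon_hours: float = 24.0,
                         k: float = 2.0, min_width: float = 0.05,
                         max_width: float = 0.50) -> float:
    """Total range width (as a fraction of price) scaled by recent volatility.

    Assumed rule: cover +/- k standard deviations of the expected price move
    over the horizon, clamped to [min_width, max_width].
    """
    horizon_years = horizon_hours / (365 * 24)
    expected_move = recent_vol_annual * math.sqrt(horizon_years)
    width = 2.0 * k * expected_move
    return max(min_width, min(max_width, width))


print(adaptive_range_width(0.35), adaptive_range_width(0.50))
print(needs_reposition(2300.0, 1900.0, 2100.0))
```

A wider range during volatile periods trades fee concentration for fewer out-of-range episodes, which is the same tradeoff the learned policy must navigate.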
The concentrated liquidity implementation includes full Uniswap V3 tick math with proper liquidity calculations for position deployment and withdrawal. Swap mechanics correctly handle range crossings, calculating the fraction of each swap that occurs within a position's active range. Impermanent loss is tracked in real-time, computed as the difference between position value and the value of holding the equivalent tokens.
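The standard V3 relations between liquidity, price bounds, and token amounts are enough to compute impermanent loss as described. The sketch below uses floating-point square roots rather than the fixed-point tick math of the contracts, excludes fees, and uses hypothetical function names.

```python
import math


def position_amounts(liquidity: float, price: float,
                     lower: float, upper: float) -> tuple:
    """Token amounts (base, quote) held by a V3 position at `price`."""
    sa, sb = math.sqrt(lower), math.sqrt(upper)
    if price <= lower:   # fully in the base token
        return liquidity * (1.0 / sa - 1.0 / sb), 0.0
    if price >= upper:   # fully in the quote token
        return 0.0, liquidity * (sb - sa)
    sp = math.sqrt(price)
    return liquidity * (1.0 / sp - 1.0 / sb), liquidity * (sp - sa)


def position_value(liquidity: float, price: float,
                   lower: float, upper: float) -> float:
    base, quote = position_amounts(liquidity, price, lower, upper)
    return base * price + quote


def impermanent_loss(liquidity: float, entry_price: float, price: float,
                     lower: float, upper: float) -> float:
    """Position value minus the value of holding the entry tokens (<= 0)."""
    base0, quote0 = position_amounts(liquidity, entry_price, lower, upper)
    hold_value = base0 * price + quote0
    return position_value(liquidity, price, lower, upper) - hold_value


print(impermanent_loss(1000.0, 2000.0, 2100.0, 1800.0, 2200.0))
```

Before fees, this quantity is never positive: the position sells the appreciating token on the way up and buys the depreciating one on the way down, so it always lags the hold portfolio.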
Numerical stability receives careful attention throughout. All observations are bounded to finite ranges, preventing NaN propagation in the neural network. Gradient clipping and advantage normalization stabilize policy updates. The learnable action variance uses bounded log-standard-deviation to prevent both exploration collapse and excessive randomness.
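These stabilizers are each small in code. The following framework-free sketch shows advantage normalization, global-norm gradient clipping, and log-standard-deviation clamping. The bounds and thresholds are placeholder values, not the project's settings, and a real implementation would operate on tensors rather than Python lists.

```python
import math
import statistics


def normalize_advantages(advantages: list, eps: float = 1e-8) -> list:
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


def clip_by_global_norm(grads: list, max_norm: float) -> list:
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]


def clamp_log_std(log_std: float, low: float = -2.0, high: float = 0.5) -> float:
    """Bound the policy's log-std (assumed bounds, for illustration only).

    The lower bound prevents exploration collapse; the upper bound prevents
    excessive randomness.
    """
    return max(low, min(high, log_std))


print(normalize_advantages([1.0, 2.0, 3.0]))
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
print(clamp_log_std(-5.0), clamp_log_std(2.0))
```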
For background on concentrated liquidity mechanics, the Uniswap V3 Whitepaper provides the authoritative reference. The Concentrated Liquidity Math document offers detailed derivations of the mathematical relationships between liquidity, price bounds, and token amounts.
The original PPO Paper by Schulman et al. introduces the algorithm and its theoretical motivation. For implementation guidance, the blog post Implementation Details of PPO covers practical considerations that significantly impact training performance.
MIT
