Teaching DX1, my quadruped robot, how to walk using reinforcement learning.
DX1.MOV
This project began by designing a custom robot dog model in Fusion 360 and importing it into Isaac Sim as a USD file. The model is split into separate USD files for base geometry, physics properties, and sensor configurations. Terrain and rigid body simulation parameters were then configured to support reinforcement learning training.
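For context, the sketch below shows how such an asset might be registered as an Isaac Lab-style articulation config. This is an illustration only: the USD path, joint names, spawn height, and actuator gains are placeholder assumptions, and the exact module paths depend on the installed Isaac Lab version.

```python
import omni.isaac.lab.sim as sim_utils
from omni.isaac.lab.actuators import ImplicitActuatorCfg
from omni.isaac.lab.assets import ArticulationCfg

# Hypothetical config for the DX1 USD asset; paths, joint names, and
# gains below are placeholders, not the project's actual values.
DX1_CFG = ArticulationCfg(
    spawn=sim_utils.UsdFileCfg(
        usd_path="path/to/dx1.usd",  # placeholder path
        rigid_props=sim_utils.RigidBodyPropertiesCfg(max_depenetration_velocity=1.0),
        articulation_props=sim_utils.ArticulationRootPropertiesCfg(
            enabled_self_collisions=False, solver_position_iteration_count=4
        ),
    ),
    init_state=ArticulationCfg.InitialStateCfg(
        pos=(0.0, 0.0, 0.35),   # spawn height above the terrain (illustrative)
        joint_pos={".*": 0.0},  # default joint angles, matched by regex
    ),
    actuators={
        "legs": ImplicitActuatorCfg(
            joint_names_expr=[".*"],  # one actuator group for all joints here
            stiffness=25.0,           # PD gains are illustrative only
            damping=0.5,
        ),
    },
)
```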
Choosing the right reward functions is critical for a proper walking gait to emerge; otherwise, the agent can learn unwanted behaviors like crawling (dragging the body) or jumping (excessive vertical motion). This project uses the PPO algorithm, chosen for its on-policy nature and built-in clipping mechanism. On-policy learning provides more stable, conservative updates by only using data gathered with the current policy, whereas off-policy methods like SAC reuse old data (more sample-efficient, but prone to distribution mismatch). PPO's clipping constrains how far each update can move the policy, which helps in a setting where large policy changes can destabilize training.
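As a refresher on the clipping mechanism, here is a minimal PyTorch sketch of the clipped surrogate loss that PPO optimizes (not the training code used in this project); log_probs, old_log_probs, and advantages are assumed to come from rollouts collected with the current policy.

```python
import torch

def ppo_clipped_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper, returned as a loss to minimize."""
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms.
    surrogate = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the element-wise minimum removes the incentive to push the ratio outside the clip range.
    return -torch.min(surrogate, clipped).mean()

# Example usage with dummy tensors.
lp = torch.randn(64)
old_lp = lp + 0.05 * torch.randn(64)
adv = torch.randn(64)
loss = ppo_clipped_loss(lp, old_lp, adv)
```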
Primary task rewards:
- track_lin_vel_xy_exp: Primary task reward for tracking desired linear velocity commands in the xy-plane.
- track_ang_vel_z_exp: Rewards tracking desired angular velocity (yaw) commands.
Behavior-shaping rewards:
- feet_air_time: Rewards proper foot lifting above a threshold (0.5 s), preventing crawling behavior and encouraging rhythmic step cycles.
- flat_orientation_l2: Maintains stable body orientation during locomotion.
- undesired_contacts: Prevents contact on non-foot body parts (thighs, base), ensuring only the feet touch the ground.
Stability and efficiency penalties:
- lin_vel_z_l2: Penalizes vertical velocity to prevent jumping and keep the robot grounded.
- ang_vel_xy_l2: Penalizes unwanted angular velocities (roll/pitch) for stability.
- dof_torques_l2: Encourages energy-efficient movements by penalizing joint torques.
- action_rate_l2: Penalizes rapid action changes for smoother control.
Together, these rewards shape a stable, natural-looking, omnidirectional walking gait that tracks velocity commands while avoiding crawling and jumping behaviors.
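To make the shaping concrete, here is a minimal PyTorch sketch of how a few of these terms could be computed per environment step. The tracking kernel width, thresholds, and signs/weights are illustrative assumptions, not the project's tuned values.

```python
import torch

def track_lin_vel_xy_exp(cmd_vel_xy, base_vel_xy, std=0.5):
    """Exponential kernel on xy velocity tracking error; approaches 1.0 when tracking is perfect."""
    error = torch.sum((cmd_vel_xy - base_vel_xy) ** 2, dim=-1)
    return torch.exp(-error / std**2)

def lin_vel_z_l2(base_vel_z):
    """Squared vertical velocity penalty; discourages hopping and jumping."""
    return base_vel_z ** 2

def feet_air_time_reward(air_time, first_contact, cmd_vel_xy, threshold=0.5):
    """Reward air time beyond a threshold on touchdown, only while a motion command is active."""
    # air_time: (num_envs, num_feet) seconds each foot has been airborne
    # first_contact: (num_envs, num_feet) bool, True on the step a foot touches down
    reward = torch.sum((air_time - threshold) * first_contact.float(), dim=-1)
    # Gate the reward so a robot commanded to stand still is not pushed to step in place.
    return reward * (torch.norm(cmd_vel_xy, dim=-1) > 0.1).float()

def action_rate_l2(action, prev_action):
    """Squared change in actions between steps; encourages smooth control."""
    return torch.sum((action - prev_action) ** 2, dim=-1)
```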
To bridge the sim-to-real gap, domain randomization is applied during training: the robot's mass distribution, initial poses and velocities, and joint positions are randomized, and periodic external disturbances (lateral pushes every 10-15 s) are injected. This makes the policy robust to real-world variations in terrain and unmodeled dynamics, improving transfer from simulation to the physical robot.
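Randomizations like these are typically expressed as event terms in Isaac Lab's manager-based workflow. The sketch below is an assumption-laden illustration, not the project's actual config: the body names, randomization ranges, and exact mdp function parameters are placeholders and vary by Isaac Lab version.

```python
from omni.isaac.lab.managers import EventTermCfg as EventTerm
from omni.isaac.lab.managers import SceneEntityCfg
import omni.isaac.lab.envs.mdp as mdp

# Illustrative mass randomization; the range and body name are placeholders.
add_base_mass = EventTerm(
    func=mdp.randomize_rigid_body_mass,
    mode="startup",  # randomize once per environment at startup
    params={
        "asset_cfg": SceneEntityCfg("robot", body_names="base"),
        "mass_distribution_params": (-0.5, 0.5),  # kg added to the base link
        "operation": "add",
    },
)

# Illustrative periodic push; matches the 10-15 s disturbance interval described above.
push_robot = EventTerm(
    func=mdp.push_by_setting_velocity,
    mode="interval",                # applied periodically during an episode
    interval_range_s=(10.0, 15.0),  # one push every 10-15 s
    params={"velocity_range": {"x": (-0.5, 0.5), "y": (-0.5, 0.5)}},
)
```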
