This is the repository for the LinkedIn Learning course Operating AI Agents: Failure and Recovery. The full course is available from [LinkedIn Learning][lil-course-url].
As AI agents shift from experimentation to production, operational failures can create serious business risks. This intermediate course explores practical techniques for monitoring agent behavior, tracing execution paths, and identifying failure modes across single‑ and multi‑agent systems. Through hands-on GitHub Codespaces exercises, you learn how to implement rollback mechanisms, build automated recovery workflows, and create reports that surface agent health and system status in real time. By the end of the course, you’ll have the skills to improve the safety and predictability of AI agents in production, and to respond quickly and effectively when failures occur.
- This course, Operating AI Agents: Failure and Recovery, is the second course in the governing AI agents series. The first course is Governing AI Agents: Visibility and Control.
- Python 3.9+
- An OpenAI API key
- Clone this repo (or download the files).
- Create and activate a virtual environment:
  - `python -m venv venv`
  - `source venv/bin/activate` (macOS/Linux)
  - `venv\Scripts\activate` (Windows)
- Install dependencies:
pip install -r requirements.txt
- Set your OpenAI API key as an environment variable, or place it in a `.env` file:
  - `export OPENAI_API_KEY="your_api_key"` (macOS/Linux)
  - `setx OPENAI_API_KEY "your_api_key"` (Windows PowerShell)
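
Before starting the exercises, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the course code: the function names (`read_env_file`, `find_api_key`, `preflight`) are hypothetical, and the `.env` parser assumes simple `KEY=VALUE` lines only. The course exercises may load the key differently.

```python
import os
import sys
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines from a .env-style file into a dict.

    Assumption: one assignment per line, optional surrounding quotes,
    no multi-line values or variable expansion.
    """
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def find_api_key(env_file=".env", environ=None):
    """Return the key from the environment, else from env_file, else None."""
    environ = os.environ if environ is None else environ
    from_env = environ.get("OPENAI_API_KEY")
    return from_env or read_env_file(env_file).get("OPENAI_API_KEY")


def preflight(env_file=".env", environ=None):
    """Return a list of human-readable setup problems (empty if none)."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9 or newer is required.")
    if not find_api_key(env_file, environ):
        problems.append(
            "OPENAI_API_KEY is not set and was not found in " + str(env_file)
        )
    return problems
```

Calling `preflight()` from the repository root returns an empty list when both requirements are met. Otherwise, each entry describes one thing to fix.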
Kesha Williams
Award-Winning Tech Innovator and AI/ML Leader