Problem Statement
When multi-step AI workflows fail partway through execution, users must restart from the beginning. This results in:
- Loss of completed work and progress
- Unnecessary costs from re-running successful tasks
- Wasted time repeating already-completed operations
- Frustration when long-running workflows fail near completion
User Stories
As a user, I want to:
- Resume failed workflows from the last successful checkpoint
- See which tasks completed successfully before failure
- Review outputs from completed tasks
- Choose whether to resume, re-run specific tasks, or start fresh
- Understand what external data has changed since the failure
- Make informed decisions about resume safety
Functional Requirements
Core Resume Capability
- System preserves outputs from each completed task
- Failed executions can be resumed from any completed checkpoint
- Previous task outputs are available as context for remaining tasks
- Parent-child execution relationships are maintained for audit trail
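As a rough illustration of these requirements, a checkpoint record can be as small as a task key, a serialized output, and a link back to the parent execution. The dataclasses below are a minimal sketch with hypothetical names (TaskCheckpoint, ResumedExecution); they are not the existing ExecutionTrace schema:

```python
# Hypothetical sketch of checkpoint records; field names are illustrative,
# not the actual ExecutionTrace schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class TaskCheckpoint:
    """Output preserved for one completed task within an execution."""
    execution_id: str          # the run this checkpoint belongs to
    task_key: str              # stable identifier of the task in the workflow
    output: dict[str, Any]     # serialized task result, reusable as context
    completed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class ResumedExecution:
    """Links a resumed run back to the failed run it continues (audit trail)."""
    execution_id: str
    parent_execution_id: str
    reused_checkpoints: list[TaskCheckpoint] = field(default_factory=list)
```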
Data Consistency Protection
- System detects changes in external dependencies (vector searches, databases, APIs)
- Risk assessment shows impact of changes on workflow consistency
- Multiple resume strategies offered based on risk level
- Clear visual indicators of resume safety (green/yellow/orange/red)
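One way the color coding could be derived is by reducing the set of detected dependency changes to a single risk level. The sketch below assumes a hypothetical severity field on each detected change; the categories and thresholds are illustrative, not a specification:

```python
# Illustrative risk scoring only; change categories and severities are assumptions.
from enum import Enum


class ResumeRisk(str, Enum):
    GREEN = "green"    # no external changes detected; resume is safe
    YELLOW = "yellow"  # minor changes; completed outputs likely still valid
    ORANGE = "orange"  # significant changes; re-running affected tasks advised
    RED = "red"        # critical dependencies changed; fresh run recommended


def assess_resume_risk(changed_dependencies: list[dict]) -> ResumeRisk:
    """Map detected dependency changes to a color-coded risk level."""
    if not changed_dependencies:
        return ResumeRisk.GREEN
    severities = {c.get("severity", "minor") for c in changed_dependencies}
    if "critical" in severities:
        return ResumeRisk.RED
    if "major" in severities:
        return ResumeRisk.ORANGE
    return ResumeRisk.YELLOW
```

Collapsing to the strictest detected severity keeps the indicator conservative: a single critical change is enough to surface a red warning even if everything else is unchanged.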
User Controls
- "Resume" button on failed/stopped executions
- Resume analysis dialog showing:
  - Completed vs. remaining tasks
  - Detected changes in external systems
  - Risk assessment and recommendations
  - Available resume strategies
- Option to re-run only affected tasks
- Force resume option for advanced users
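For reference, the resume analysis dialog could be driven by a single payload covering completed vs. remaining tasks, detected changes, risk, and the strategy list. The shape below is a hypothetical example, not an existing API contract:

```python
# Example shape of a resume-analysis response the dialog could render;
# field names and strategy identifiers are hypothetical.
resume_analysis = {
    "execution_id": "exec-123",
    "completed_tasks": ["fetch_documents", "summarize_sections"],
    "remaining_tasks": ["draft_report", "final_review"],
    "detected_changes": [
        {"source": "vector_store", "detail": "index updated", "severity": "minor"},
    ],
    "risk": "yellow",
    "recommended_strategy": "resume_from_checkpoint",
    "available_strategies": [
        "resume_from_checkpoint",   # skip completed tasks, reuse outputs
        "rerun_affected_tasks",     # re-run only tasks touched by changes
        "start_fresh",              # discard checkpoints and restart
        "force_resume",             # advanced: ignore detected changes
    ],
}
```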
Acceptance Criteria
- Task outputs are automatically preserved during execution
- Failed executions display a "Resume" action
- Resume analysis correctly identifies data consistency risks
- Users can view completed task outputs before resuming
- System provides clear risk assessment with color coding
- Multiple resume strategies are available based on risk
- Resumed workflows skip already-completed tasks
- Previous outputs are injected as context for remaining tasks
- Audit trail links resumed execution to original
- Old checkpoint data is cleaned up after configurable retention period
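A resume pass that satisfies the skip-and-inject criteria could look roughly like the following. The run_task callable and the tasks' key attribute are assumptions for the sketch; the real execution engine would supply its own equivalents:

```python
# Minimal resume loop sketch; run_task and the workflow/task structure are
# assumptions, not the platform's actual execution engine.
def resume_workflow(tasks, checkpoints, run_task):
    """Execute remaining tasks, reusing outputs of already-completed ones."""
    completed = {cp.task_key: cp.output for cp in checkpoints}
    context = dict(completed)  # prior outputs injected as context

    for task in tasks:
        if task.key in completed:
            continue  # skip tasks that already succeeded in the original run
        result = run_task(task, context=context)
        context[task.key] = result  # make new output available downstream
    return context
```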
Success Metrics
- Reduction in repeated task executions by 60-80%
- Cost savings from avoided LLM API calls
- Improved workflow completion rate
- User satisfaction with resume experience
- Reduced time to recover from failures
Out of Scope
- Modifying CrewAI core library
- Real-time workflow state synchronization
- Automatic resume without user intervention
- Cross-version workflow compatibility
Dependencies
- Existing ExecutionTrace system for output capture
- TaskStatus tracking for execution state
- Vector search and database dependency tracking
- Frontend dialog components for user interaction
Priority
High - Directly impacts user productivity and platform costs
Labels
enhancement, user-experience, execution, resilience, cost-optimization
Estimated Impact
- User Benefit: High - Saves time and reduces frustration
- Cost Benefit: High - Reduces unnecessary LLM API calls
- Technical Complexity: Medium - Builds on existing infrastructure
- Risk: Low-Medium - With proper safety checks
Additional Context
This feature is particularly valuable for:
- Long-running analytical workflows
- Multi-stage data processing pipelines
- Workflows with expensive LLM operations
- Development and testing iterations
- Production workflows with transient failure points
The implementation should prioritize safety and transparency, ensuring users understand the implications of resuming with potentially changed data while still providing flexibility for different use cases.