This document outlines the complete implementation plan for adding video/audio recording capabilities to the AI-powered interview coach system. The implementation will be 100% local and open-source, integrating seamlessly with the existing architecture.
- Backend: FastAPI with SQLAlchemy ORM, PostgreSQL database, Redis caching
- Frontend: React/TypeScript with Material-UI components
- Interview Flow: Session creation → Question display → Answer submission → AI evaluation
- File Storage: Local storage pattern in `uploads/` directory with UUID-prefixed filenames
- Database: Alembic migrations, soft deletes, proper indexing and relationships
- Answer Model: Currently stores `answer_text`, `time_taken`, and `submitted_at`
- File Upload Pattern: Existing `file_upload.py` utility for resume uploads
- API Structure: RESTful endpoints under the `/interviews` namespace
- Frontend Components: `InterviewSessionPage.tsx` handles question display and answer submission
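For context, the UUID-prefix naming pattern referenced above can be sketched roughly as follows (a hypothetical helper, not the project's actual `file_upload.py`):

```python
import uuid
from pathlib import Path

def uuid_prefixed_name(original_filename: str) -> str:
    """Prefix an uploaded filename with a random UUID to avoid collisions."""
    # Keep only the basename so path components can't sneak into storage
    safe_name = Path(original_filename).name
    return f"{uuid.uuid4().hex}_{safe_name}"

name = uuid_prefixed_name("resume.pdf")  # e.g. '3f2b...c1_resume.pdf'
```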
File: Ai_powered_interview_coach/backend/app/models/answer.py
Add new columns to support recording data:
```python
# Recording fields
audio_url = Column(String(500), nullable=True)
video_url = Column(String(500), nullable=True)
recording_duration = Column(Float, nullable=True)  # Duration in seconds
recording_format = Column(String(20), nullable=True)  # webm, mp4, etc.
transcription = Column(Text, nullable=True)
voice_analysis = Column(JSON, nullable=True)  # Speaking pace, filler words, etc.
```

File: Ai_powered_interview_coach/backend/alembic/versions/008_add_recording_fields.py
Generate an Alembic migration to add the recording fields to the existing answers table:
- Add columns with NULL defaults for backward compatibility
- Ensure existing functionality continues working
- Add indexes on the `audio_url` and `video_url` columns for efficient queries
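The backward-compatibility property (new columns default to NULL, so existing rows stay valid) can be illustrated with plain SQL against an in-memory database; the real migration would use Alembic's `op.add_column` instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answers (id INTEGER PRIMARY KEY, answer_text TEXT)")
conn.execute("INSERT INTO answers (answer_text) VALUES ('existing answer')")

# Equivalent of the Alembic migration: add nullable recording columns
for ddl in [
    "ALTER TABLE answers ADD COLUMN audio_url VARCHAR(500)",
    "ALTER TABLE answers ADD COLUMN video_url VARCHAR(500)",
    "ALTER TABLE answers ADD COLUMN recording_duration FLOAT",
    "ALTER TABLE answers ADD COLUMN recording_format VARCHAR(20)",
    "ALTER TABLE answers ADD COLUMN transcription TEXT",
]:
    conn.execute(ddl)
conn.execute("CREATE INDEX ix_answers_audio_url ON answers (audio_url)")

# Existing rows are untouched; the new columns read back as NULL
row = conn.execute("SELECT answer_text, audio_url FROM answers").fetchone()
```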
File: Ai_powered_interview_coach/backend/app/services/media_service.py
Create a comprehensive media processing service:
```python
class MediaService:
    def __init__(self):
        self.whisper_model = None  # Lazy load

    async def process_recording(self, audio_file, user_id: int,
                                question_id: int, video_file=None):
        # 1. Validate files (format, size, duration)
        # 2. Save to permanent storage
        # 3. Extract audio from video if needed
        # 4. Run Whisper transcription
        # 5. Perform voice analysis with librosa
        # 6. Return structured results
```

Key Features:
- Singleton Whisper Model: Load once, reuse for all transcriptions
- CPU/GPU Detection: Automatically use the GPU if available, falling back to CPU
- File Validation: Check format (webm), size (100MB max), duration limits
- Error Handling: Graceful degradation if transcription fails
- Memory Management: Clean temporary files, prevent memory leaks
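The validation step might look roughly like this (a sketch; the limits come from the list above, the `mp4` fallback format and 10-minute duration cap are assumptions, and `python-magic` content-type checks are omitted):

```python
from typing import Optional, Tuple

MAX_SIZE_BYTES = 100 * 1024 * 1024   # 100 MB cap from the plan
ALLOWED_FORMATS = {"webm", "mp4"}    # webm is primary; mp4 assumed as fallback

def validate_recording(filename: str, size_bytes: int,
                       duration_s: Optional[float] = None,
                       max_duration_s: float = 600.0) -> Tuple[bool, str]:
    """Return (ok, reason) for an uploaded recording."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_FORMATS:
        return False, f"unsupported format: {ext or 'none'}"
    if size_bytes > MAX_SIZE_BYTES:
        return False, "file exceeds 100MB limit"
    if duration_s is not None and duration_s > max_duration_s:
        return False, "recording too long"
    return True, "ok"
```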
File: Ai_powered_interview_coach/backend/app/utils/media_storage.py
Extend the existing file storage pattern:

```
# Directory structure
storage/
├── media/
│   ├── audio/   # Audio recordings
│   ├── video/   # Video recordings
│   └── temp/    # Temporary processing files
```

Security Features:
- User-specific subdirectories
- Filename format: `user_{user_id}_question_{question_id}_{timestamp}.webm`
- Path traversal prevention
- File access authorization
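Path traversal prevention can be handled by resolving every user-supplied path and rejecting anything that escapes the media root (a minimal sketch; `MEDIA_ROOT` is the storage directory above):

```python
from pathlib import Path

MEDIA_ROOT = Path("storage/media").resolve()

def safe_media_path(relative_path: str) -> Path:
    """Resolve a user-supplied path; reject anything outside MEDIA_ROOT."""
    candidate = (MEDIA_ROOT / relative_path).resolve()
    if not candidate.is_relative_to(MEDIA_ROOT):
        raise ValueError("path traversal attempt blocked")
    return candidate

# safe_media_path("audio/user_1_question_2_123.webm") -> allowed
# safe_media_path("../../../../etc/passwd")           -> raises ValueError
```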
File: Ai_powered_interview_coach/backend/app/routes/media.py
```python
@router.post("/upload-recording")
async def upload_recording(
    audio_file: UploadFile,
    video_file: Optional[UploadFile] = None,
    question_id: int = Form(...),
    session_id: int = Form(...),
    current_user: User = Depends(get_current_user),
    db: Session = Depends(get_db),
):
    # 1. Validate session ownership
    # 2. Process recording files
    # 3. Update answer record with recording data
    # 4. Return transcription and analysis results
```

File: Ai_powered_interview_coach/backend/app/main.py
Mount static file serving for recordings:
```python
app.mount("/media", StaticFiles(directory="storage/media"), name="media")
```

Authorization: Middleware to ensure users can only access their own recordings.
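One way to enforce that ownership rule is to check the filename convention defined earlier (a sketch of a hypothetical helper; the real middleware would plug into FastAPI's dependency system):

```python
def user_owns_recording(filename: str, user_id: int) -> bool:
    """Recordings are named user_{user_id}_question_{question_id}_{timestamp}.webm,
    so ownership can be checked from the filename prefix alone. The trailing
    '_question_' prevents user 7 from matching files owned by user 70."""
    return filename.startswith(f"user_{user_id}_question_")

owns = user_owns_recording("user_7_question_3_1700000000.webm", 7)  # True
```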
File: Ai_powered_interview_coach/backend/app/routes/media.py
```python
@router.get("/health")
async def media_health_check():
    return {
        "whisper_loaded": whisper_model is not None,
        "storage_accessible": check_storage_directories(),
        "processing_ready": True,
    }
```

File: Ai_powered_interview_coach/frontend/src/hooks/useMediaRecorder.ts
Custom React hook for recording functionality:
```typescript
export const useMediaRecorder = () => {
  const [isRecording, setIsRecording] = useState(false);
  const [recordingTime, setRecordingTime] = useState(0);
  const [mediaRecorder, setMediaRecorder] = useState<MediaRecorder | null>(null);

  const startRecording = async (includeVideo: boolean = false) => {
    // Request permissions
    // Initialize MediaRecorder
    // Start recording timer
  };

  const stopRecording = async (): Promise<Blob[]> => {
    // Stop recording
    // Return audio/video blobs
  };
};
```

File: Ai_powered_interview_coach/frontend/src/components/interview/RecordingControls.tsx
Recording interface component:
```typescript
interface RecordingControlsProps {
  onRecordingComplete: (audioBlob: Blob, videoBlob?: Blob) => void;
  disabled?: boolean;
}

export const RecordingControls: React.FC<RecordingControlsProps> = ({
  onRecordingComplete,
  disabled = false,
}) => {
  // Recording button with visual feedback
  // Timer display
  // Permission status
  // Error handling UI
};
```

File: Ai_powered_interview_coach/frontend/src/pages/interview/InterviewSessionPage.tsx
Integrate recording into the existing interview flow:
- Add recording controls below answer textarea
- Upload recording when answer is submitted
- Display transcription results
- Show voice analysis feedback
File: Ai_powered_interview_coach/backend/requirements.txt
Add new dependencies:
```
# Audio/Video Processing
faster-whisper==0.10.0
librosa==0.10.1
soundfile==0.12.1
ffmpeg-python==0.2.0

# File format support
python-magic==0.4.27
```
- FFmpeg: Required for audio/video processing
- CUDA (optional): For GPU-accelerated Whisper transcription
- Storage: Adequate disk space for recording files
Using librosa for voice analysis:
- Speaking Pace: Words per minute calculation
- Pause Analysis: Silence detection and timing
- Filler Words: Detection of "um", "uh", "like", etc.
- Volume Consistency: Audio level analysis
- Confidence Indicators: Voice stability metrics
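Two of these metrics can be sketched from the transcript alone (illustrative helpers only; the real service would also analyze the audio signal with librosa, and multi-word fillers like "you know" need token-pair matching):

```python
FILLER_WORDS = {"um", "uh", "like"}

def speaking_pace_wpm(transcript: str, speaking_time_s: float) -> float:
    """Words per minute over the time actually spent speaking."""
    return len(transcript.split()) / (speaking_time_s / 60.0)

def count_filler_words(transcript: str) -> int:
    """Count single-token filler words, ignoring punctuation and case."""
    tokens = [w.strip(".,!?").lower() for w in transcript.split()]
    return sum(1 for w in tokens if w in FILLER_WORDS)

pace = speaking_pace_wpm("I led the, um, migration project last year", 4.0)
fillers = count_filler_words("I led the, um, migration project last year")
```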
```python
voice_analysis = {
    "speaking_pace_wpm": 150,
    "total_speaking_time": 45.2,
    "pause_count": 8,
    "average_pause_duration": 1.2,
    "filler_word_count": 3,
    "filler_words": ["um", "uh"],
    "volume_consistency": 0.85,
    "confidence_score": 0.78,
}
```

- Transcription Failure: Continue with text-only answer
- Recording Upload Failure: Retry mechanism with exponential backoff
- Processing Timeout: Background processing for long recordings
- Storage Issues: Fallback to temporary storage with cleanup
- Permission Handling: Clear messaging for microphone/camera access
- Progress Indicators: Upload progress and processing status
- Fallback Options: Always allow text-only answers
- Error Recovery: Retry options and clear error messages
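The retry mechanism with exponential backoff mentioned above can be sketched as a small generic wrapper (assumed delays and attempt counts; the real upload path would also surface progress to the UI):

```python
import time

def retry_with_backoff(fn, max_attempts: int = 4, base_delay_s: float = 0.5):
    """Call fn(); on failure, wait base_delay_s * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay_s * (2 ** attempt))

# Demo: an upload that fails twice, then succeeds on the third try
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("upload failed")
    return "uploaded"

result = retry_with_backoff(flaky_upload, base_delay_s=0.001)
```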
- Lazy Loading: Load Whisper model only when needed
- Background Tasks: Use Celery for long-running transcription
- File Compression: Optimize recording file sizes
- Caching: Cache transcription results
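The lazy-loading idea from the list above, sketched with a stand-in loader so the pattern is visible without the heavy dependency (the real code would construct `faster_whisper.WhisperModel` once):

```python
_whisper_model = None

def get_whisper_model(loader=object):
    """Load the expensive model on first call, then reuse the same instance."""
    global _whisper_model
    if _whisper_model is None:
        # Real code (assumed): faster_whisper.WhisperModel("base", device="auto")
        _whisper_model = loader()
    return _whisper_model

a = get_whisper_model()
b = get_whisper_model()  # same object: loaded once, shared thereafter
```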
- File Cleanup: Automatic cleanup of old recordings
- Compression: Compress older recordings to save space
- Monitoring: Track storage usage and performance
- Upload Validation: Strict file type and size validation
- Path Security: Prevent directory traversal attacks
- Access Control: User-specific file access only
- Sanitization: Clean filenames and metadata
- Local Processing: All transcription and analysis done locally
- Data Retention: Clear policies for recording storage
- User Control: Options to delete recordings
- Encryption: Consider encrypting stored recordings
- Media service functionality
- File upload and validation
- Transcription accuracy
- Voice analysis metrics
- End-to-end recording workflow
- API endpoint functionality
- Frontend recording components
- Error handling scenarios
- Large file handling
- Concurrent recording processing
- Storage capacity limits
- Memory usage monitoring
- FFmpeg installation instructions
- Whisper model download and setup
- Storage directory configuration
- Permission requirements
- Environment variables for storage paths
- Recording quality settings
- Processing timeout configurations
- Cleanup job scheduling
- API endpoint documentation
- Database schema changes
- Configuration options
- Troubleshooting guide
- Recording feature overview
- Browser compatibility requirements
- Permission setup instructions
- Best practices for recording
- Phase 1 (Database & Backend Core): Steps 1-3 (Database, Media Service, API)
- Phase 2 (Frontend Integration): Steps 4-5 (Recording UI, Dependencies)
- Phase 3 (Analysis & Polish): Steps 6-8 (Voice Analysis, Error Handling, Performance)
- Phase 4 (Security & Testing): Steps 9-10 (Security, Testing)
- Phase 5 (Deployment & Docs): Steps 11-12 (Deployment, Documentation)
- ✅ Users can record audio/video during interviews
- ✅ Recordings are transcribed locally using Whisper
- ✅ Voice analysis provides meaningful feedback
- ✅ System remains 100% local and open-source
- ✅ Existing functionality is not broken
- ✅ Performance is acceptable for typical use cases
- ✅ Error handling provides good user experience
- Compatibility Issues: Extensive browser testing
- Performance Problems: Background processing and optimization
- Storage Limitations: Monitoring and cleanup strategies
- User Adoption: Clear documentation and gradual rollout
- Technical Complexity: Modular implementation with fallbacks
This implementation plan ensures a robust, secure, and user-friendly recording system that integrates seamlessly with the existing interview coach architecture while maintaining the requirement for 100% local, open-source operation.