An advanced AI-powered text-to-speech system that analyzes the emotion of input text and applies dynamic vocal parameter modulation and stuttering effects for empathetic speech synthesis. The system uses the ElevenLabs API for high-quality voice generation with emotion-specific voice settings and real-time audio processing.
- HuggingFace Transformers: Uses DistilBERT for real-time sentiment analysis
- Three Emotion Categories: Positive, Negative, Neutral classification
- Confidence Scoring: Confidence levels for emotion accuracy
- Automatic Processing: Background analysis without user intervention
- Stuttering for Sad Emotions: Automatic stuttering effects for negative emotions (>70% confidence)
- Dynamic Text Preprocessing: Intelligent word selection for stuttering (emotional words + sentence beginnings)
- Natural Patterns: Creates realistic stuttering like "I-I-I feel re-re-really sad"
- Rate Control: Speech speed modulation (0.5x to 2.0x)
- Pitch Shifting: Tonal adjustment (-12 to +12 semitones)
- Volume Control: Audio amplitude (-20dB to +20dB)
- Real-time Processing: Live audio modulation using Librosa
- Stability: Voice consistency (0.0 to 1.0)
- Similarity Boost: Adherence to original voice (0.0 to 1.0)
- Style Exaggeration: Voice style control (0.0 to 1.0)
- Speed Control: ElevenLabs native speed (0.25x to 4.0x)
- Speaker Boost: Enhanced similarity processing
- Dual Control Sections: Audio Processing + Voice Settings
- Emotion Presets: One-click presets for Positive, Negative, Neutral
- Emotion Override Toggle: Auto-apply emotion-specific settings or use manual controls
- Real-time Feedback: Shows original vs processed text when stuttering is applied
- Visual Indicators: Clear display when emotion override and stuttering are active
- Example Text Suggestions: Built-in examples for testing different emotions
- Responsive Design: Works on desktop and mobile
- Text Input → User enters text in the interface
- Sentiment Analysis → HuggingFace DistilBERT model analyzes the text:
# Uses: distilbert-base-uncased-finetuned-sst-2-english
emotion, confidence = detect_emotion(text)
# Returns: ('NEGATIVE', 0.89) or ('POSITIVE', 0.95)
- Emotion Mapping → System maps emotion to specific voice settings:
# Automatic voice parameter selection based on emotion
settings = map_emotion(emotion, confidence)
- Text Preprocessing → For negative emotions (confidence > 70%):
# Applies stuttering to emotional words
processed_text = add_stuttering_effects(text, emotion, confidence)
- ElevenLabs API Call → Sends processed text + emotion-specific voice settings
- Audio Processing → Additional local modulation for rate/pitch/volume
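The text preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: it assumes the stutter prefix is the first two letters of a word repeated twice, uses the word list shown in this README, and does not handle the "sentence beginnings" rule. The helper `stutter_word` and the `threshold` parameter are hypothetical names introduced for this example.

```python
import re

# Target words, taken from the stutter word list shown in this README.
STUTTER_WORDS = {"i", "can", "just", "really", "feel", "think",
                 "know", "want", "need", "sorry"}


def stutter_word(word, repeats=2):
    """Turn "feel" into "fe-fe-feel" by repeating the first two letters."""
    prefix = word[:2]
    return "-".join([prefix] * repeats + [word])


def add_stuttering_effects(text, emotion, confidence, threshold=0.7):
    """Stutter target words only for confident NEGATIVE predictions."""
    if emotion != "NEGATIVE" or confidence <= threshold:
        return text

    def replace(match):
        word = match.group(0)
        return stutter_word(word) if word.lower() in STUTTER_WORDS else word

    # Match runs of letters and apostrophes so contractions such as
    # "can't" are treated as one token and left unchanged.
    return re.sub(r"[A-Za-z']+", replace, text)


print(add_stuttering_effects("I feel really sad", "NEGATIVE", 0.89))
# prints: I-I-I fe-fe-feel re-re-really sad
```

Text that is positive, or negative with confidence at or below the threshold, is returned unchanged.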
| Emotion | Stability | Similarity | Style | Speed | Rate | Pitch | Volume | Stuttering |
|---|---|---|---|---|---|---|---|---|
| Positive | 0.3 (varied) | 0.8 (high) | 0.2 (stylized) | 1.1x (faster) | 1.2x | +2 | +3dB | None |
| Negative | 0.2 (emotional) | 0.6 (lower) | 0.1 (subtle) | 0.8x (slower) | 0.85x | -2 | -3dB | Applied |
| Neutral | 0.5 (balanced) | 0.75 (standard) | 0.0 (natural) | 1.0x (normal) | 1.0x | 0 | 0dB | None |
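The table above can be expressed as a simple lookup. The sketch below is an assumption about how such a mapping might look, not the project's code: the `EMOTION_PRESETS` name, the extra `manual_settings` argument, and the `threshold` parameter are hypothetical. The fallback to manual settings follows the 70% confidence rule this README describes for the emotion override.

```python
# Preset values copied from the emotion table in this README.
EMOTION_PRESETS = {
    "POSITIVE": {"stability": 0.3, "similarity_boost": 0.8, "style": 0.2,
                 "speed": 1.1, "rate": 1.2, "pitch": 2, "volume": 3},
    "NEGATIVE": {"stability": 0.2, "similarity_boost": 0.6, "style": 0.1,
                 "speed": 0.8, "rate": 0.85, "pitch": -2, "volume": -3},
    "NEUTRAL": {"stability": 0.5, "similarity_boost": 0.75, "style": 0.0,
                "speed": 1.0, "rate": 1.0, "pitch": 0, "volume": 0},
}


def map_emotion(emotion, confidence, manual_settings, threshold=0.7):
    """Return the emotion preset when confident, else the manual settings."""
    preset = EMOTION_PRESETS.get(emotion.upper())
    if preset is None or confidence <= threshold:
        # Unknown label or low confidence: keep what the user selected.
        return dict(manual_settings)
    return dict(preset)


settings = map_emotion("NEGATIVE", 0.89, {"rate": 1.0})
print(settings["rate"], settings["pitch"])  # prints: 0.85 -2
```

Returning a copy of the preset keeps later per-request adjustments from modifying the shared table.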
- Backend: FastAPI, Python 3.8+, Uvicorn
- AI/ML: HuggingFace Transformers, PyTorch, DistilBERT
- Audio Processing: Librosa 0.10.1, SoundFile, Pydub, NumPy
- TTS Engine: ElevenLabs API v1
- Frontend: HTML5, CSS3, Vanilla JavaScript
- Audio Formats: MP3 (ElevenLabs) → WAV (Processing) → WAV (Output)
- Python 3.8+ (Recommended: 3.11)
- Conda Environment (cyberwatchdog or similar)
- ElevenLabs API Key (available from your ElevenLabs account)
- FFmpeg (for audio format conversion)
- Modern Web Browser (Chrome, Firefox, Safari, Edge)
git clone https://github.com/Mukulguptaiit/darwix_hackathon.git
cd darwix_hackathon
# Using Conda (Recommended)
conda create -n cyberwatchdog python=3.11
conda activate cyberwatchdog
# Install dependencies
pip install -r requirements.txt
# Create .env file
echo 'ELEVEN_API_KEY="your_elevenlabs_api_key_here"' > .env
# Or export directly
export ELEVEN_API_KEY="your_elevenlabs_api_key_here"
# Activate environment and start server
conda activate cyberwatchdog
uvicorn main:app --host 127.0.0.1 --port 8000
Open interface.html in your browser or visit:
file:///path/to/darwix_hackathon/interface.html
Negative Emotion Examples (Will trigger stuttering):
- "I feel really sad and I just want to cry"
- "I can't handle this anymore, I need help"
- "I'm sorry, I think I really messed up"
Positive Emotion Examples:
- "I'm so excited about this amazing project!"
- "What a wonderful day, I feel fantastic!"
- "This is absolutely incredible, I love it!"
Neutral Examples:
- "The weather forecast shows rain tomorrow."
- "Please complete the assignment by Friday."
- "The meeting is scheduled for 3 PM."
Generate speech with emotion detection and modulation.
Request Body:
{
"text": "I feel really sad and I just want to cry",
"rate": 0.8,
"pitch": -2,
"volume": -3,
"stability": 0.2,
"similarity_boost": 0.6,
"style": 0.1,
"speed": 0.8,
"use_speaker_boost": true
}
Response:
{
"emotion": "NEGATIVE",
"confidence": 0.89,
"original_text": "I feel really sad and I just want to cry",
"processed_text": "I-I-I fe-fe-feel re-re-really sa-sa-sad and I-I-I ju-ju-just wa-wa-want to cry",
"stuttering_applied": true,
"params": {
"rate": 0.8,
"pitch": -2,
"volume": -3,
"stability": 0.2,
"similarity_boost": 0.6,
"style": 0.1,
"speed": 0.8,
"use_speaker_boost": true
},
"audio_file": "final.wav"
}
Retrieve generated audio file.
- Rate Slider: Controls local speech speed (0.5x - 2.0x)
- Pitch Slider: Adjusts pitch in semitones (-12 to +12)
- Volume Slider: Modifies volume in decibels (-20dB to +20dB)
- Stability Slider: Voice consistency (0.0 - 1.0)
- Similarity Slider: Voice adherence (0.0 - 1.0)
- Style Slider: Voice exaggeration (0.0 - 1.0)
- Speed Slider: ElevenLabs speed (0.25x - 4.0x)
- Speaker Boost: Toggle for enhanced similarity
- Auto-Apply Emotion Settings: Toggle to automatically override manual settings based on detected emotion
- When enabled: System applies emotion-specific presets automatically (overrides all sliders)
- When disabled: Uses your manual slider settings regardless of detected emotion
- Requires >70% confidence for emotion override to activate
- Positive: Energetic, faster, higher pitch (Happy & Energetic - Fast, High Pitch)
- Negative: Subdued, slower, lower pitch + stuttering (Sad & Emotional - Slow, Low Stability)
- Neutral: Balanced, natural settings (Balanced & Natural - Default Settings)
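The slider ranges listed above can be enforced before a request is sent or processed. The following is a small sketch under that assumption. `PARAM_RANGES` and `clamp_params` are hypothetical names, and whether the project clamps or rejects out-of-range values is not stated in this README.

```python
# Allowed ranges, copied from the control descriptions in this README.
PARAM_RANGES = {
    "rate": (0.5, 2.0),
    "pitch": (-12, 12),
    "volume": (-20, 20),
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "style": (0.0, 1.0),
    "speed": (0.25, 4.0),
}


def clamp_params(params):
    """Return a copy of params with numeric settings clamped to range."""
    clamped = dict(params)
    for name, (low, high) in PARAM_RANGES.items():
        if name in clamped:
            clamped[name] = max(low, min(high, clamped[name]))
    # Keys without a range (such as use_speaker_boost) pass through as-is.
    return clamped


print(clamp_params({"rate": 3.0, "pitch": -15, "use_speaker_boost": True}))
# prints: {'rate': 2.0, 'pitch': -12, 'use_speaker_boost': True}
```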
ELEVEN_API_KEY=your_elevenlabs_api_key
The system uses DistilBERT for emotion detection:
- Model: `distilbert-base-uncased-finetuned-sst-2-english`
- Task: Binary sentiment classification (POSITIVE/NEGATIVE)
- Device: Automatically detects MPS (Apple Silicon) or CPU
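A model configured this way could be loaded along the following lines. This is a sketch rather than the project's code: it assumes the HuggingFace `pipeline` helper and PyTorch's MPS availability check, and the helper names `pick_device` and `to_emotion` are hypothetical. The optional `classifier` argument exists only so the function can be exercised without downloading the model.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def pick_device():
    """Prefer Apple's MPS backend when available, otherwise use the CPU."""
    import torch  # imported lazily so the other helpers work without PyTorch

    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def to_emotion(prediction):
    """Convert one pipeline result dict into an (emotion, confidence) tuple."""
    return prediction["label"], float(prediction["score"])


def detect_emotion(text, classifier=None):
    """Classify text, building the HuggingFace pipeline if none is supplied."""
    if classifier is None:
        from transformers import pipeline

        classifier = pipeline("sentiment-analysis", model=MODEL_NAME,
                              device=pick_device())
    # The pipeline returns a list with one dict per input text.
    return to_emotion(classifier(text)[0])
```

In a server, the pipeline would normally be built once at startup and passed in as `classifier`, since constructing it on every request reloads the model.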
Default ElevenLabs voice: EXAVITQu4vr4xnSDxMaL
You can change this in main.py:
ELEVEN_VOICE_ID = "your_preferred_voice_id"
def add_stuttering_effects(text, emotion, confidence):
    if emotion == "NEGATIVE" and confidence > 0.7:
        # Target emotional words and sentence beginnings
        stutter_words = ['i', 'can', 'just', 'really', 'feel', 'think', 'know', 'want', 'need', 'sorry']
        # Apply stuttering pattern: "word" → "wo-wo-word"
        # Creates natural emotional speech effects
def map_emotion(emotion, confidence):
    # Dynamic parameter selection based on detected emotion
    # Integrates with both ElevenLabs API and local processing
- Text → Emotion Detection → Stuttering → ElevenLabs TTS
- MP3 Audio → Format Conversion → Librosa Processing
- Rate/Pitch/Volume Modulation → Final WAV Output
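The final modulation stage of this pipeline might look like the sketch below. It assumes Librosa's `time_stretch` and `pitch_shift` effects and the standard decibel-to-amplitude conversion. The function names `db_to_gain` and `modulate` are hypothetical, and clipping the result to [-1, 1] is a choice made for this example, not documented project behavior.

```python
def db_to_gain(volume_db):
    """Convert a decibel change into a linear amplitude multiplier."""
    return 10.0 ** (volume_db / 20.0)


def modulate(samples, sample_rate, rate=1.0, pitch=0.0, volume=0.0):
    """Apply rate, pitch (semitones), and volume (dB) changes to a waveform.

    samples is expected to be a NumPy float array, for example the first
    value returned by librosa.load.
    """
    import librosa  # imported lazily so db_to_gain works without Librosa

    if rate != 1.0:
        # rate > 1 speeds playback up, rate < 1 slows it down.
        samples = librosa.effects.time_stretch(samples, rate=rate)
    if pitch != 0:
        samples = librosa.effects.pitch_shift(samples, sr=sample_rate,
                                              n_steps=pitch)
    # Scale the amplitude, then clip so boosted audio stays in range.
    return (samples * db_to_gain(volume)).clip(-1.0, 1.0)
```

For the Negative preset (0.85x rate, -2 semitones, -3dB), the gain factor is about 0.71, so the output is slower, lower, and quieter than the ElevenLabs audio it starts from.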
Server won't start:
# Check API key is set
echo $ELEVEN_API_KEY
# Verify conda environment
conda activate cyberwatchdog
which python
Audio not playing:
- Ensure server is running on port 8000
- Check browser console for CORS errors
- Verify audio file generation in project directory
Stuttering not working:
- Test with clearly negative text
- Check emotion detection confidence (needs >70%)
- Verify processed_text in API response
ElevenLabs API errors:
- Validate API key format (starts with `sk_`)
- Check API quota and billing
- Ensure internet connectivity
- Emotion Detection: ~50ms per request
- Audio Generation: 1-3 seconds (depends on text length)
- Audio Processing: ~500ms for modulation
- Total Pipeline: 2-4 seconds end-to-end
- Fork the repository
- Create feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- ElevenLabs for high-quality TTS API
- HuggingFace for transformer models
- FastAPI for the excellent web framework
- Librosa for audio processing capabilities
For issues and questions:
- Open an issue on GitHub
- Check the troubleshooting section
- Review API documentation
Ready to create empathetic AI voices? Start with the Quick Start guide above!