AI-powered Automatic Speech Recognition service for Thai language transcription with multi-GPU parallel processing and advanced PII detection.
This service provides enterprise-grade ASR (Automatic Speech Recognition) capabilities specifically optimized for Thai language audio processing. Built with FastAPI and following hexagonal architecture principles, it offers multi-GPU parallel processing, stereo audio support, multiple ASR models, intelligent model selection, and comprehensive privacy protection.
- Multiple Model Support: Typhoon, Pathumma, and Pathumma-noise models
- Multi-GPU Parallel Processing: Process stereo audio channels simultaneously across multiple GPUs
- Stereo Audio Support: Separate Agent (left) and Caller (right) channel transcription
- Multiprocessing Architecture: CUDA-safe multiprocessing with spawn start method
- CPU Fallback: Automatic CPU support when GPU is unavailable
- Smart Model Selection: AI-powered model selection based on audio context
- Chunk-based Processing: Efficient handling of large audio files (>10 minutes)
- Memory Management: Lazy model loading with LRU eviction, optimized for production workloads
- Hanging Prevention: Queue timeouts and process termination for long-running tasks
- PII Detection: Automatic detection of personal information
- Entity Recognition: Names, phone numbers, emails, ID cards, dates of birth
- Data Masking: Automatic redaction of sensitive information
- Compliance Ready: Built for GDPR and data protection requirements
- QA Auditing: Automated quality assessment of transcriptions
- Consistency Checking: Cross-validation between model outputs
- Re-verification: Intelligent review of uncertain segments
- Performance Metrics: Detailed analytics and reporting
The codebase follows Hexagonal Architecture (Ports & Adapters):
```
src/
├── agents/      # AI agents for specialized tasks
├── api/         # REST API endpoints
├── execution/   # Business logic and use cases
├── models/      # ASR model implementations
├── config/      # Configuration management
└── utils/       # Utility functions
```
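In this layout, the use cases in `execution/` depend on abstract ports, while `models/` and `api/` plug in as adapters. Below is a generic sketch of that pattern with illustrative names only, not the actual classes in this repository:

```python
from typing import Protocol


class TranscriberPort(Protocol):
    """Port: the capability a use case needs, independent of any concrete model."""

    def transcribe(self, wav_path: str) -> str: ...


class FakeTranscriber:
    """Adapter: a concrete implementation wired in at the edge (e.g. models/)."""

    def transcribe(self, wav_path: str) -> str:
        return f"transcript of {wav_path}"


def transcribe_usecase(wav_path: str, asr: TranscriberPort) -> str:
    """Use case (execution/): pure orchestration, no framework or model details."""
    return asr.transcribe(wav_path)


print(transcribe_usecase("example.wav", FakeTranscriber()))
```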
Prerequisites:
- Python 3.11
- Docker (optional)
- FFmpeg (for audio processing)
- Clone the repository

  ```
  git clone <repository-url>
  cd asr_service_server
  ```

- Install dependencies with uv

  ```
  uv sync
  ```

- Configure environment

  ```
  cp .env.example .env
  # Edit .env with your API keys and settings
  ```

- Run the service

  ```
  uv run python -m src.api.main
  ```
Alternatively, run with Docker:

```
# Development
docker-compose -f docker-compose.dev.yaml up

# Production
docker build -t asr-service .
docker run -p 3000:3000 asr-service
```

Once running, access the interactive API documentation at http://localhost:3000/docs.
```
POST /api/v1/process-unified-stereo
Content-Type: multipart/form-data

file: <stereo_wav_file>
force_model: typhoon|pathumma|pathumma_noise (optional)
skip_model_selection: true|false (default: false)
auto_continue: true|false (default: true)
```

Features:
- Automatically detects and uses multiple GPUs for parallel processing
- Separates Agent (left) and Caller (right) channels
- Supports CPU fallback when GPU unavailable
- Handles files >10 minutes with chunked processing
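For example, a minimal client call with `requests` (a sketch: the file name is a placeholder and the response is assumed to be JSON; check /docs for the exact schema):

```python
import requests

# Post a stereo call recording to the unified endpoint.
# "call_stereo.wav" is a placeholder; the form fields follow the spec above.
with open("call_stereo.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:3000/api/v1/process-unified-stereo",
        files={"file": ("call_stereo.wav", f, "audio/wav")},
        data={
            "force_model": "pathumma_noise",  # optional: bypass AI model selection
            "skip_model_selection": "true",
            "auto_continue": "true",
        },
        timeout=600,  # long recordings can take several minutes
    )
resp.raise_for_status()
print(resp.json())
```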
```
POST /api/v1/process_wav_file
Content-Type: multipart/form-data

file: <wav_file>
with_transcription: true|false
```
```
POST /api/v1/process_wav2file
Content-Type: multipart/form-data

file: <wav_file>
model: typhoon|pathumma|pathumma_noise
```
```
POST /api/v1/process_json_transcript
Content-Type: application/json

{
  "transcript": {
    "text": "transcription text",
    "chunks": [...]
  }
}
```

Additional endpoints: `GET /api/v1/transcription_sessions` and `POST /api/v1/process_qa_auditor`.
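For the `process_json_transcript` endpoint above, a minimal client call might look like the following (a sketch; the per-chunk fields shown are an assumption, so check /docs for the real schema):

```python
import requests

# Illustrative payload: the chunk entries are assumed to carry segment text
# plus start/end timestamps.
payload = {
    "transcript": {
        "text": "สวัสดีครับ ติดต่อเรื่องใบแจ้งหนี้ครับ",
        "chunks": [
            {"text": "สวัสดีครับ", "timestamp": [0.0, 1.2]},
            {"text": "ติดต่อเรื่องใบแจ้งหนี้ครับ", "timestamp": [1.2, 3.4]},
        ],
    }
}

resp = requests.post(
    "http://localhost:3000/api/v1/process_json_transcript",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```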
The service is configured through environment variables:

| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | None |
| `DEEPSEEK_API_KEY` | DeepSeek API key | None |
| `SERVER_PORT` | Server port | 3000 |
| `SERVER_HOST` | Server host | 0.0.0.0 |
| `LOG_LEVEL` | Logging level | info |
| `REDIS_HOST` | Redis host | localhost |
| `REDIS_PORT` | Redis port | 6379 |
| `USE_ML_VAD` | Use ML-based voice activity detection | false |
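A minimal `.env` for local development might look like this (placeholder values matching the defaults above):

```
OPENAI_API_KEY=your-openai-key
DEEPSEEK_API_KEY=your-deepseek-key
SERVER_PORT=3000
SERVER_HOST=0.0.0.0
LOG_LEVEL=info
REDIS_HOST=localhost
REDIS_PORT=6379
USE_ML_VAD=false
```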
The service automatically detects and uses available GPUs:
- 2+ GPUs: Assigns cuda:0 to Agent (left channel) and cuda:1 to Caller (right channel)
- 1 GPU: Both channels use the same GPU with parallel processing
- No GPU: Automatically falls back to CPU processing
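A sketch of this device-assignment logic, assuming `torch.cuda.device_count()` is used for detection (the actual selection code may differ):

```python
def assign_channel_devices() -> tuple[str, str]:
    """Pick devices for the Agent (left) and Caller (right) channels."""
    try:
        import torch
        gpu_count = torch.cuda.device_count()
    except ImportError:
        gpu_count = 0

    if gpu_count >= 2:
        return "cuda:0", "cuda:1"  # one GPU per channel
    if gpu_count == 1:
        return "cuda:0", "cuda:0"  # both channels share the single GPU
    return "cpu", "cpu"            # CPU fallback


agent_device, caller_device = assign_channel_devices()
print(agent_device, caller_device)
```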
Multiprocessing:
- Uses the `spawn` start method for CUDA compatibility
- Isolated ASR managers per worker process
- Per-process device assignment to prevent tensor device mismatch
- Queue timeouts and process termination for reliability
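A minimal sketch of this pattern using the standard library (function and file names are illustrative, not the service's actual internals):

```python
import multiprocessing as mp


def transcribe_channel(device, wav_path, results):
    """Worker entry point: each process builds its own ASR manager on its device."""
    # Heavy imports (torch, the model code) would happen here, inside the worker,
    # so CUDA is only initialized after the process has been spawned.
    results.put({"device": device, "file": wav_path, "text": "..."})


if __name__ == "__main__":
    # "spawn" avoids inheriting CUDA state from the parent process.
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    workers = [
        ctx.Process(target=transcribe_channel, args=("cuda:0", "agent.wav", results)),
        ctx.Process(target=transcribe_channel, args=("cuda:1", "caller.wav", results)),
    ]
    for w in workers:
        w.start()

    # Queue timeouts guard against a hung worker (see "Hanging Prevention").
    outputs = [results.get(timeout=30) for _ in workers]

    for w in workers:
        w.join(timeout=5)
        if w.is_alive():
            w.terminate()

    print(outputs)
```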
The service supports three ASR models:
- Typhoon: Fast, optimized for clear audio
- Pathumma: Balanced performance for general use
- Pathumma-noise: Enhanced for noisy environments (noise-resistant variant of Pathumma)
Pathumma-noise is a subclass of Pathumma with additional noise handling capabilities, making it ideal for:
- Call center recordings with background noise
- Outdoor recordings
- Low-quality audio sources
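Conceptually, the subclass relationship can be pictured like this (illustrative class names only, not the actual implementation):

```python
class PathummaASR:
    """Baseline Pathumma transcription."""

    def preprocess(self, audio):
        return audio  # standard feature preparation

    def transcribe(self, audio) -> str:
        features = self.preprocess(audio)
        return f"transcript from {len(features)} samples"  # placeholder inference


class PathummaNoiseASR(PathummaASR):
    """Noise-resistant variant: same interface, extra noise handling up front."""

    def preprocess(self, audio):
        audio = self._suppress_noise(audio)  # additional noise-handling step
        return super().preprocess(audio)

    def _suppress_noise(self, audio):
        return audio  # placeholder for the actual noise-suppression logic
```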
- Parallel Channel Processing: Agent and Caller transcribed simultaneously on separate GPUs
- Round-robin Distribution: Audio chunks distributed across available GPUs
- Per-Process Isolation: Each worker process has its own ASR manager to avoid CUDA race conditions
- Lazy Model Loading: Models loaded on-demand with LRU cache eviction (see the sketch after this list)
- Multi-GPU Support: Leverages 2+ GPUs for stereo audio processing
- CPU Fallback: Graceful degradation when GPU unavailable
- Horizontal Scaling: Docker-ready for container orchestration
- Async Processing: Non-blocking I/O operations for high throughput
- Hanging Prevention: Queue timeouts (30s) and process termination for long files
- Memory Management: Automatic model cache cleanup and resource cleanup
- Error Recovery: Retry mechanisms and graceful error handling
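A minimal sketch of the lazy-loading/LRU pattern mentioned above, using an `OrderedDict` (the real model manager's API and eviction details may differ):

```python
from collections import OrderedDict


class LazyModelCache:
    """Load ASR models on first use and evict the least recently used one."""

    def __init__(self, loaders, max_models=2):
        self._loaders = loaders      # name -> zero-arg factory that loads a model
        self._cache = OrderedDict()
        self._max_models = max_models

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        if len(self._cache) >= self._max_models:
            self._cache.popitem(last=False)  # evict least recently used
            # In practice, GPU memory would also be released here.
        self._cache[name] = self._loaders[name]()  # lazy load on first request
        return self._cache[name]


# Usage sketch with placeholder loaders:
cache = LazyModelCache({
    "typhoon": lambda: "typhoon-model",
    "pathumma": lambda: "pathumma-model",
    "pathumma_noise": lambda: "pathumma-noise-model",
})
cache.get("typhoon")
cache.get("pathumma")
cache.get("pathumma_noise")  # evicts "typhoon", the least recently used
```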
```
# Run tests
uv run pytest src/tests/

# Run specific test
uv run pytest src/tests/test_masker_action.py

# Test multi-GPU configuration
uv run pytest src/tests/test_gpu_cudax2.py
```

```
asr_transcribe_masking_service/
├── src/
│   ├── agents/              # AI agents and workflows
│   │   ├── prompts/         # Agent prompts and instructions
│   │   ├── workflows/       # LangGraph workflows
│   │   └── tools/           # Agent tools
│   ├── api/                 # API endpoints
│   │   └── endpoints/v1/    # Version 1 endpoints
│   ├── execution/           # Business logic
│   │   ├── actions/         # Action implementations
│   │   └── usecases/        # Use case orchestrators
│   ├── models/              # ASR model implementations
│   ├── config/              # Configuration
│   └── utils/               # Utilities
├── tests/                   # Test files
├── docker-compose.dev.yaml
├── Dockerfile
└── pyproject.toml
```
- FastAPI: Modern web framework with async support
- Pydantic: Data validation and settings management
- Pydantic Settings: Configuration management
- LangChain/LangGraph: AI workflow orchestration
- Transformers: Hugging Face models and tokenizers
- PyTorch: Deep learning framework with CUDA support
- Typhoon ASR: Thai-specific ASR model
- Pathumma ASR: High-quality Thai transcription model
- Pathumma-noise ASR: Noise-resistant variant for challenging environments
- soundfile: Audio file I/O
- numpy: Numerical operations for audio data
- OpenAI: GPT models for AI agents
- DeepSeek: Alternative LLM provider
- Redis: Caching and queue management
- multiprocessing: Python parallel processing for multi-GPU support
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the API documentation
- Review logs in `src/logs/`
- Open an issue in the repository
- Horizontal Scaling: Docker-ready for cloud deployment
- Monitoring: Comprehensive logging and metrics
- Security: API key authentication ready
- Compliance: GDPR and data protection compliant
- High Availability: Designed for 99.9% uptime
Built with ❤️ for Thai language AI processing