Flask API for Falcon ASR transcription workflows. The primary deployment is now split into two containers: a lightweight API service and a separate vLLM inference service.
- POST /api/transcription/sessions
- POST /api/transcription/sessions/<session_id>/chunks
- GET /api/transcription/sessions/<session_id>/stream
- POST /api/transcription/sessions/<session_id>/finalize
- WS /api/transcription/sessions/ws
- POST /api/transcription/transcribe
- POST /api/transcription/feedback
- GET /api/admin/transcription/stats
- GET /health
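As a rough illustration of how the session endpoints fit together, the sketch below builds the URLs for one chunked session. The helper name `session_routes` is hypothetical, and the sequence (create, upload chunks, stream, finalize) simply follows the order of the list above; request and response bodies are not shown because this README does not specify them.

```python
from urllib.parse import quote, urljoin


def session_routes(base_url: str, session_id: str) -> list:
    """Return (method, URL) pairs for the session endpoints listed above.

    Hypothetical helper: only the HTTP methods and paths come from the
    endpoint list; the ordering is an assumption.
    """
    root = urljoin(base_url.rstrip("/") + "/", "api/transcription/sessions")
    sid = quote(session_id, safe="")  # keep the ID safe inside a URL path
    return [
        ("POST", root),                      # create a session
        ("POST", f"{root}/{sid}/chunks"),    # upload an audio chunk
        ("GET", f"{root}/{sid}/stream"),     # read streaming results
        ("POST", f"{root}/{sid}/finalize"),  # close the session
    ]


if __name__ == "__main__":
    for method, url in session_routes("http://127.0.0.1:5001", "example-session-id"):
        print(method, url)
```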
The API owns sessions, streaming state, feedback records, uploaded audio artifacts, and database migrations. The vLLM service owns GPU inference.
- vllm is the primary backend for deployment.
- torch is retained as a legacy self-contained runtime path.
- mlx is retained as an experimental local backend.
Production-facing docs and build files should prefer ASR_BACKEND=vllm unless they explicitly describe a legacy or experimental path.
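To make the backend choice concrete, here is a minimal sketch of reading it from the environment. The variable names ASR_BACKEND, VLLM_BASE_URL, and VLLM_MODEL come from this README; the `BackendSettings` class, the `load_backend_settings` function, the `vllm` default, and the rule that the vLLM backend needs both VLLM variables are assumptions for illustration, not the API's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Backend names taken from the list above.
KNOWN_BACKENDS = ("vllm", "torch", "mlx")


@dataclass(frozen=True)
class BackendSettings:
    backend: str
    vllm_base_url: Optional[str]
    vllm_model: Optional[str]


def load_backend_settings(env: Optional[Mapping[str, str]] = None) -> BackendSettings:
    """Read backend settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumption: fall back to vllm when ASR_BACKEND is unset.
    backend = env.get("ASR_BACKEND", "vllm").strip().lower()
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown ASR_BACKEND {backend!r}; expected one of {KNOWN_BACKENDS}")
    base_url = env.get("VLLM_BASE_URL")
    model = env.get("VLLM_MODEL")
    # Assumption: the vLLM backend cannot work without a URL and a model name.
    if backend == "vllm" and not (base_url and model):
        raise ValueError("ASR_BACKEND=vllm requires VLLM_BASE_URL and VLLM_MODEL")
    return BackendSettings(backend, base_url, model)
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to exercise in tests.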
Set the host path to the Falcon Audio v2 checkpoint, then start both services:
cp .env.example .env
# Edit FALCON_AUDIO_MODEL_HOST_PATH in .env if needed.
docker compose -f docker-compose.vllm.yml up --build
The compose stack builds:
- Dockerfile.vllm as falcon-asr-vllm:local, serving OpenAI-compatible vLLM on port 8000.
- Dockerfile.api as falcon-asr-api:local, serving the Flask API on port 5000 with ASR_BACKEND=vllm.
Health checks:
curl -fsS http://127.0.0.1:8000/health
curl -fsS http://127.0.0.1:5000/health
See docs/two-container-vllm-deployment.md for individual docker build and docker run commands.
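The two curl health checks can also be scripted, for example in a smoke test that waits for the stack to come up. The following is a minimal standard-library sketch; the names `SERVICES`, `is_healthy`, and `stack_health` are hypothetical, and it assumes that any 2xx response from /health means the service is ready.

```python
import urllib.error
import urllib.request

# Ports taken from the compose stack described above.
SERVICES = {
    "vllm": "http://127.0.0.1:8000/health",
    "api": "http://127.0.0.1:5000/health",
}


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all land here.
        return False


def stack_health(services=None, opener=urllib.request.urlopen):
    """Check every service and return a name -> bool mapping."""
    services = SERVICES if services is None else services
    return {name: is_healthy(url, opener=opener) for name, url in services.items()}


if __name__ == "__main__":
    for name, ok in stack_health().items():
        print(f"{name}: {'ok' if ok else 'unreachable'}")
```

The `opener` parameter exists only so the check can be tested without a running server.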
Run the API locally against an already-running vLLM service:
python3 -m venv .venv-api
source .venv-api/bin/activate
pip install -r requirements-api.txt
export ASR_BACKEND=vllm
export VLLM_BASE_URL=http://127.0.0.1:8000
export VLLM_MODEL=/models/falcon_audio_v2_vllm
python -m flask --app wsgi.py db upgrade
python -m flask --app wsgi.py run --host 127.0.0.1 --port 5001
Single-call transcription example:
curl -sS -X POST \
  -H 'Authorization: Bearer replace-me' \
  -F 'audio_file=@tests/fixtures/walt1-2.mp3;type=audio/mpeg' \
  http://127.0.0.1:5001/api/transcription/transcribe
Dockerfile and requirements.txt remain for the older CUDA/Torch image that runs the API and inference in one container. Use this only when intentionally validating or operating the legacy path:
docker build -f Dockerfile -t falcon-arab-asr-api:legacy .
docker run --rm --gpus all \
  -p 5000:5000 \
  -v falcon_asr_api_usage:/falcon_asr_api_usage \
  -e TRANSCRIPTION_API_KEYS=replace-me \
  -e TRANSCRIPTION_ADMIN_API_KEYS=replace-me-admin \
  falcon-arab-asr-api:legacy
cloudbuild.yaml builds and publishes the split images:
gcloud builds submit --config cloudbuild.yaml
Default image names:
- falcon-asr-api
- falcon-asr-vllm
Override region, repo, or names with substitutions:
gcloud builds submit \
  --config cloudbuild.yaml \
  --substitutions=_AR_REGION=us-central1,_AR_REPO=falcon-asr,_API_IMAGE_NAME=falcon-asr-api,_VLLM_IMAGE_NAME=falcon-asr-vllm
Legacy unified-image release history is kept in docs/Docker-Release-History.md.
- Customer-facing API guide: docs/Falcon-ASR-demo-API-documentation.md
- Split deployment guide: docs/two-container-vllm-deployment.md
- vLLM shim notes: docs/falcon_audio_v2_vllm_deployment.md
- Experimental MLX notes: mlx_porting.md
Authentication for deployed environments uses Authorization: Bearer <key>. X-API-Key is still accepted, but only for legacy compatibility.
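One way to read the two accepted headers is sketched below. The function name `extract_api_key` is hypothetical, and the precedence (Bearer first, X-API-Key as fallback) is an assumption; this README does not say how the API resolves a request that carries both.

```python
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the API key from request headers, or None if none is present.

    Assumption: Authorization: Bearer <key> wins over the legacy X-API-Key.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()

    legacy = lowered.get("x-api-key", "").strip()
    return legacy or None
```

New clients would send only the Authorization header, as in the curl example; the fallback branch is there for older callers.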
- JSON content type enforced for session creation and feedback.
- Multipart content type enforced for chunk uploads and one-shot transcription.
- UUID validation for session_id and transcription_id.
- Audio extension and MIME checks for uploads.
- Consistent 4xx response envelope: error, code, status_code.
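The checks above can be pictured with a small standalone sketch. Only the envelope keys (error, code, status_code) and the kinds of checks come from this README; the allow-list in `ALLOWED_AUDIO`, the `code` strings, the specific 400 and 415 statuses, and the function names are assumptions chosen for illustration.

```python
import uuid
from pathlib import Path

# Assumed allow-list; the README does not enumerate the accepted formats.
ALLOWED_AUDIO = {
    ".mp3": {"audio/mpeg"},
    ".wav": {"audio/wav", "audio/x-wav"},
}


def error_envelope(message, code, status_code):
    """Build the 4xx body with the three keys named above."""
    return {"error": message, "code": code, "status_code": status_code}


def is_valid_uuid(value):
    """Accept only the canonical hyphenated UUID form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def validate_upload(session_id, filename, mime_type):
    """Return None when the upload passes, otherwise an error envelope."""
    if not is_valid_uuid(session_id):
        return error_envelope("session_id must be a UUID", "invalid_session_id", 400)
    allowed_mimes = ALLOWED_AUDIO.get(Path(filename).suffix.lower())
    if allowed_mimes is None:
        return error_envelope("unsupported audio extension", "unsupported_extension", 415)
    if mime_type.lower() not in allowed_mimes:
        return error_envelope("MIME type does not match extension", "unsupported_mime_type", 415)
    return None
```

Returning the envelope as a plain dict keeps the sketch framework-agnostic; a Flask handler could serialize it as JSON and use its status_code for the response.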
Install test dependencies in a local environment, then run:
python -m pytest -q tests/test_transcription_api.py tests/test_transcription_models.py tests/test_migrations.py tests/test_vllm_adapter.py