
Nebula: Introduce Nebula Transcription Service (Local Whisper + GPT Vision Integration) #108

Open · wants to merge 11 commits into main

3 changes: 3 additions & 0 deletions nebula/.gitignore
@@ -177,3 +177,6 @@ cython_debug/
.idea/

.DS_Store

# Ignore transcription temp files
nebula/src/transcript/temp/
120 changes: 118 additions & 2 deletions nebula/README.MD
@@ -1,3 +1,119 @@
# Nebula
# Nebula_Transcriber

TODO
Nebula_Transcriber is a lightweight transcription service for educational videos.
It uses **Whisper** locally for audio transcription and **GPT-4o Vision** to detect visible slide numbers from video frames.

---

## ✨ Features

- 🎥 Process `.m3u8` lecture video URLs (e.g., from TUM-Live)
- 🧠 Transcribe audio using local Whisper models
- 👁️ Detect slide numbers via GPT-4o Vision (Azure)
- ⚡ Stateless and fast — no database or external storage
- 🚀 Exposes a clean **FastAPI** interface

---

## 🧪 Local Development Setup

```bash
git clone https://github.com/ls1intum/edutelligence.git
cd edutelligence
git checkout feature/transcript
cd nebula/src/transcript
```

```bash
python -m venv .venv
.venv\Scripts\activate # on Windows
source .venv/bin/activate # on Unix/Mac

pip install -r requirements.txt
```

Create `llm_config.nebula.yml` in `nebula/src/` (you can start from `llm_config.example.yml`):

```yaml
- id: azure-gpt-4-omni
  type: azure_chat
  api_key: <your-api-key>
  api_version: 2024-02-15-preview
  azure_deployment: gpt-4o
  endpoint: https://<your-endpoint>.openai.azure.com/
```

Run the FastAPI app:

```bash
uvicorn app:app --reload --host 0.0.0.0 --port 5000
```
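
Once the server is up, the root endpoint defined in `app.py` can serve as a quick health check. A minimal sketch, assuming the default host and port above:

```python
import requests

# GET / returns {"message": "FastAPI server is running!"} (see app.py)
print(requests.get("http://localhost:5000/").json())
```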

---

## 🐳 Docker Setup

```bash
cd nebula
docker compose up --build
```

Ensure `llm_config.nebula.yml` is available in the container.

---

## 🚀 API Usage

**POST** `/api/lecture/{lectureId}/lecture-unit/{lectureUnitId}/nebula-transcriber`

```json
{
  "videoUrl": "https://your.video.url/playlist.m3u8",
  "lectureId": 1,
  "lectureUnitId": 2
}
```
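
For quick manual testing, a minimal client sketch. Note that `app.py` in this PR registers the route as `/start-transcribe` and only requires `videoUrl`; the host and port below are assumptions for a local run:

```python
import requests

resp = requests.post(
    "http://localhost:5000/start-transcribe",
    json={"videoUrl": "https://your.video.url/playlist.m3u8"},
    timeout=3600,  # transcription runs synchronously and can take a while
)
resp.raise_for_status()
for seg in resp.json()["segments"]:
    print(f'[{seg["startTime"]:.1f}s] slide {seg["slideNumber"]}: {seg["text"]}')
```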

---

## 🎓 Getting a TUM-Live `.m3u8` Link

1. Open [https://live.rbg.tum.de](https://live.rbg.tum.de)
2. Open DevTools → Network tab → Filter by `.m3u8`
3. Copy the full link (including the `jwt` token)
4. Use it as the `videoUrl` in the request body

---

## 🧹 Temporary Files

- Stored in `./temp` by default
- Automatically removed after transcription
- Configurable via environment variables / `.env` (e.g. `VIDEO_STORAGE_PATH`)

---

## 📁 Project Structure

```
nebula/
├── docker-compose.yml
└── src/
    ├── llm_config.example.yml
    ├── llm_config.nebula.yml
    └── transcript/
        ├── Dockerfile
        ├── requirements.txt
        ├── app.py
        ├── align_utils.py
        ├── config.py
        ├── llm_utils.py
        ├── slide_utils.py
        ├── video_utils.py
        └── whisper_utils.py
```

---

## 🛠 Troubleshooting

- 404 from GPT Vision: Check Azure deployment name and API version
- Whisper FP16 warning: safe to ignore when running on CPU
- FFmpeg error: Ensure it's installed and on PATH
- `proxies` error: Use OpenAI SDK ≤ 1.55.3 or strip extra config keys
10 changes: 10 additions & 0 deletions nebula/docker-compose.yml
@@ -0,0 +1,10 @@
version: '3.8'

services:
  transcriber:
    build:
      context: src/transcript
    container_name: nebula-transcriber
    ports:
      - "5000:5000"
    restart: unless-stopped
19 changes: 19 additions & 0 deletions nebula/src/llm_config.example.yml
@@ -0,0 +1,19 @@
- api_key:
  api_version: 2024-02-15-preview
  azure_deployment: gpt-4o
  capabilities:
    context_length: 128000
    gpt_version_equivalent: 4.5
    image_recognition: true
    input_cost: 5
    json_mode: true
    output_cost: 15
    privacy_compliance: true
    self_hosted: false
    vendor: OpenAI
  description: GPT 4 Omni on Azure
  endpoint: "<your-endpoint>"
  id: azure-gpt-4-omni
  model: gpt-4o
  name: GPT 4 Omni
  type: azure_chat
24 changes: 24 additions & 0 deletions nebula/src/transcript/Dockerfile
@@ -0,0 +1,24 @@
FROM python:3.9

# System dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    tesseract-ocr \
    libgl1 \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy all project files
COPY . /app

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Expose API port
EXPOSE 5000

# Run the app with uvicorn (app.py does not start a server on its own)
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000"]
25 changes: 25 additions & 0 deletions nebula/src/transcript/align_utils.py
@@ -0,0 +1,25 @@
from typing import List, Dict, Tuple


def align_slides_with_segments(
    segments: List[Dict],
    slide_timestamps: List[Tuple[float, int]]
) -> List[Dict]:
    """Attach slide numbers to transcript segments based on timestamps."""
    result = []

    for segment in segments:
        slide_number = 1  # Default if no matching timestamp
        for ts, num in reversed(slide_timestamps):
            if ts <= segment["start"]:
                slide_number = num
                break

        result.append({
            "startTime": segment["start"],
            "endTime": segment["end"],
            "text": segment["text"].strip(),
            "slideNumber": slide_number,
        })

    return result
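
A quick worked example of the alignment logic, with hypothetical timestamps:

```python
# Two Whisper segments and two detected slide changes (made-up values)
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the lecture. "},
    {"start": 30.0, "end": 35.5, "text": " Now, on slide two... "},
]
slide_timestamps = [(0.0, 1), (28.0, 2)]  # (timestamp in seconds, slide number)

aligned = align_slides_with_segments(segments, slide_timestamps)
# The first segment gets slideNumber 1; the second gets slideNumber 2,
# since 28.0 is the latest slide change at or before its start time of 30.0.
```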
91 changes: 91 additions & 0 deletions nebula/src/transcript/app.py
@@ -0,0 +1,91 @@
import os
import uuid
import time
import logging
import traceback

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from uvicorn.middleware.proxy_headers import ProxyHeadersMiddleware
from pydantic import BaseModel

from config import Config
from video_utils import download_video, extract_audio, extract_frames_at_timestamps
from whisper_utils import transcribe_with_local_whisper
from slide_utils import ask_gpt_for_slide_number
from align_utils import align_slides_with_segments

app = FastAPI()

# Trust X-Forwarded-* headers if behind a reverse proxy
app.add_middleware(ProxyHeadersMiddleware)

# Enable permissive CORS (adjust in production)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Setup logging
logging.basicConfig(level=getattr(logging, Config.LOG_LEVEL))

# Ensure temp directories exist
Config.ensure_dirs()


class TranscribeRequest(BaseModel):
    videoUrl: str


@app.get("/")
async def home():
    return {"message": "FastAPI server is running!"}


@app.post("/start-transcribe")
async def start_transcribe(req: TranscribeRequest):
    video_url = req.videoUrl
    if not video_url:
        raise HTTPException(status_code=400, detail="Missing videoUrl")

    uid = str(uuid.uuid4())
    video_path = os.path.join(Config.VIDEO_STORAGE_PATH, f"{uid}.mp4")
    audio_path = os.path.join(Config.VIDEO_STORAGE_PATH, f"{uid}.wav")

    try:
        download_video(video_url, video_path)
        extract_audio(video_path, audio_path)

        transcription = transcribe_with_local_whisper(audio_path)
        timestamps = [s["start"] for s in transcription["segments"]]
        frames = extract_frames_at_timestamps(video_path, timestamps)

        slide_timestamps = []
        for ts, img_b64 in frames:
            slide_number = ask_gpt_for_slide_number(img_b64)
            if slide_number is not None:
                slide_timestamps.append((ts, slide_number))
            time.sleep(2)  # Respect GPT rate limit

        segments = align_slides_with_segments(transcription["segments"], slide_timestamps)
        result = {
            "language": transcription.get("language", "en"),
            "segments": segments,
        }

        return result

    except Exception as e:
        traceback.print_exc()
        logging.error("Transcription failed", exc_info=True)
        raise HTTPException(status_code=500, detail=str(e))

    finally:
        # Remove each temp file independently, so a failure on one
        # (e.g. the download never produced a video) doesn't skip the other
        for path in (video_path, audio_path):
            try:
                if os.path.exists(path):
                    os.remove(path)
            except OSError as cleanup_err:
                logging.warning(f"Cleanup failed for {path}: {cleanup_err}")
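
`whisper_utils.py` is not part of this diff. For orientation, a rough sketch of what `transcribe_with_local_whisper` could look like, assuming the `openai-whisper` package from `requirements.txt` and the model name from `config.py`:

```python
# Sketch only -- the actual whisper_utils.py is not shown in this PR.
import whisper

from config import Config


def transcribe_with_local_whisper(audio_path: str) -> dict:
    model = whisper.load_model(Config.WHISPER_MODEL)
    # fp16=False avoids the FP16-on-CPU warning mentioned in the README;
    # the returned dict carries "language" and "segments" as used in app.py
    return model.transcribe(audio_path, fp16=False)
```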
14 changes: 14 additions & 0 deletions nebula/src/transcript/config.py
@@ -0,0 +1,14 @@
import os
from pathlib import Path


class Config:
    BASE_DIR = Path(__file__).resolve().parent
    # Defaults below can be overridden via environment variables (see README)
    VIDEO_STORAGE_PATH = Path(os.getenv("VIDEO_STORAGE_PATH", str(BASE_DIR / "temp")))
    WHISPER_MODEL = os.getenv("WHISPER_MODEL", "base")
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
    LLM_CONFIG_PATH = Path(os.getenv("LLM_CONFIG_PATH", str(BASE_DIR / "llm_config.nebula.yml")))

    @staticmethod
    def ensure_dirs() -> None:
        """Ensure required directories exist."""
        Config.VIDEO_STORAGE_PATH.mkdir(parents=True, exist_ok=True)
36 changes: 36 additions & 0 deletions nebula/src/transcript/llm_utils.py
@@ -0,0 +1,36 @@
import os
import yaml
from openai import AzureOpenAI


def load_llm_config(filename="llm_config.nebula.yml", llm_id="azure-gpt-4-omni"):
    """Load LLM configuration from a YAML file."""
    config_path = os.getenv("LLM_CONFIG_PATH")
    if not config_path:
        this_dir = os.path.dirname(os.path.abspath(__file__))
        config_path = os.path.abspath(os.path.join(this_dir, "..", filename))

    if not os.path.isfile(config_path):
        raise FileNotFoundError(f"LLM config file not found at: {config_path}")

    with open(config_path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)

    for entry in config:
        if entry.get("id") == llm_id:
            return entry

    raise ValueError(f"LLM config with ID '{llm_id}' not found.")


def get_openai_client(llm_id="azure-gpt-4-omni"):
    """Return an AzureOpenAI client and deployment name."""
    config = load_llm_config(llm_id=llm_id)

    client = AzureOpenAI(
        azure_endpoint=config["endpoint"],
        azure_deployment=config["azure_deployment"],
        api_version=config["api_version"],
        api_key=config["api_key"]
    )
    return client, config["azure_deployment"]
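
`slide_utils.py` is likewise not shown in this diff. A sketch of how `ask_gpt_for_slide_number` might use this client; the prompt, parsing, and function body here are illustrative assumptions, not the PR's actual implementation:

```python
from llm_utils import get_openai_client

client, deployment = get_openai_client()


def ask_gpt_for_slide_number(img_b64: str):
    """Ask GPT-4o Vision for the slide number visible in a base64 frame."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Which slide number is visible in this frame? "
                         "Reply with the number only, or 'none'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            ],
        }],
        max_tokens=10,
    )
    text = (response.choices[0].message.content or "").strip()
    return int(text) if text.isdigit() else None
```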
29 changes: 29 additions & 0 deletions nebula/src/transcript/requirements.txt
@@ -0,0 +1,29 @@
fastapi==0.110.0
uvicorn[standard]==0.29.0

# OpenAI SDK supporting GPT-4o and Azure usage
openai==1.25.0 # Supports GPT-4o; stable + fast

# Whisper transcription model (latest from GitHub)
git+https://github.com/openai/whisper.git

# Whisper + Torch dependencies
torch

# Required for audio processing
ffmpeg-python==0.2.0

# For video frame extraction (headless avoids GUI issues)
opencv-python-headless==4.9.0.80

# For image handling and conversion to base64
pillow

# Needed by Whisper and other numerical ops
numpy<2.0.0 # Whisper does not yet support numpy 2.x

# HTTP utils (for things like downloading videos or calling external APIs)
requests==2.31.0

# YAML config loading (used in llm_config.nebula.yml)
pyyaml==6.0.1