Problem
youtube_yt.py:_fetch_transcript_ytdlp() (line 508) hardcodes --sub-lang en — only English auto-captions are attempted. For non-English content, this returns None and the video appears in results without a transcript.
The direct HTTP fallback (_fetch_transcript_direct(), line 470) already supports multi-language: it falls back to the first available caption track when no English track is found. But the primary yt-dlp path does not.
Current behavior:
- yt-dlp path: --sub-lang en → English only → returns None for non-English videos
- Direct HTTP fallback: tries en → en-* → first available track → works for non-English
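The fallback's selection order can be sketched roughly as below. This is an illustration of the en → en-* → first-available rule only, not the actual body of _fetch_transcript_direct(); the function name pick_caption_track and the languageCode key are assumptions made for the example.

```python
from typing import Optional, Sequence


def pick_caption_track(tracks: Sequence[dict]) -> Optional[dict]:
    """Pick a caption track: exact 'en', then any 'en-*', then the first one.

    Hypothetical helper; each track is assumed to be a dict with a
    'languageCode' key.
    """
    if not tracks:
        return None
    # 1. Exact English match.
    for track in tracks:
        if track.get("languageCode") == "en":
            return track
    # 2. Regional English variants such as en-GB or en-US.
    for track in tracks:
        if track.get("languageCode", "").startswith("en-"):
            return track
    # 3. No English at all: fall back to whatever is listed first.
    return tracks[0]
```

The yt-dlp path has no equivalent of step 3 today, which is why non-English videos come back without a transcript.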
Proposed Solution
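Change --sub-lang en to --sub-lang en,es,pt in _fetch_transcript_ytdlp(). yt-dlp already accepts comma-separated language lists. Note that it downloads subtitles for every listed language that is available rather than stopping at the first match, so the code that reads the result should prefer languages in the listed order.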
# youtube_yt.py line 508, before:
"--sub-lang", "en",
# after:
"--sub-lang", "en,es,pt",
Optionally make this configurable via env var LAST30DAYS_YT_SUB_LANGS with default en,es,pt, read in env.py.
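A minimal sketch of how the env var could be read, assuming a small helper in env.py. The name get_sub_langs and the normalization rules (trim whitespace, drop empty entries, fall back to the default when nothing usable remains) are assumptions, not existing code.

```python
import os
from typing import Mapping, Optional

DEFAULT_SUB_LANGS = "en,es,pt"


def get_sub_langs(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the comma-separated language list to pass to --sub-lang.

    Hypothetical helper: reads LAST30DAYS_YT_SUB_LANGS and falls back to
    the default when the variable is unset or contains no usable entries.
    """
    env = os.environ if environ is None else environ
    raw = env.get("LAST30DAYS_YT_SUB_LANGS", "")
    # Tolerate stray spaces and empty items such as "en, ,fr".
    langs = [part.strip() for part in raw.split(",") if part.strip()]
    return ",".join(langs) if langs else DEFAULT_SUB_LANGS
```

The call site would then pass "--sub-lang", get_sub_langs() instead of the hardcoded string.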
LLMs understand all three languages natively. A transcript in any of these is better than no transcript.
Estimated impact: 30-50% more transcripts captured, especially for non-English content.
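One practical detail of a multi-language request: yt-dlp may write a separate subtitle file for each requested language that is available, so the caller has to choose one. The sketch below picks the file for the highest-priority language. It assumes files are named <video_id>.<lang>.<ext> (which depends on the output template in use) and the helper name pick_subtitle_file is hypothetical.

```python
from pathlib import Path
from typing import Optional, Sequence


def pick_subtitle_file(
    directory: str,
    video_id: str,
    langs: Sequence[str] = ("en", "es", "pt"),
    ext: str = "vtt",
) -> Optional[Path]:
    """Return the subtitle file for the first language in `langs` that exists."""
    for lang in langs:
        # Assumed naming: <video_id>.<lang>.<ext>, e.g. abc123.es.vtt
        candidate = Path(directory) / f"{video_id}.{lang}.{ext}"
        if candidate.is_file():
            return candidate
    # None of the requested languages produced a file.
    return None
```

Keeping the priority order in one place means the same list can drive both the --sub-lang argument and the file selection.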
Alternatives Considered
- Use only en,en-orig — catches original language when auto-caption exists, but misses non-English creators entirely
- Download ALL available languages — wastes bandwidth and storage for marginal gain
- Make it configurable without default — users won't configure it, same problem persists
- Rely on direct HTTP fallback only, which is less reliable because it lacks yt-dlp's retry logic and timeout handling