
Feature/youtube ingestion #13859

Open
R-Oussama-INFOMINEO wants to merge 32 commits into infiniflow:main from R-Oussama-INFOMINEO:feature/youtube-ingestion

Conversation

@R-Oussama-INFOMINEO

What problem does this PR solve?

Adds YouTube video ingestion: transcripts are fetched via youtube-transcript-api by default, with optional local or API-based Whisper transcription as fallbacks, plus supporting dataset, metadata, and retrieval tooling for web, image, and PDF sources.

Type of change

  • New Feature (non-breaking change which adds functionality)

- Switch DOC_ENGINE from elasticsearch to infinity
- Enable tei-cpu COMPOSE_PROFILES for local embedding
- Set TEI_MODEL to BAAI/bge-small-en-v1.5 (CPU-friendly, 1.2GB)
- Set TZ to Africa/Casablanca
- Set DOC_BULK_SIZE=4 and EMBEDDING_BATCH_SIZE=8 for CPU performance
- Add fix-tenant-embedding.sh: fixes @None binding bug in v0.24.0
  when using local TEI/Builtin embedding provider
- Add configurable whisper_backend to _fetch_transcript():
  youtube-transcript-api (default), faster-whisper, openai-whisper, openai-api
- Add _download_audio() helper via yt-dlp (m4a direct, no FFmpeg postprocessor)
- Add _segments_from_whisper_result() shared helper
- Wire parser_config through to call site in chunk()
- Dockerfile.custom: add ffmpeg, faster-whisper, yt-dlp, HF SSL bypass
- Dockerfile.custom: pre-download faster-whisper tiny model at build time
- deploy-local.sh: add running image verification
- pyproject.toml: declare yt-dlp, faster-whisper, openai-whisper deps
- Add whisper_backend, whisper_model, openai_api_key, video_title
  as optional fields to ParserConfig in validation_utils.py
- Allows dataset creation with whisper config via the API without
  triggering Pydantic's "Extra inputs are not permitted" error
- Full pipeline: create dataset → ingest → parse → retrieve → display
- Supports all 4 Whisper backends + PDF ingestion
- MCP-ready function signatures
- compare_backends() for side-by-side backend comparison
- .env.test added to .gitignore to protect API keys
- Add market, trim, retrieval_date to ParserConfig validator
- Auto-generate dataset names: Brand_Model_Year_Market_Trim_Type_YYYYMMDD_HHMM
- Add MARKET_ISO_CODES dict with normalization helper
- Add get_datasets_by_brand_model() with market + trim filters
- Add retrieve_by_brand_model() with full metadata filtering
- Re-ingest all 3 sources with correct metadata
- Add 208_test.pdf Peugeot test file
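The _download_audio() helper above is described as fetching the m4a stream directly via yt-dlp so that no FFmpeg postprocessor runs. A minimal sketch of the options involved — the option names are real yt-dlp options, but the chosen values are assumptions, not the PR's code:

```python
# Hedged sketch: yt-dlp options for a direct m4a audio download with no
# FFmpeg postprocessing step. Values are illustrative assumptions.
YDL_OPTS = {
    "format": "bestaudio[ext=m4a]/bestaudio",  # prefer native m4a, avoid re-encoding
    "outtmpl": "/tmp/%(id)s.%(ext)s",          # write to /tmp keyed by video id
    "quiet": True,
    "noplaylist": True,                        # single video, never the whole playlist
}
```

With options like these the download itself is just `yt_dlp.YoutubeDL(YDL_OPTS).download([url])`; because the m4a stream is taken as-is, ffmpeg is only needed by the Whisper backends, not by the download step.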
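A hedged sketch of the shared _segments_from_whisper_result() helper: the PR only names the function, so the field handling below is an assumption based on the public output shapes of openai-whisper (a dict with a "segments" list) and faster-whisper (an iterable of Segment objects):

```python
# Illustrative sketch, not the PR's actual code: normalise output from
# openai-whisper (dict with "segments") or faster-whisper (iterable of
# Segment objects with .start/.end/.text) into plain dicts.
def _segments_from_whisper_result(result):
    raw = result.get("segments", []) if isinstance(result, dict) else result
    segments = []
    for seg in raw:
        if isinstance(seg, dict):  # openai-whisper style
            start, end, text = seg["start"], seg["end"], seg["text"]
        else:                      # faster-whisper Segment style
            start, end, text = seg.start, seg.end, seg.text
        segments.append({"start": float(start), "end": float(end),
                         "text": text.strip()})
    return segments
```

Keeping one output shape here is what lets chunk() treat all four backends identically downstream.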
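The new optional ParserConfig fields can be sketched with Pydantic like so. The field names come from the PR; the defaults and the extra="forbid" config (the source of the "Extra inputs are not permitted" error) are assumptions, not the actual validation_utils.py code:

```python
# Hedged sketch of the optional whisper fields added to ParserConfig.
from typing import Optional
from pydantic import BaseModel, ConfigDict

class ParserConfig(BaseModel):
    # extra="forbid" is what raises "Extra inputs are not permitted"
    # for undeclared fields (assumed, consistent with the reported error).
    model_config = ConfigDict(extra="forbid")
    # ...existing fields omitted for brevity...
    whisper_backend: Optional[str] = None  # youtube-transcript-api | faster-whisper | openai-whisper | openai-api
    whisper_model: Optional[str] = None    # e.g. "tiny" in the CPU image
    openai_api_key: Optional[str] = None
    video_title: Optional[str] = None
```

Declaring the fields (rather than loosening the extra policy) keeps strict validation for genuinely unknown keys.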
New functions:
- create_web_dataset()  — chunk_method=naive, source_type=Web
- create_image_dataset() — chunk_method=picture, source_type=Images
- ingest_html()         — local file or URL, auto-download to /tmp
- ingest_image()        — local file or URL, auto MIME detection
- run_web_pipeline()    — full end-to-end HTML pipeline
- run_image_pipeline()  — full end-to-end image pipeline
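ingest_image() is described as doing auto MIME detection for both local files and URLs; one stdlib-only way to sketch that (guess_image_mime is a hypothetical name, not the PR's helper):

```python
# Hedged sketch of auto MIME detection for a local path or URL,
# using only the standard library.
import mimetypes
import os
from urllib.parse import urlparse

def guess_image_mime(path_or_url: str) -> str:
    """Guess a MIME type from the file extension, stripping any URL
    query string first; fall back to application/octet-stream."""
    name = os.path.basename(urlparse(path_or_url).path)
    mime, _ = mimetypes.guess_type(name)
    return mime or "application/octet-stream"
```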

__main__ updated with Option I (HTML) and Option J (Images),
both local file and URL variants documented.

Tested and verified:
- Opel_Corsa_2025_IE_All_Web   — 1 chunk, similarity 0.6896, 5.7s
- Peugeot_208_2023_FR_All_Images — 1 chunk, similarity 0.7118, 5.8s
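The dataset names above follow the auto-generated Brand_Model_Year_Market_Trim_Type_YYYYMMDD_HHMM scheme from the commit list; a minimal sketch, with make_dataset_name as a hypothetical function name:

```python
# Illustrative sketch of the dataset naming scheme; not the PR's code.
from datetime import datetime

def make_dataset_name(brand, model, year, market, trim, source_type, now=None):
    now = now or datetime.now()
    parts = [brand, model, str(year), market, trim, source_type,
             now.strftime("%Y%m%d_%H%M")]
    # Collapse spaces so names stay filesystem- and API-friendly.
    return "_".join(p.replace(" ", "") for p in parts)
```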

validation_utils.py — no changes needed:
- chunk_method=picture already in allowed set
- source_type is free-form str, no enum validation
@dosubot added labels (Mar 30, 2026): size:XXL — this PR changes 1000+ lines, ignoring generated files; 🐖api — the modified files are located under directory 'api/apps/sdk'; 💞 feature — feature request, pull request that fulfills a new feature.
@yingfeng (Member)

Conflicts

