
Feature/youtube ingestion #13859

Open
R-Oussama-INFOMINEO wants to merge 32 commits into infiniflow:main from R-Oussama-INFOMINEO:feature/youtube-ingestion

Conversation

@R-Oussama-INFOMINEO

What problem does this PR solve?

Adds YouTube video ingestion: transcripts are fetched via youtube-transcript-api by default, with optional local or API-based Whisper transcription as fallbacks, plus supporting dataset, metadata, and retrieval tooling for web, image, and PDF sources.

Type of change

  • New Feature (non-breaking change which adds functionality)

- Switch DOC_ENGINE from elasticsearch to infinity
- Enable tei-cpu COMPOSE_PROFILES for local embedding
- Set TEI_MODEL to BAAI/bge-small-en-v1.5 (CPU-friendly, 1.2GB)
- Set TZ to Africa/Casablanca
- Set DOC_BULK_SIZE=4 and EMBEDDING_BATCH_SIZE=8 for CPU performance
- Add fix-tenant-embedding.sh: fixes @None binding bug in v0.24.0
  when using local TEI/Builtin embedding provider
- Add configurable whisper_backend to _fetch_transcript():
  youtube-transcript-api (default), faster-whisper, openai-whisper, openai-api
- Add _download_audio() helper via yt-dlp (m4a direct, no FFmpeg postprocessor)
- Add _segments_from_whisper_result() shared helper
- Wire parser_config through to call site in chunk()
- Dockerfile.custom: add ffmpeg, faster-whisper, yt-dlp, HF SSL bypass
- Dockerfile.custom: pre-download faster-whisper tiny model at build time
- deploy-local.sh: add running image verification
- pyproject.toml: declare yt-dlp, faster-whisper, openai-whisper deps
- Add whisper_backend, whisper_model, openai_api_key, video_title
  as optional fields to ParserConfig in validation_utils.py
- Allows dataset creation with whisper config via the API without
  triggering Pydantic's "Extra inputs are not permitted" error
- Full pipeline: create dataset → ingest → parse → retrieve → display
- Supports all 4 Whisper backends + PDF ingestion
- MCP-ready function signatures
- compare_backends() for side-by-side backend comparison
- .env.test added to .gitignore to protect API keys
- Add market, trim, retrieval_date to ParserConfig validator
- Auto-generate dataset names: Brand_Model_Year_Market_Trim_Type_YYYYMMDD_HHMM
- Add MARKET_ISO_CODES dict with normalization helper
- Add get_datasets_by_brand_model() with market + trim filters
- Add retrieve_by_brand_model() with full metadata filtering
- Re-ingest all 3 sources with correct metadata
- Add 208_test.pdf Peugeot test file
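The _download_audio() helper above is described as fetching the m4a stream directly via yt-dlp so that no FFmpeg postprocessor runs. A minimal sketch of the options involved — the option names are real yt-dlp options, but the chosen values are assumptions, not the PR's code:

```python
# Hedged sketch: yt-dlp options for a direct m4a audio download with no
# FFmpeg postprocessing step. Values are illustrative assumptions.
YDL_OPTS = {
    "format": "bestaudio[ext=m4a]/bestaudio",  # prefer native m4a, avoid re-encoding
    "outtmpl": "/tmp/%(id)s.%(ext)s",          # write to /tmp keyed by video id
    "quiet": True,
    "noplaylist": True,                        # single video, never the whole playlist
}
```

With options like these the download itself is just `yt_dlp.YoutubeDL(YDL_OPTS).download([url])`; because the m4a stream is taken as-is, ffmpeg is only needed by the Whisper backends, not by the download step.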
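A hedged sketch of the shared _segments_from_whisper_result() helper: the PR only names the function, so the field handling below is an assumption based on the public output shapes of openai-whisper (a dict with a "segments" list) and faster-whisper (an iterable of Segment objects):

```python
# Illustrative sketch, not the PR's actual code: normalise output from
# openai-whisper (dict with "segments") or faster-whisper (iterable of
# Segment objects with .start/.end/.text) into plain dicts.
def _segments_from_whisper_result(result):
    raw = result.get("segments", []) if isinstance(result, dict) else result
    segments = []
    for seg in raw:
        if isinstance(seg, dict):  # openai-whisper style
            start, end, text = seg["start"], seg["end"], seg["text"]
        else:                      # faster-whisper Segment style
            start, end, text = seg.start, seg.end, seg.text
        segments.append({"start": float(start), "end": float(end),
                         "text": text.strip()})
    return segments
```

Keeping one output shape here is what lets chunk() treat all four backends identically downstream.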
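The new optional ParserConfig fields can be sketched with Pydantic like so. The field names come from the PR; the defaults and the extra="forbid" config (the source of the "Extra inputs are not permitted" error) are assumptions, not the actual validation_utils.py code:

```python
# Hedged sketch of the optional whisper fields added to ParserConfig.
from typing import Optional
from pydantic import BaseModel, ConfigDict

class ParserConfig(BaseModel):
    # extra="forbid" is what raises "Extra inputs are not permitted"
    # for undeclared fields (assumed, consistent with the reported error).
    model_config = ConfigDict(extra="forbid")
    # ...existing fields omitted for brevity...
    whisper_backend: Optional[str] = None  # youtube-transcript-api | faster-whisper | openai-whisper | openai-api
    whisper_model: Optional[str] = None    # e.g. "tiny" in the CPU image
    openai_api_key: Optional[str] = None
    video_title: Optional[str] = None
```

Declaring the fields (rather than loosening the extra policy) keeps strict validation for genuinely unknown keys.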
New functions:
- create_web_dataset()  — chunk_method=naive, source_type=Web
- create_image_dataset() — chunk_method=picture, source_type=Images
- ingest_html()         — local file or URL, auto-download to /tmp
- ingest_image()        — local file or URL, auto MIME detection
- run_web_pipeline()    — full end-to-end HTML pipeline
- run_image_pipeline()  — full end-to-end image pipeline
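ingest_image() is described as doing auto MIME detection for both local files and URLs; one stdlib-only way to sketch that (guess_image_mime is a hypothetical name, not the PR's helper):

```python
# Hedged sketch of auto MIME detection for a local path or URL,
# using only the standard library.
import mimetypes
import os
from urllib.parse import urlparse

def guess_image_mime(path_or_url: str) -> str:
    """Guess a MIME type from the file extension, stripping any URL
    query string first; fall back to application/octet-stream."""
    name = os.path.basename(urlparse(path_or_url).path)
    mime, _ = mimetypes.guess_type(name)
    return mime or "application/octet-stream"
```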

__main__ updated with Option I (HTML) and Option J (Images),
both local file and URL variants documented.

Tested and verified:
- Opel_Corsa_2025_IE_All_Web   — 1 chunk, similarity 0.6896, 5.7s
- Peugeot_208_2023_FR_All_Images — 1 chunk, similarity 0.7118, 5.8s
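The dataset names above follow the auto-generated Brand_Model_Year_Market_Trim_Type_YYYYMMDD_HHMM scheme from the commit list; a minimal sketch, with make_dataset_name as a hypothetical function name:

```python
# Illustrative sketch of the dataset naming scheme; not the PR's code.
from datetime import datetime

def make_dataset_name(brand, model, year, market, trim, source_type, now=None):
    now = now or datetime.now()
    parts = [brand, model, str(year), market, trim, source_type,
             now.strftime("%Y%m%d_%H%M")]
    # Collapse spaces so names stay filesystem- and API-friendly.
    return "_".join(p.replace(" ", "") for p in parts)
```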

validation_utils.py — no changes needed:
- chunk_method=picture already in allowed set
- source_type is free-form str, no enum validation
@dosubot added labels (Mar 30, 2026): size:XXL — this PR changes 1000+ lines, ignoring generated files; 🐖api — the modified files are located under directory 'api/apps/sdk'; 💞 feature — feature request, pull request that fulfills a new feature.
@yingfeng (Member)

Conflicts

