Feature/youtube ingestion #13859
Open
R-Oussama-INFOMINEO wants to merge 32 commits into infiniflow:main
- Switch DOC_ENGINE from elasticsearch to infinity
- Enable tei-cpu COMPOSE_PROFILES for local embedding
- Set TEI_MODEL to BAAI/bge-small-en-v1.5 (CPU-friendly, 1.2GB)
- Set TZ to Africa/Casablanca
- Set DOC_BULK_SIZE=4 and EMBEDDING_BATCH_SIZE=8 for CPU performance
- Add fix-tenant-embedding.sh: fixes @None binding bug in v0.24.0 when using local TEI/Builtin embedding provider
…ers, fix positions handling
- Add configurable whisper_backend to _fetch_transcript(): youtube-transcript-api (default), faster-whisper, openai-whisper, openai-api
- Add _download_audio() helper via yt-dlp (m4a direct, no FFmpeg postprocessor)
- Add _segments_from_whisper_result() shared helper
- Wire parser_config through to call site in chunk()
- Dockerfile.custom: add ffmpeg, faster-whisper, yt-dlp, HF SSL bypass
- Dockerfile.custom: pre-download faster-whisper tiny model at build time
- deploy-local.sh: add running image verification
- pyproject.toml: declare yt-dlp, faster-whisper, openai-whisper deps
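The backend selection and the shared segment helper described in this commit could look roughly like the sketch below. It is not the PR's code: `resolve_whisper_backend`, `normalize_segments`, and the output shape (`start`, `end`, `text`) are hypothetical names chosen for illustration. The only facts it relies on are that openai-whisper's `transcribe()` returns a dict with a `segments` list of dicts, and that faster-whisper yields segment objects exposing `start`, `end`, and `text` attributes.

```python
from typing import Any, Dict, List, Optional

# The four backend names listed in the commit message; the first is the default.
SUPPORTED_BACKENDS = (
    "youtube-transcript-api",
    "faster-whisper",
    "openai-whisper",
    "openai-api",
)


def resolve_whisper_backend(parser_config: Optional[Dict[str, Any]]) -> str:
    """Pick the backend from parser_config, falling back to the default."""
    backend = (parser_config or {}).get("whisper_backend") or SUPPORTED_BACKENDS[0]
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported whisper_backend: {backend!r}")
    return backend


def _field(segment: Any, name: str) -> Any:
    """Read a field from either a dict segment or an attribute-style segment."""
    return segment[name] if isinstance(segment, dict) else getattr(segment, name)


def normalize_segments(result: Any) -> List[Dict[str, Any]]:
    """Convert a Whisper-style result into a uniform list of segment dicts.

    Accepts either a dict with a "segments" list (openai-whisper style) or an
    iterable of objects with start/end/text attributes (faster-whisper style).
    Segments whose text is empty after stripping are dropped.
    """
    raw = result.get("segments", []) if isinstance(result, dict) else result
    segments = []
    for seg in raw:
        text = str(_field(seg, "text")).strip()
        if text:
            segments.append(
                {
                    "start": float(_field(seg, "start")),
                    "end": float(_field(seg, "end")),
                    "text": text,
                }
            )
    return segments
```

Keeping the normalization in one place means every backend hands the chunker the same structure, so timestamp handling only needs to be written once.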
- Add whisper_backend, whisper_model, openai_api_key, video_title as optional fields to ParserConfig in validation_utils.py
- Allows dataset creation with whisper config via the API without triggering the "Extra inputs are not permitted" validation error
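For context, "Extra inputs are not permitted" is the message Pydantic v2 emits when a model forbids extra fields and receives an undeclared one. The sketch below reproduces that behavior with a stand-in model. `ParserConfigSketch` is hypothetical and much smaller than the real ParserConfig in validation_utils.py; it assumes Pydantic v2 and shows only the four fields this commit names.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class ParserConfigSketch(BaseModel):
    """Illustrative stand-in, not RAGFlow's actual ParserConfig."""

    # Forbidding extras is what makes undeclared keys fail validation.
    model_config = ConfigDict(extra="forbid")

    # Declaring the fields as optional lets callers send or omit them.
    whisper_backend: Optional[str] = None
    whisper_model: Optional[str] = None
    openai_api_key: Optional[str] = None
    video_title: Optional[str] = None


def validate_parser_config(payload: dict) -> ParserConfigSketch:
    """Validate a request payload, raising ValidationError on unknown keys."""
    return ParserConfigSketch(**payload)


if __name__ == "__main__":
    ok = validate_parser_config({"whisper_backend": "faster-whisper"})
    print(ok.whisper_backend)
    try:
        validate_parser_config({"not_declared": 1})
    except ValidationError as exc:
        print(exc.errors()[0]["msg"])
```

Declaring each key explicitly keeps the strict validation in place for everything else, which is usually preferable to relaxing `extra` for the whole model.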
- Full pipeline: create dataset → ingest → parse → retrieve → display
- Supports all 4 Whisper backends + PDF ingestion
- MCP-ready function signatures
- compare_backends() for side-by-side backend comparison
- .env.test added to .gitignore to protect API keys
- Add market, trim, retrieval_date to ParserConfig validator
- Auto-generate dataset names: Brand_Model_Year_Market_Trim_Type_YYYYMMDD_HHMM
- Add MARKET_ISO_CODES dict with normalization helper
- Add get_datasets_by_brand_model() with market + trim filters
- Add retrieve_by_brand_model() with full metadata filtering
- Re-ingest all 3 sources with correct metadata
- Add 208_test.pdf Peugeot test file
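The naming scheme and market normalization above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the contents of `MARKET_ISO_CODES` here are a made-up three-entry subset, `normalize_market` and `build_dataset_name` are hypothetical names, and the "All" fallback for a missing trim is inferred from dataset names quoted elsewhere in this PR (for example `Opel_Corsa_2025_IE_All_Web`).

```python
from datetime import datetime
from typing import Optional

# Illustrative subset only; the PR's actual dict is not shown in this page.
MARKET_ISO_CODES = {"france": "FR", "ireland": "IE", "morocco": "MA"}


def normalize_market(market: str) -> str:
    """Map a market name or two-letter code to an upper-case ISO code."""
    key = market.strip().lower()
    if key in MARKET_ISO_CODES:
        return MARKET_ISO_CODES[key]
    if len(key) == 2 and key.isalpha():
        return key.upper()
    raise ValueError(f"Unknown market: {market!r}")


def build_dataset_name(
    brand: str,
    model: str,
    year: int,
    market: str,
    trim: Optional[str],
    source_type: str,
    when: Optional[datetime] = None,
) -> str:
    """Build Brand_Model_Year_Market_Trim_Type_YYYYMMDD_HHMM."""
    when = when or datetime.now()
    parts = [brand, model, str(year), normalize_market(market), trim or "All", source_type]
    # Replace spaces so that the underscore stays an unambiguous separator.
    safe = [p.strip().replace(" ", "-") for p in parts]
    return "_".join(safe) + when.strftime("_%Y%m%d_%H%M")
```

Passing `when` explicitly makes the name reproducible in tests; in normal use it defaults to the current time.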
…th brand/model/year/market/trim params
New functions:
- create_web_dataset(): chunk_method=naive, source_type=Web
- create_image_dataset(): chunk_method=picture, source_type=Images
- ingest_html(): local file or URL, auto-download to /tmp
- ingest_image(): local file or URL, auto MIME detection
- run_web_pipeline(): full end-to-end HTML pipeline
- run_image_pipeline(): full end-to-end image pipeline

__main__ updated with Option I (HTML) and Option J (Images), both local file and URL variants documented.

Tested and verified:
- Opel_Corsa_2025_IE_All_Web: 1 chunk, similarity 0.6896, 5.7s
- Peugeot_208_2023_FR_All_Images: 1 chunk, similarity 0.7118, 5.8s

validation_utils.py, no changes needed:
- chunk_method=picture already in allowed set
- source_type is free-form str, no enum validation
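The "local file or URL, auto MIME detection" behavior can be approximated with the standard library alone. The sketch below is an assumption about how such detection might work, not the PR's `ingest_image()`: `is_url` and `guess_image_mime` are hypothetical helpers, and detection here is based purely on the file extension rather than on the file's bytes.

```python
import mimetypes
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Treat only http(s) sources as URLs; everything else is a local path."""
    return urlparse(source).scheme in ("http", "https")


def guess_image_mime(source: str) -> str:
    """Guess an image MIME type from a local path or URL by its extension.

    For URLs, only the path component is inspected, so query strings such as
    "?v=2" do not interfere with the extension lookup.
    """
    path = urlparse(source).path if is_url(source) else source
    mime, _encoding = mimetypes.guess_type(PurePosixPath(path).name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {source!r}")
    return mime
```

Extension-based detection is cheap but trusts the file name; a stricter variant would inspect the file's leading bytes after download.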
…ama-INFOMINEO/ragflow into feature/youtube-ingestion
What problem does this PR solve?
Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR.
Type of change