Parameters for the feature extractor. Each extractor type has specific parameters. See the schema for your chosen extractor (e.g., MultimodalExtractorParams for multimodal_extractor).
| Name | Type | Description | Notes |
|---|---|---|---|
| extractor_type | str | Custom plugin extractor type (plugin name) | |
| source_type | str | Source content type. Use 'youtube' to resolve YouTube URLs to caption text before embedding. Default: 'text' (plain text input). | [optional] [default to 'text'] |
| split_by | TextSplitStrategy | Strategy for splitting text into multiple documents. | [optional] |
| chunk_size | int | Target size for each chunk (in units of chunk_strategy). | [optional] [default to 500] |
| chunk_overlap | int | Overlap between chunks to preserve context. | [optional] [default to 50] |
| segment_length_seconds | int | Length of each transcript segment in seconds (for time_segments split strategy). Shorter segments give more precise search results but more documents. | [optional] [default to 120] |
| language | str | Preferred language code for YouTube captions (when source_type='youtube'). | [optional] [default to 'en'] |
| extract_captions | bool | Extract auto-captions or manual subtitles from YouTube videos (when source_type='youtube'). Falls back to video description if False. | [optional] [default to True] |
| response_shape | ResponseShape3 | | [optional] |
| llm_provider | str | LLM provider for structured extraction: 'openai', 'google', or 'anthropic'. | [optional] |
| llm_model | str | LLM model for structured extraction. | [optional] |
| llm_api_key | str | API key for LLM operations (BYOK - Bring Your Own Key). Supports a direct key (e.g. 'sk-proj-abc123...') or a secret reference ('{{SECRET.openai_api_key}}'). When using a secret reference, the key is loaded from your organization's secrets vault at runtime. Store secrets via POST /v1/organizations/secrets. If not provided, uses Mixpeek's default API keys. | [optional] |
| split_method | SplitMethod | The PRIMARY control for video splitting strategy. This determines which splitting method is used. | [optional] |
| description_prompt | str | The prompt to use for description generation. | [optional] [default to 'Describe the video segment in detail.'] |
| time_split_interval | int | Interval in seconds for 'time' splitting. Used when split_method='time'. | [optional] [default to 10] |
| silence_db_threshold | int | The decibel level below which audio is considered silent. Used when split_method='silence'. Recommended value: -40 (auto-applied if not specified). Lower values (e.g., -50) detect more silence, higher values (e.g., -30) detect less. | [optional] |
| scene_detection_threshold | float | Scene detection sensitivity (0.0-1.0). | [optional] [default to 0.3] |
| run_transcription | bool | Whether to run transcription on video segments. | [optional] [default to False] |
| transcription_language | str | The language of the transcription. Used when run_transcription is True. | [optional] [default to 'en'] |
| run_video_description | bool | Whether to generate descriptions for video segments. OPTIMIZED: Defaults to False as descriptions add 1-2 minutes. Enable only when needed. | [optional] [default to False] |
| run_transcription_embedding | bool | Whether to generate embeddings for transcriptions. Useful for semantic search over spoken content. | [optional] [default to False] |
| run_multimodal_embedding | bool | Whether to generate multimodal embeddings for all content types (video/image/gif/text). Uses Google Vertex AI to create unified 1408D embeddings in a shared semantic space. Useful for cross-modal semantic search across all media types. | [optional] [default to True] |
| run_ocr | bool | Whether to run OCR to extract text from video frames. OPTIMIZED: Defaults to False as OCR adds significant processing time. Enable only when text extraction from video is required. | [optional] [default to False] |
| sensitivity | str | The sensitivity of the scene detection. | [optional] [default to 'low'] |
| enable_thumbnails | bool | Whether to generate thumbnail images. | [optional] [default to True] |
| use_cdn | bool | Use CDN for thumbnail delivery. | [optional] [default to False] |
| generation_config | GenerationConfig | | [optional] |
| detection_model | str | SCRFD model for face detection. 'scrfd_500m': Fastest (2-3ms). 'scrfd_2.5g': Balanced (5-7ms), recommended. 'scrfd_10g': Highest accuracy (10-15ms). | [optional] [default to 'scrfd_2.5g'] |
| min_face_size | int | Minimum face size in pixels to detect. 20px: Balanced. 40px: Higher quality. 10px: Maximum recall. | [optional] [default to 20] |
| detection_threshold | float | Confidence threshold for face detection (0.0-1.0). | [optional] [default to 0.5] |
| max_faces_per_image | int | Maximum number of faces to process per image. None: Process all. | [optional] |
| normalize_embeddings | bool | L2-normalize embeddings to unit vectors (recommended). | [optional] [default to True] |
| enable_quality_scoring | bool | Compute quality scores (blur, size, landmarks). Adds ~5ms per face. | [optional] [default to True] |
| quality_threshold | float | Minimum quality score to index faces. None: Index all faces. 0.5: Moderate filtering. 0.7: High quality only. | [optional] |
| max_video_length | int | Maximum video length in seconds. 60: Default. 10: Recommended for retrieval. 300: Maximum (extraction only). | [optional] [default to 60] |
| video_sampling_fps | float | Frames per second to sample from video. 1.0: One frame per second (recommended). | [optional] [default to 1] |
| video_deduplication | bool | Remove duplicate faces across video frames (extraction only). Reduces 90-95% redundancy. NOT used in retrieval. | [optional] [default to True] |
| video_deduplication_threshold | float | Cosine similarity threshold for deduplication. 0.8: Conservative (default). | [optional] [default to 0.8] |
| output_mode | str | 'per_face': One document per face (recommended). 'per_image': One doc per image with faces array. | [optional] [default to 'per_face'] |
| include_face_crops | bool | Include aligned 112×112 face crops as base64. Adds ~5KB per face. | [optional] [default to False] |
| include_source_frame_thumbnail | bool | Include resized source frame/image as base64 thumbnail (~15-30KB per face). Used for display with bounding box overlay. | [optional] [default to False] |
| store_detection_metadata | bool | Store bbox, landmarks, detection scores. Recommended for debugging. | [optional] [default to True] |
| use_layout_detection | bool | Enable ML-based layout detection to find ALL document elements (text, images, tables, figures). When enabled, uses the configured layout_detector to detect and extract both text regions AND non-text elements (scanned images, figures, charts) as separate documents. Recommended for: Scanned documents, image-heavy PDFs, mixed content documents. When disabled: Falls back to text-only extraction (faster but misses images). Default: True (detects all elements including images). | [optional] [default to True] |
| layout_detector | str | Layout detection engine to use when use_layout_detection=True. 'pymupdf': Fast, rule-based detection using PyMuPDF heuristics (~15 pages/sec). 'docling': SOTA ML-based detection using IBM Docling with DiT model (~3-8 sec/doc). Docling advantages: Better semantic type detection (section_header vs paragraph), true table structure extraction (rows/cols), more accurate figure detection. PyMuPDF advantages: Much faster, lower memory usage, simpler dependencies. Default: 'pymupdf' for speed. Use 'docling' for accuracy-critical applications. | [optional] [default to 'pymupdf'] |
| vertical_threshold | float | Maximum vertical gap (in points) between lines to be grouped in same block. Increase for looser grouping, decrease for tighter blocks. Default 15pt works well for standard documents. | [optional] [default to 15] |
| horizontal_threshold | float | Maximum horizontal distance (in points) for overlap detection. Affects column detection and block merging. Increase for wider columns, decrease for narrow layouts. | [optional] [default to 50] |
| min_text_length | int | Minimum text length (characters) to keep a block. Blocks with less text are filtered out. Helps remove noise and tiny fragments. | [optional] [default to 20] |
| base_confidence | float | Base confidence score for embedded (native) text. Penalties are subtracted for OCR artifacts, encoding issues, etc. | [optional] [default to 0.85] |
| min_confidence_for_vlm | float | Confidence threshold below which VLM correction is triggered. Blocks with confidence < this value get sent to VLM for correction. Only applies when use_vlm_correction=True. | [optional] [default to 0.6] |
| use_vlm_correction | bool | Enable VLM (Vision Language Model) correction for low-confidence blocks. Uses Gemini/GPT-4V to correct OCR errors by analyzing the page image. Significantly slower (~1 page/sec) but improves accuracy for degraded docs. | [optional] [default to True] |
| fast_mode | bool | Skip VLM correction entirely for maximum throughput (~15 pages/sec). Overrides use_vlm_correction. Use when speed is more important than accuracy. | [optional] [default to False] |
| vlm_provider | str | VLM provider: 'google' (Gemini API) or 'vllm' (local GPU with Qwen2.5-VL). | [optional] [default to 'google'] |
| vlm_model | str | VLM model. For google: 'gemini-2.5-flash'. For vllm: 'Qwen/Qwen2.5-VL-7B-Instruct'. | [optional] [default to 'gemini-2.5-flash'] |
| run_text_embedding | bool | Generate E5 text embeddings (1024D) for transcripts and text. | [optional] [default to True] |
| render_dpi | int | DPI for page rendering (used for VLM correction). 72: Fast, lower quality. 150: Balanced (recommended). 300: High quality, slower. | [optional] [default to 150] |
| generate_thumbnails | bool | Generate thumbnail images for each learning unit. | [optional] [default to True] |
| thumbnail_mode | str | Thumbnail generation mode. 'full_page': Low-res thumbnail of entire page. 'segment': Cropped thumbnail of just the block's bounding box. 'both': Generate both types (recommended for flexibility). | [optional] [default to 'both'] |
| thumbnail_dpi | int | DPI for thumbnail generation. Lower DPI = smaller files. 72: Standard web quality. 36: Very small thumbnails. | [optional] [default to 72] |
| model_name | str | HuggingFace model name for sentiment classification. | [optional] [default to 'distilbert-base-uncased-finetuned-sst-2-english'] |
| max_length | int | Maximum token length | [optional] [default to 512] |
| batch_size | int | Inference batch size | [optional] [default to 32] |
| return_all_scores | bool | Return scores for all classes, not just top | [optional] [default to True] |
| embed | bool | Generate E5 embeddings for semantic retrieval alongside classification. Uses the internal E5 embedding service for 1024-dimensional vectors. | [optional] [default to False] |
| max_depth | int | Maximum link depth to crawl. 0=seed page only, 1=seed+direct links, etc. Default: 2. Increase for comprehensive crawls, decrease for targeted extraction. | [optional] [default to 2] |
| max_pages | int | Maximum pages to crawl. Default: 50. Set higher (1000+) for large documentation sites. Max: 1,000,000. | [optional] [default to 50] |
| crawl_timeout | int | Maximum total time for crawling in seconds. Default: 300 (5 minutes). Increase for large sites with many pages. Max: 3600 (1 hour). | [optional] [default to 300] |
| crawl_mode | CrawlMode | Crawl strategy. DETERMINISTIC: BFS all links (predictable). SEMANTIC: LLM-guided, prioritizes relevant pages (requires crawl_goal). | [optional] |
| crawl_goal | str | Goal for semantic crawling. Only used when crawl_mode=SEMANTIC. Example: 'Find all S3 API documentation and examples' | [optional] |
| render_strategy | RenderStrategy | How to render pages. AUTO (default): tries static, falls back to JS. STATIC: fast HTTP fetch. JAVASCRIPT: Playwright browser for SPAs. | [optional] |
| include_patterns | List[str] | Regex patterns for URLs to include. Example: ['/docs/', '/api/'] | [optional] |
| exclude_patterns | List[str] | Regex patterns for URLs to exclude. Example: ['/blog/', '\.pdf$'] | [optional] |
| chunk_strategy | ChunkStrategy | How to split page content. NONE: one chunk per page. SENTENCES/PARAGRAPHS: semantic boundaries. WORDS/CHARACTERS: fixed size chunks. | [optional] |
| document_id_strategy | DocumentIdStrategy | How to generate document IDs. URL (default): stable across re-crawls. POSITION: order-based. CONTENT: deduplicates identical content. | [optional] |
| generate_text_embeddings | bool | Generate E5 embeddings for text content. | [optional] [default to True] |
| generate_code_embeddings | bool | Generate Jina code embeddings for code blocks. | [optional] [default to True] |
| generate_image_embeddings | bool | Generate SigLIP embeddings for images/figures. | [optional] [default to True] |
| generate_structure_embeddings | bool | Generate DINOv2 visual structure embeddings for layout comparison. | [optional] [default to True] |
| max_retries | int | Maximum retry attempts for failed HTTP requests. Uses exponential backoff with jitter. Default: 3. | [optional] [default to 3] |
| retry_base_delay | float | Base delay in seconds for retry backoff. Actual delay = base * 2^attempt + jitter. Default: 1.0. | [optional] [default to 1] |
| retry_max_delay | float | Maximum delay in seconds between retries. Default: 30. | [optional] [default to 30] |
| respect_retry_after | bool | Respect Retry-After header from 429/503 responses. If False, uses exponential backoff instead. Default: True. | [optional] [default to True] |
| proxies | List[str] | List of proxy URLs for rotation. Supports formats: 'http://host:port', 'http://user:pass@host:port', 'socks5://host:port'. Proxies rotate on errors or every N requests. | [optional] |
| rotate_proxy_on_error | bool | Rotate to next proxy when request fails. Default: True. | [optional] [default to True] |
| rotate_proxy_every_n_requests | int | Rotate proxy every N requests (0 = disabled). Useful for avoiding IP-based rate limits. Default: 0 (disabled). | [optional] [default to 0] |
| captcha_service_provider | str | Captcha solving service provider: '2captcha', 'anti-captcha', 'capsolver'. If not set, captcha pages are skipped gracefully. | [optional] |
| captcha_service_api_key | str | API key for captcha solving service. Supports secret reference: '{{SECRET.captcha_api_key}}'. Required if captcha_service_provider is set. | [optional] |
| detect_captcha | bool | Detect captcha challenges (Cloudflare, reCAPTCHA, hCaptcha). If detected and no solver configured, page is skipped. Default: True. | [optional] [default to True] |
| persist_cookies | bool | Persist cookies across requests within a crawl session. Useful for sites requiring authentication. Default: True. | [optional] [default to True] |
| custom_headers | Dict[str, str] | Custom HTTP headers to include in all requests. Example: {'Authorization': 'Bearer token', 'X-Custom': 'value'} | [optional] |
| delay_between_requests | float | Delay in seconds between consecutive requests. Useful for polite crawling and avoiding rate limits. Default: 0 (no delay). | [optional] [default to 0] |
| target_segment_duration_ms | int | Target duration for video segments in milliseconds. | [optional] [default to 120000] |
| min_segment_duration_ms | int | Minimum duration for video segments in milliseconds. | [optional] [default to 30000] |
| segmentation_method | str | Video segmentation method: 'scene', 'srt', or 'time'. | [optional] [default to 'scene'] |
| use_whisper_asr | bool | Use Whisper ASR for transcription instead of SRT subtitles. | [optional] [default to True] |
| expand_to_granular_docs | bool | Expand each segment into multiple granular documents. | [optional] [default to True] |
| ocr_frames_per_segment | int | Number of frames to OCR per video segment. | [optional] [default to 3] |
| pdf_extraction_mode | str | How to extract PDF content: 'per_page' or 'per_element'. | [optional] [default to 'per_element'] |
| pdf_render_dpi | int | DPI for rendering PDF pages/elements as images. | [optional] [default to 150] |
| detect_code_in_pdf | bool | Whether to detect code blocks in PDF text. | [optional] [default to True] |
| segment_functions | bool | Whether to segment code files into individual functions. | [optional] [default to True] |
| supported_languages | List[str] | Programming languages to extract from code archives. | [optional] |
| run_code_embedding | bool | Generate Jina Code embeddings (768D) for code snippets. | [optional] [default to True] |
| run_visual_embedding | bool | Generate SigLIP visual embeddings (768D) for video frames. | [optional] [default to True] |
| run_structure_embedding | bool | Generate DINOv2 visual structure embeddings (768D) for layout comparison. | [optional] [default to True] |
| visual_embedding_use_case | str | Content type preset for visual embedding strategy. | [optional] [default to 'lecture'] |
| extract_screen_text | bool | Run OCR on video frames to extract on-screen text. | [optional] [default to True] |
| run_vlm_frame_analysis | bool | Run VLM on video frame thumbnails to extract structured fields: frame_type, page_context, ui_labels, workflow_steps, config_options. Enables drift detection and UI comparison use cases. | [optional] [default to False] |
| enrich_with_llm | bool | Use Gemini to generate summaries and enhance descriptions. | [optional] [default to False] |
| llm_prompt | str | Prompt for LLM enrichment when enrich_with_llm=True. | [optional] [default to 'Summarize this educational content segment, highlighting key concepts.'] |
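The retry parameters above (max_retries, retry_base_delay, retry_max_delay) describe an exponential-backoff schedule: delay = base * 2^attempt + jitter, capped at the maximum delay. A minimal sketch of that schedule, using the documented defaults (this is an illustration of the formula, not the SDK's internal retry code):

```python
def backoff_delays(max_retries=3, base=1.0, max_delay=30.0, jitter=0.0):
    """Delays between retries: base * 2**attempt + jitter, capped at max_delay."""
    return [min(base * 2 ** attempt + jitter, max_delay)
            for attempt in range(max_retries)]

# With the documented defaults (max_retries=3, retry_base_delay=1.0):
print(backoff_delays())  # [1.0, 2.0, 4.0]
```

Note that when respect_retry_after is True, a Retry-After header on a 429/503 response takes precedence over this computed delay.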
```python
from mixpeek.models.parameters import Parameters

# TODO update the JSON string below
json_str = "{}"

# Create an instance of Parameters from a JSON string
parameters_instance = Parameters.from_json(json_str)

# Print the JSON string representation of the object
print(parameters_instance.to_json())

# Convert the object into a dict
parameters_dict = parameters_instance.to_dict()

# Create an instance of Parameters from a dict
parameters_from_dict = Parameters.from_dict(parameters_dict)
```
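To see how chunk_size and chunk_overlap interact, here is an illustrative character-based splitter using the documented defaults (500/50). The extractor's actual splitting depends on split_by and may use sentence or paragraph boundaries; this sketch only shows the sliding-window overlap:

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into chunk_size-character windows, each overlapping
    the previous window by chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(text)
print(len(chunks))  # 3 windows: 0-500, 450-950, 900-1200
```

The last 50 characters of each chunk repeat as the first 50 of the next, which is what preserves context across chunk boundaries during embedding.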