Parameters

Parameters for the feature extractor. Each extractor type has specific parameters. See the schema for your chosen extractor (e.g., MultimodalExtractorParams for multimodal_extractor).

Properties

Name	Type	Description	Notes
extractor_type	str	Custom plugin extractor type (plugin name)
source_type	str	Source content type. Use 'youtube' to resolve YouTube URLs to caption text before embedding. Default: 'text' (plain text input).	[optional] [default to 'text']
split_by	TextSplitStrategy	Strategy for splitting text into multiple documents.	[optional]
chunk_size	int	Target size for each chunk (in units of chunk_strategy).	[optional] [default to 500]
chunk_overlap	int	Overlap between chunks to preserve context.	[optional] [default to 50]
segment_length_seconds	int	Length of each transcript segment in seconds (for time_segments split strategy). Shorter segments give more precise search results but more documents.	[optional] [default to 120]
language	str	Preferred language code for YouTube captions (when source_type='youtube').	[optional] [default to 'en']
extract_captions	bool	Extract auto-captions or manual subtitles from YouTube videos (when source_type='youtube'). Falls back to video description if False.	[optional] [default to True]
response_shape	ResponseShape3		[optional]
llm_provider	str	LLM provider for structured extraction: openai, google, anthropic	[optional]
llm_model	str	LLM model for structured extraction.	[optional]
llm_api_key	str	API key for LLM operations (BYOK, Bring Your Own Key). Accepts either a direct key ('sk-proj-abc123...') or a secret reference ('{{SECRET.openai_api_key}}'). A secret reference is resolved from your organization's secrets vault at runtime; store secrets via POST /v1/organizations/secrets. If no key is provided, Mixpeek's default API keys are used.	[optional]
split_method	SplitMethod	The PRIMARY control for video splitting strategy. This determines which splitting method is used.	[optional]
description_prompt	str	The prompt to use for description generation.	[optional] [default to 'Describe the video segment in detail.']
time_split_interval	int	Interval in seconds for 'time' splitting. Used when split_method='time'.	[optional] [default to 10]
silence_db_threshold	int	The decibel level below which audio is considered silent. Used when split_method='silence'. Recommended value: -40 (auto-applied if not specified). Lower values (e.g., -50) count only quieter audio as silence and so detect less of it; higher values (e.g., -30) detect more.	[optional]
scene_detection_threshold	float	Scene detection sensitivity (0.0-1.0).	[optional] [default to 0.3]
run_transcription	bool	Whether to run transcription on video segments.	[optional] [default to False]
transcription_language	str	The language of the transcription. Used when run_transcription is True.	[optional] [default to 'en']
run_video_description	bool	Whether to generate descriptions for video segments. Defaults to False because description generation adds 1-2 minutes of processing time; enable only when needed.	[optional] [default to False]
run_transcription_embedding	bool	Whether to generate embeddings for transcriptions. Useful for semantic search over spoken content.	[optional] [default to False]
run_multimodal_embedding	bool	Whether to generate multimodal embeddings for all content types (video/image/gif/text). Uses Google Vertex AI to create unified 1408D embeddings in a shared semantic space. Useful for cross-modal semantic search across all media types.	[optional] [default to True]
run_ocr	bool	Whether to run OCR to extract text from video frames. Defaults to False because OCR adds significant processing time; enable only when text extraction from video is required.	[optional] [default to False]
sensitivity	str	The sensitivity of the scene detection.	[optional] [default to 'low']
enable_thumbnails	bool	Whether to generate thumbnail images.	[optional] [default to True]
use_cdn	bool	Use CDN for thumbnail delivery.	[optional] [default to False]
generation_config	GenerationConfig		[optional]
detection_model	str	SCRFD model for face detection. 'scrfd_500m': Fastest (2-3ms). 'scrfd_2.5g': Balanced (5-7ms), recommended. 'scrfd_10g': Highest accuracy (10-15ms).	[optional] [default to 'scrfd_2.5g']
min_face_size	int	Minimum face size in pixels to detect. 20px: Balanced. 40px: Higher quality. 10px: Maximum recall.	[optional] [default to 20]
detection_threshold	float	Confidence threshold for face detection (0.0-1.0).	[optional] [default to 0.5]
max_faces_per_image	int	Maximum number of faces to process per image. None: Process all.	[optional]
normalize_embeddings	bool	L2-normalize embeddings to unit vectors (recommended).	[optional] [default to True]
enable_quality_scoring	bool	Compute quality scores (blur, size, landmarks). Adds ~5ms per face.	[optional] [default to True]
quality_threshold	float	Minimum quality score to index faces. None: Index all faces. 0.5: Moderate filtering. 0.7: High quality only.	[optional]
max_video_length	int	Maximum video length in seconds. 60: Default. 10: Recommended for retrieval. 300: Maximum (extraction only).	[optional] [default to 60]
video_sampling_fps	float	Frames per second to sample from video. 1.0: One frame per second (recommended).	[optional] [default to 1]
video_deduplication	bool	Remove duplicate faces across video frames (extraction only). Reduces redundancy by 90-95%. Not used in retrieval.	[optional] [default to True]
video_deduplication_threshold	float	Cosine similarity threshold for deduplication. 0.8: Conservative (default).	[optional] [default to 0.8]
output_mode	str	'per_face': One document per face (recommended). 'per_image': One doc per image with faces array.	[optional] [default to 'per_face']
include_face_crops	bool	Include aligned 112×112 face crops as base64. Adds ~5KB per face.	[optional] [default to False]
include_source_frame_thumbnail	bool	Include resized source frame/image as base64 thumbnail (~15-30KB per face). Used for display with bounding box overlay.	[optional] [default to False]
store_detection_metadata	bool	Store bbox, landmarks, detection scores. Recommended for debugging.	[optional] [default to True]
use_layout_detection	bool	Enable ML-based layout detection to find ALL document elements (text, images, tables, figures). When enabled, uses the configured layout_detector to detect and extract both text regions AND non-text elements (scanned images, figures, charts) as separate documents. Recommended for: Scanned documents, image-heavy PDFs, mixed content documents. When disabled: Falls back to text-only extraction (faster but misses images). Default: True (detects all elements including images).	[optional] [default to True]
layout_detector	str	Layout detection engine to use when use_layout_detection=True. 'pymupdf': Fast, rule-based detection using PyMuPDF heuristics (~15 pages/sec). 'docling': SOTA ML-based detection using IBM Docling with DiT model (~3-8 sec/doc). Docling advantages: Better semantic type detection (section_header vs paragraph), true table structure extraction (rows/cols), more accurate figure detection. PyMuPDF advantages: Much faster, lower memory usage, simpler dependencies. Default: 'pymupdf' for speed. Use 'docling' for accuracy-critical applications.	[optional] [default to 'pymupdf']
vertical_threshold	float	Maximum vertical gap (in points) between lines to be grouped in same block. Increase for looser grouping, decrease for tighter blocks. Default 15pt works well for standard documents.	[optional] [default to 15]
horizontal_threshold	float	Maximum horizontal distance (in points) for overlap detection. Affects column detection and block merging. Increase for wider columns, decrease for narrow layouts.	[optional] [default to 50]
min_text_length	int	Minimum text length (characters) to keep a block. Blocks with less text are filtered out. Helps remove noise and tiny fragments.	[optional] [default to 20]
base_confidence	float	Base confidence score for embedded (native) text. Penalties are subtracted for OCR artifacts, encoding issues, etc.	[optional] [default to 0.85]
min_confidence_for_vlm	float	Confidence threshold below which VLM correction is triggered. Blocks with confidence < this value get sent to VLM for correction. Only applies when use_vlm_correction=True.	[optional] [default to 0.6]
use_vlm_correction	bool	Enable VLM (Vision Language Model) correction for low-confidence blocks. Uses Gemini/GPT-4V to correct OCR errors by analyzing the page image. Significantly slower (~1 page/sec) but improves accuracy for degraded docs.	[optional] [default to True]
fast_mode	bool	Skip VLM correction entirely for maximum throughput (~15 pages/sec). Overrides use_vlm_correction. Use when speed is more important than accuracy.	[optional] [default to False]
vlm_provider	str	VLM provider: 'google' (Gemini API) or 'vllm' (local GPU with Qwen2.5-VL).	[optional] [default to 'google']
vlm_model	str	VLM model. For google: 'gemini-2.5-flash'. For vllm: 'Qwen/Qwen2.5-VL-7B-Instruct'.	[optional] [default to 'gemini-2.5-flash']
run_text_embedding	bool	Generate E5 text embeddings (1024D) for transcripts and text.	[optional] [default to True]
render_dpi	int	DPI for page rendering (used for VLM correction). 72: Fast, lower quality. 150: Balanced (recommended). 300: High quality, slower.	[optional] [default to 150]
generate_thumbnails	bool	Generate thumbnail images for each learning unit.	[optional] [default to True]
thumbnail_mode	str	Thumbnail generation mode. 'full_page': Low-res thumbnail of entire page. 'segment': Cropped thumbnail of just the block's bounding box. 'both': Generate both types (recommended for flexibility).	[optional] [default to 'both']
thumbnail_dpi	int	DPI for thumbnail generation. Lower DPI = smaller files. 72: Standard web quality. 36: Very small thumbnails.	[optional] [default to 72]
model_name	str	HuggingFace model name for sentiment classification	[optional] [default to 'distilbert-base-uncased-finetuned-sst-2-english']
max_length	int	Maximum token length	[optional] [default to 512]
batch_size	int	Inference batch size	[optional] [default to 32]
return_all_scores	bool	Return scores for all classes, not just top	[optional] [default to True]
embed	bool	Generate E5 embeddings for semantic retrieval alongside classification. Uses the internal E5 embedding service for 1024-dimensional vectors.	[optional] [default to False]
max_depth	int	Maximum link depth to crawl. 0=seed page only, 1=seed+direct links, etc. Default: 2. Increase for comprehensive crawls, decrease for targeted extraction.	[optional] [default to 2]
max_pages	int	Maximum pages to crawl. Default: 50. Set higher (1000+) for large documentation sites. Max: 1,000,000.	[optional] [default to 50]
crawl_timeout	int	Maximum total time for crawling in seconds. Default: 300 (5 minutes). Increase for large sites with many pages. Max: 3600 (1 hour).	[optional] [default to 300]
crawl_mode	CrawlMode	Crawl strategy. DETERMINISTIC: BFS all links (predictable). SEMANTIC: LLM-guided, prioritizes relevant pages (requires crawl_goal).	[optional]
crawl_goal	str	Goal for semantic crawling. Only used when crawl_mode=SEMANTIC. Example: 'Find all S3 API documentation and examples'	[optional]
render_strategy	RenderStrategy	How to render pages. AUTO (default): tries static, falls back to JS. STATIC: fast HTTP fetch. JAVASCRIPT: Playwright browser for SPAs.	[optional]
include_patterns	List[str]	Regex patterns for URLs to include. Example: ['/docs/', '/api/']	[optional]
exclude_patterns	List[str]	Regex patterns for URLs to exclude. Example: ['/blog/', '\.pdf$']	[optional]
chunk_strategy	ChunkStrategy	How to split page content. NONE: one chunk per page. SENTENCES/PARAGRAPHS: semantic boundaries. WORDS/CHARACTERS: fixed size chunks.	[optional]
document_id_strategy	DocumentIdStrategy	How to generate document IDs. URL (default): stable across re-crawls. POSITION: order-based. CONTENT: deduplicates identical content.	[optional]
generate_text_embeddings	bool	Generate E5 embeddings for text content.	[optional] [default to True]
generate_code_embeddings	bool	Generate Jina code embeddings for code blocks.	[optional] [default to True]
generate_image_embeddings	bool	Generate SigLIP embeddings for images/figures.	[optional] [default to True]
generate_structure_embeddings	bool	Generate DINOv2 visual structure embeddings for layout comparison.	[optional] [default to True]
max_retries	int	Maximum retry attempts for failed HTTP requests. Uses exponential backoff with jitter. Default: 3.	[optional] [default to 3]
retry_base_delay	float	Base delay in seconds for retry backoff. Actual delay = base * 2^attempt + jitter. Default: 1.0.	[optional] [default to 1]
retry_max_delay	float	Maximum delay in seconds between retries. Default: 30.	[optional] [default to 30]
respect_retry_after	bool	Respect Retry-After header from 429/503 responses. If False, uses exponential backoff instead. Default: True.	[optional] [default to True]
proxies	List[str]	List of proxy URLs for rotation. Supports formats: 'http://host:port', 'http://user:pass@host:port', 'socks5://host:port'. Proxies rotate on errors or every N requests.	[optional]
rotate_proxy_on_error	bool	Rotate to next proxy when request fails. Default: True.	[optional] [default to True]
rotate_proxy_every_n_requests	int	Rotate proxy every N requests (0 = disabled). Useful for avoiding IP-based rate limits. Default: 0 (disabled).	[optional] [default to 0]
captcha_service_provider	str	Captcha solving service provider: '2captcha', 'anti-captcha', 'capsolver'. If not set, captcha pages are skipped gracefully.	[optional]
captcha_service_api_key	str	API key for captcha solving service. Supports secret reference: '{{SECRET.captcha_api_key}}'. Required if captcha_service_provider is set.	[optional]
detect_captcha	bool	Detect captcha challenges (Cloudflare, reCAPTCHA, hCaptcha). If detected and no solver configured, page is skipped. Default: True.	[optional] [default to True]
persist_cookies	bool	Persist cookies across requests within a crawl session. Useful for sites requiring authentication. Default: True.	[optional] [default to True]
custom_headers	Dict[str, str]	Custom HTTP headers to include in all requests. Example: {'Authorization': 'Bearer token', 'X-Custom': 'value'}	[optional]
delay_between_requests	float	Delay in seconds between consecutive requests. Useful for polite crawling and avoiding rate limits. Default: 0 (no delay).	[optional] [default to 0]
target_segment_duration_ms	int	Target duration for video segments in milliseconds.	[optional] [default to 120000]
min_segment_duration_ms	int	Minimum duration for video segments in milliseconds.	[optional] [default to 30000]
segmentation_method	str	Video segmentation method: 'scene', 'srt', or 'time'.	[optional] [default to 'scene']
use_whisper_asr	bool	Use Whisper ASR for transcription instead of SRT subtitles.	[optional] [default to True]
expand_to_granular_docs	bool	Expand each segment into multiple granular documents.	[optional] [default to True]
ocr_frames_per_segment	int	Number of frames to OCR per video segment.	[optional] [default to 3]
pdf_extraction_mode	str	How to extract PDF content: 'per_page' or 'per_element'.	[optional] [default to 'per_element']
pdf_render_dpi	int	DPI for rendering PDF pages/elements as images.	[optional] [default to 150]
detect_code_in_pdf	bool	Whether to detect code blocks in PDF text.	[optional] [default to True]
segment_functions	bool	Whether to segment code files into individual functions.	[optional] [default to True]
supported_languages	List[str]	Programming languages to extract from code archives.	[optional]
run_code_embedding	bool	Generate Jina Code embeddings (768D) for code snippets.	[optional] [default to True]
run_visual_embedding	bool	Generate SigLIP visual embeddings (768D) for video frames.	[optional] [default to True]
run_structure_embedding	bool	Generate DINOv2 visual structure embeddings (768D) for layout comparison.	[optional] [default to True]
visual_embedding_use_case	str	Content type preset for visual embedding strategy.	[optional] [default to 'lecture']
extract_screen_text	bool	Run OCR on video frames to extract on-screen text.	[optional] [default to True]
run_vlm_frame_analysis	bool	Run VLM on video frame thumbnails to extract structured fields: frame_type, page_context, ui_labels, workflow_steps, config_options. Enables drift detection and UI comparison use cases.	[optional] [default to False]
enrich_with_llm	bool	Use Gemini to generate summaries and enhance descriptions.	[optional] [default to False]
llm_prompt	str	Prompt for LLM enrichment when enrich_with_llm=True.	[optional] [default to 'Summarize this educational content segment, highlighting key concepts.']
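
The retry parameters in the table (max_retries, retry_base_delay, retry_max_delay) describe exponential backoff with jitter, with the delay given as base * 2^attempt + jitter. The following is a minimal sketch of that formula, not the library's implementation. It assumes a 0-based attempt counter, a jitter of up to 10% of the base delay, and that retry_max_delay caps the final value; none of these details are stated in the table, and the function name `retry_delay` is hypothetical.

```python
import random


def retry_delay(attempt, base_delay=1.0, max_delay=30.0,
                jitter_fraction=0.1, rng=random.random):
    """Return the wait in seconds before retry number `attempt` (0-based).

    Assumptions (not confirmed by the reference table): jitter is uniform in
    [0, jitter_fraction * base_delay) and the cap is applied after jitter.
    """
    backoff = base_delay * (2 ** attempt)           # base * 2^attempt
    jitter = rng() * jitter_fraction * base_delay   # small random offset
    return min(backoff + jitter, max_delay)         # never exceed the cap


# With jitter disabled, the delays double until they reach the cap.
no_jitter = [retry_delay(a, rng=lambda: 0.0) for a in range(6)]
print(no_jitter)
```

With the defaults from the table (base 1.0, cap 30), the sixth attempt would compute 32 seconds and is held at the 30-second cap.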

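The include_patterns and exclude_patterns rows in the table describe regex filters on crawled URLs. The sketch below shows one plausible way such filters combine, using the example patterns from the table. It assumes that exclusion wins over inclusion, that an empty include list allows everything, and that patterns are matched anywhere in the URL with `re.search`; the crawler's actual precedence rules are not documented here, and `url_allowed` is a hypothetical helper.

```python
import re


def url_allowed(url, include_patterns=None, exclude_patterns=None):
    """Decide whether a URL passes the include/exclude regex filters.

    Assumed semantics: any exclude match rejects the URL; otherwise the URL
    must match at least one include pattern, if any are given.
    """
    if any(re.search(p, url) for p in (exclude_patterns or [])):
        return False
    if include_patterns:
        return any(re.search(p, url) for p in include_patterns)
    return True


include = [r"/docs/", r"/api/"]
exclude = [r"/blog/", r"\.pdf$"]
urls = [
    "https://example.com/docs/intro",
    "https://example.com/docs/guide.pdf",
    "https://example.com/blog/post",
    "https://example.com/api/v1/search",
    "https://example.com/about",
]
kept = [u for u in urls if url_allowed(u, include, exclude)]
print(kept)
```

Under these assumptions only the /docs/intro and /api/v1/search URLs are kept: the PDF and blog URLs hit an exclude pattern, and /about matches no include pattern.
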
Example

from mixpeek.models.parameters import Parameters

# TODO update the JSON string below
json_str = "{}"
# create an instance of Parameters from a JSON string
parameters_instance = Parameters.from_json(json_str)
# print the JSON string representation of the object
print(parameters_instance.to_json())

# convert the object into a dict
parameters_dict = parameters_instance.to_dict()
# create an instance of Parameters from a dict
parameters_from_dict = Parameters.from_dict(parameters_dict)
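
The "{}" placeholder in the example above can be replaced with a populated JSON string. The sketch below builds one with the standard library only, using field names and values taken from the properties table (multimodal_extractor, split_method='time'). Whether this exact combination validates against a given version of the mixpeek package is an assumption, so the `Parameters.from_json` call is left as a comment.

```python
import json

# Field names come from the properties table; the chosen values are illustrative.
params = {
    "extractor_type": "multimodal_extractor",
    "split_method": "time",          # split video at fixed intervals
    "time_split_interval": 10,       # seconds, used when split_method='time'
    "run_transcription": True,
    "transcription_language": "en",
    "run_multimodal_embedding": True,
}

payload = json.dumps(params)
print(payload)

# The string could then be passed to the generated model (not run here):
# parameters_instance = Parameters.from_json(payload)

# Round-trip check with the standard library only.
restored = json.loads(payload)
```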
