StageParamsDocumentEnrich

Configuration for enriching documents with data from another collection.

Stage Category: APPLY (1-1 Join/Enrichment)

Transformation: N documents → N documents (same count, expanded schema)

Purpose: Applies each input document to a lookup operation in another collection and merges matching data back. This is a JOIN-like operation similar to a SQL LEFT JOIN. With the default allow_missing=True, each input document produces exactly one output document, with added fields when a match is found.

When to Use:
- After FILTER/SORT to add related reference data
- To combine data from multiple collections (e.g., products + catalog info)
- When documents need contextual information from other sources
- For denormalizing data at query time instead of storage time
- To attach user profiles, metadata, or related entities

When NOT to Use:
- For initial document retrieval (use FILTER stages: hybrid_search)
- For removing documents (use FILTER stages)
- For reordering results (use SORT stages)
- When the target collection is very large (performance impact)
- For 1-N joins that expand document count (use taxonomy with multi-match)

Operational Behavior:
- Applies each input document to a collection lookup (1-1 operation)
- Performs a database lookup for each document (MongoDB queries)
- Maintains document count: N in → N out (with the default allow_missing=True; with allow_missing=False, documents without a match are filtered out)
- Expands schema: adds fields from the target collection
- Moderate performance (depends on target collection size and indexes)
- Left join semantics: missing matches result in null/absent fields

Common Pipeline Position: FILTER → SORT → APPLY (this stage)

Join Operation: This is a LEFT JOIN. All source documents are kept, and enrichment fields are added when matches are found. Missing matches result in null/absent fields rather than document removal.

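
The left-join behavior described above can be illustrated with a small in-memory sketch. This is not the service's implementation: the function name `enrich_documents` is hypothetical, it only handles top-level (non-nested) fields, and it assumes that fields already present on the source document are not overwritten by the match.

```python
from typing import Any, Dict, List


def enrich_documents(
    docs: List[Dict[str, Any]],
    target_docs: List[Dict[str, Any]],
    source_field: str,
    target_field: str,
    allow_missing: bool = True,
) -> List[Dict[str, Any]]:
    """Illustrative in-memory 1-1 join over plain dicts (hypothetical helper)."""
    # Index the target collection by its join key; the first match wins (1-1).
    index: Dict[Any, Dict[str, Any]] = {}
    for target in target_docs:
        if target_field in target:
            index.setdefault(target[target_field], target)

    enriched: List[Dict[str, Any]] = []
    for doc in docs:
        key = doc.get(source_field)
        match = index.get(key) if key is not None else None
        if match is None:
            if allow_missing:
                # Left join: the unmatched document passes through unchanged.
                enriched.append(dict(doc))
            # allow_missing=False: the unmatched document is dropped.
            continue
        merged = dict(doc)
        for name, value in match.items():
            # Assumption: existing source fields take precedence over the match.
            merged.setdefault(name, value)
        enriched.append(merged)
    return enriched
```

With `allow_missing=True` the output always has as many documents as the input; passing `allow_missing=False` turns the same lookup into an inner-join-like filter.
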
Requirements:
- target_collection_id: REQUIRED for direct joins, collection to join with
- source_field: REQUIRED for direct joins, field in current documents to match
- target_field: REQUIRED for direct joins, field in target collection to match against
- fields_to_merge: OPTIONAL, specific fields to merge (defaults to the entire document)
- output_field: OPTIONAL, where to place enrichment (root or nested path)

Use Cases:
- Enrich product search results with full catalog data
- Add user profile information to activity logs
- Join cluster assignments with detailed metadata
- Attach reference data (categories, taxonomies) to documents
- Combine fragmented data across collections

Examples:

Basic field-based join:

    {
      "target_collection_id": "col_products",
      "source_field": "metadata.product_id",
      "target_field": "product_id",
      "fields_to_merge": ["name", "price", "category"]
    }

Nested field join with custom output:

    {
      "target_collection_id": "col_users",
      "source_field": "lineage.source_object_id",
      "target_field": "user_id",
      "output_field": "enrichments.user_profile",
      "fields_to_merge": ["name", "email", "role"]
    }

Conditional enrichment (only for specific categories):

    {
      "target_collection_id": "col_catalog",
      "source_field": "metadata.sku",
      "target_field": "sku",
      "fields_to_merge": ["description", "specs"],
      "when": {
        "field": "metadata.category",
        "operator": "eq",
        "value": "electronics"
      }
    }
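
To make the dot-path parameters in the examples above concrete, the following sketch shows one way `source_field`, `fields_to_merge`, and `output_field` could be interpreted on plain dictionaries. The helpers `get_path`, `set_path`, and `merge_match` are hypothetical names introduced for illustration. The sketch assumes root-level merges do not overwrite existing fields and that a dotted entry in `fields_to_merge` is stored under its literal dotted name; the actual stage may behave differently.

```python
import copy
from typing import Any, Dict, List, Optional

_MISSING = object()  # sentinel to tell "absent" apart from a stored None


def get_path(doc: Dict[str, Any], path: str, default: Any = None) -> Any:
    """Read a dot-path such as 'metadata.product_id' from nested dicts."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def set_path(doc: Dict[str, Any], path: str, value: Any) -> None:
    """Write a value at a dot-path, creating intermediate dicts as needed.

    Assumes any existing intermediate values are themselves dicts.
    """
    parts = path.split(".")
    current = doc
    for part in parts[:-1]:
        current = current.setdefault(part, {})
    current[parts[-1]] = value


def merge_match(
    doc: Dict[str, Any],
    match: Dict[str, Any],
    fields_to_merge: Optional[List[str]] = None,
    output_field: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of doc with (selected fields of) match merged in."""
    result = copy.deepcopy(doc)
    if fields_to_merge is None:
        selected = dict(match)  # no selection: take the entire target document
    else:
        selected = {}
        for field in fields_to_merge:
            value = get_path(match, field, _MISSING)
            if value is not _MISSING:
                selected[field] = value
    if output_field is None:
        for name, value in selected.items():
            result.setdefault(name, value)  # assumption: existing fields win
    else:
        set_path(result, output_field, selected)  # namespaced enrichment
    return result
```

Placing enrichment under a path such as `enrichments.user_profile` avoids name collisions with fields that already exist on the source document.
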

Properties

| Name | Type | Description | Notes |
| --- | --- | --- | --- |
| cache_behavior | StageDefsStageCacheBehavior | Controls internal caching behavior for this stage. OPTIONAL; defaults to 'auto' for transparent performance. 'auto' (default): automatic caching for deterministic operations. The stage caches results based on inputs and parameters. Use for transformations, parsing, formatting, and stable API calls. The cache invalidates automatically when parameters change. Recommended for 95% of use cases. 'disabled': skips all internal caching; every execution runs fresh without a cache lookup. Use for templates with now(), random(), or external APIs that must be called every time (real-time data). No performance benefit, but guarantees fresh execution. 'aggressive': caches even non-deterministic operations. Use ONLY when you fully understand the caching implications; it may cache time-sensitive or random data. Generally not recommended; prefer 'auto' or 'disabled'. Note: this controls internal stage caching. Retriever-level caching (cache_config.cache_stage_names) is separate and caches complete stage outputs. | [optional] |
| cache_ttl_seconds | int | Time-to-live for cache entries, in seconds. OPTIONAL; defaults to None (LRU eviction only). When None (default, recommended): the cache uses the Redis LRU eviction policy, so recently used items stay cached automatically. No manual TTL management is needed, and memory is bounded by the Redis maxmemory setting. When specified: cache entries expire after this duration regardless of usage. Useful for data that becomes stale after a specific period; use lower values for frequently changing external data and higher values for stable transformations. Examples: None = LRU-based eviction (recommended for most cases); 300 = 5 minutes (semi-static external data); 3600 = 1 hour (stable transformations); 86400 = 24 hours (rarely changing operations). Performance note: a TTL adds minimal overhead (<1 ms) but forces eviction even for frequently accessed items. Use None unless you have specific staleness requirements. | [optional] |
| retriever_id | str | ID of an existing retriever to use for finding enrichment data. When provided, the stage uses the full retriever pipeline (semantic search, filters, etc.) instead of simple field matching. Mutually exclusive with retriever_config. | [optional] [default to null] |
| retriever_config | Dict[str, object] | Anonymous retriever definition for finding enrichment data. Allows defining a custom retriever inline without creating it separately. Mutually exclusive with retriever_id. Must have a 'stages' array with at least one stage. | [optional] |
| retriever_inputs | Dict[str, object] | Template mapping from source document fields to retriever inputs. Supports the template syntax {{DOC.field_name}} to reference source document fields. Used when retriever_id or retriever_config is specified. | [optional] |
| target_collection_id | str | Collection ID to fetch enrichment data from. REQUIRED for direct joins (when retriever_id/retriever_config is not provided). Also used to scope retriever queries when a retriever-based join is used. NOTE: you must replace the default placeholder with your actual collection ID. | [optional] [default to '{{COLLECTION_ID}}'] |
| source_field | str | Dot-path to the field in the current document to match on. REQUIRED for direct joins (when retriever_id/retriever_config is not provided). For retriever-based joins, use retriever_inputs instead. | [optional] [default to 'source_object_id'] |
| target_field | str | Field in the target collection to match against. REQUIRED for direct joins (when retriever_id/retriever_config is not provided). Ignored for retriever-based joins. | [optional] [default to 'document_id'] |
| fields_to_merge | List[str] | Specific fields from the target document to merge. If None, merges the entire document. Supports dot-notation for nested fields. | [optional] [default to null] |
| output_field | str | Dot-path where enrichment data should be placed. If None, merges directly into the document root. Use 'enrichments.{name}' to namespace enrichments. | [optional] [default to null] |
| strategy | str | How to handle the merge: 'enrich' = add fields to the existing document; 'replace' = replace the document with the enriched version; 'append' = add as an array item. | [optional] [default to 'enrich'] |
| when | StageDefsLogicalOperator | Conditional filter that determines which documents are enriched. Documents not matching the condition pass through unchanged. | [optional] |
| allow_missing | bool | If True, documents without matching enrichment data pass through unchanged. If False, documents without matches are filtered out. | [optional] [default to True] |
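
The retriever-based options in the table can be sketched as follows. This is a minimal illustration, not the library's validation or templating code: `validate_join_mode` and `render_inputs` are hypothetical helpers, and the only template form assumed is the `{{DOC.field_name}}` syntax named above (optional whitespace inside the braces and empty-string substitution for missing fields are additional assumptions).

```python
import re
from typing import Any, Dict

# Matches {{DOC.some.path}}; whitespace inside the braces is an assumption.
_DOC_TEMPLATE = re.compile(r"\{\{\s*DOC\.([A-Za-z0-9_.]+)\s*\}\}")


def validate_join_mode(params: Dict[str, Any]) -> str:
    """Return 'retriever' or 'direct' after checking the documented constraints."""
    has_id = params.get("retriever_id") is not None
    has_config = params.get("retriever_config") is not None
    if has_id and has_config:
        raise ValueError("retriever_id and retriever_config are mutually exclusive")
    if has_config and not params["retriever_config"].get("stages"):
        raise ValueError("retriever_config must have a 'stages' array with at least one stage")
    return "retriever" if (has_id or has_config) else "direct"


def _lookup(doc: Dict[str, Any], path: str) -> Any:
    """Resolve a dot-path in nested dicts; returns None when absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def render_inputs(retriever_inputs: Dict[str, Any], doc: Dict[str, Any]) -> Dict[str, Any]:
    """Fill {{DOC.*}} placeholders in retriever_inputs from one source document."""
    rendered: Dict[str, Any] = {}
    for name, template in retriever_inputs.items():
        if not isinstance(template, str):
            rendered[name] = template  # non-string inputs are passed through
            continue
        whole = _DOC_TEMPLATE.fullmatch(template.strip())
        if whole:
            # A lone placeholder keeps the field's original type (e.g. int).
            rendered[name] = _lookup(doc, whole.group(1))
            continue

        def substitute(match: "re.Match[str]") -> str:
            value = _lookup(doc, match.group(1))
            return "" if value is None else str(value)

        rendered[name] = _DOC_TEMPLATE.sub(substitute, template)
    return rendered
```

Checking the join mode first makes it clear whether `source_field`/`target_field` or `retriever_inputs` drive the lookup for a given configuration.
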

Example

    from mixpeek.models.stage_params_document_enrich import StageParamsDocumentEnrich

    # TODO: replace the JSON string below with your stage parameters
    json_str = "{}"
    # create an instance of StageParamsDocumentEnrich from a JSON string
    stage_params_document_enrich_instance = StageParamsDocumentEnrich.from_json(json_str)
    # print the JSON string representation of the object
    # (to_json is called on the instance, not on the class)
    print(stage_params_document_enrich_instance.to_json())

    # convert the object into a dict
    stage_params_document_enrich_dict = stage_params_document_enrich_instance.to_dict()
    # create an instance of StageParamsDocumentEnrich from a dict
    stage_params_document_enrich_from_dict = StageParamsDocumentEnrich.from_dict(stage_params_document_enrich_dict)
