StageParamsGroupBy

Configuration for grouping documents by field value.

Stage Category: REDUCE

Transformation: N documents → M groups (where M ≤ N)

Purpose: Groups documents by a common field value, aggregating chunks back to their parent objects. Essential for decompose/recompose workflows, where chunks are searched individually and then grouped to show context.

Performance: Runs in the API layer (fast stage, ~10-50ms for 100-500 docs). Integrates with the optimizer for filter push-down before grouping. A future optimization will push grouping into Qdrant for a 10-100x speedup.

When to Use:
- After chunk-level search, to group results back to objects
- To deduplicate results by a field
- To aggregate related documents
- For decompose→search→recompose workflows
- To show the top N results per category/author/parent

When NOT to Use:
- For initial document retrieval (use FILTER stages: hybrid_search)
- For ordering documents (use SORT stages: sort_relevance)
- For enriching documents (use APPLY stages: document_enrich)
- For expanding documents (use APPLY 1-N stages: taxonomy_enrich)

Operational Behavior:
- Fast stage: runs in the API layer (no Engine delegation)
- In-memory grouping: Python dict-based grouping
- Groups documents that share the same field value
- Sorts within groups by score (highest first)
- Limits documents per group (configurable)
- Reports metrics to ClickHouse for learned optimizations

Common Pipeline Position: FILTER → SORT → REDUCE (this stage)

Requirements:
- group_by_field: REQUIRED (note that the Properties table lists it as optional with a default of 'source_object_id')
- max_per_group: OPTIONAL, defaults to 10
- output_mode: OPTIONAL, defaults to "all"

Use Cases:
- Decompose/recompose: Search 50 scenes, group to 10 videos
- Deduplication: Group by unique_id, keep the top match
- Analytics: Group by category, show top docs per category
- Multi-tier results: Show the top 3 products per brand

Examples:
- Group video scenes back to parent videos
- Deduplicate search results by product_id
- Show the top 3 articles per author
- Display the best match per category
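
The operational behavior described above (dict-based grouping, sorting within each group by score, then capping the group size) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the stage's actual implementation: the function name `group_documents`, the `score` key, and the choice to skip documents that lack the grouping field are all hypothetical.

```python
from collections import defaultdict


def group_documents(docs, group_by_field="source_object_id", max_per_group=10):
    """Group docs by a top-level field, sort each group by score, cap its size.

    Hypothetical helper for illustration; the real stage may differ.
    """
    groups = defaultdict(list)
    for doc in docs:
        # Assumption: documents without the grouping field are skipped.
        if group_by_field in doc:
            groups[doc[group_by_field]].append(doc)

    result = {}
    for key, members in groups.items():
        # Highest score first, then keep at most max_per_group documents.
        members.sort(key=lambda d: d.get("score", 0.0), reverse=True)
        result[key] = members[:max_per_group]
    return result


# Made-up scene chunks that belong to two parent videos.
scenes = [
    {"id": "s1", "source_object_id": "video_a", "score": 0.62},
    {"id": "s2", "source_object_id": "video_b", "score": 0.91},
    {"id": "s3", "source_object_id": "video_a", "score": 0.88},
    {"id": "s4", "source_object_id": "video_a", "score": 0.40},
]

grouped = group_documents(scenes, max_per_group=2)
for parent, members in grouped.items():
    print(parent, [m["id"] for m in members])
```

With `max_per_group=2`, the lowest-scoring scene of `video_a` is dropped while `video_b` keeps its single scene, which mirrors the N documents → M groups transformation.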

Properties

Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
group_by_field | str | Field path to group documents by, using dot notation. Documents with the same field value are grouped together. Common fields: 'source_object_id' (parent object from decomposition), 'root_object_id' (top-level ancestor in hierarchy), 'metadata.category' (nested categorical field), 'video_id' (media grouping), 'product_id' (e-commerce). Use dot notation for nested fields: 'metadata.user_id', 'lineage.source_id'. Performance: Indexed fields are faster for the future Qdrant native grouping optimization. Template support: Use {{inputs.group_field}} for dynamic grouping. | [optional] [default to 'source_object_id']
max_per_group | int | OPTIONAL. Maximum number of documents to keep per group. Documents are sorted by score (highest first) before limiting. Default: 10. Use 1 for deduplication (keeps only the highest-scoring doc per group). Use 3-5 for preview results (show top chunks per parent). Use 50+ for comprehensive results (show many chunks per parent). Performance: Lower values reduce response size and improve latency. Typical values: 1 (dedup), 5 (preview), 10 (default), 20 (detailed), 50 (comprehensive). | [optional] [default to 10]
output_mode | str | OPTIONAL. Controls which documents are returned per group. 'first': Return only the top document per group (deduplication, fastest). Use for: unique results per group (e.g., one video per brand). 'all': Return all documents grouped by field (default, shows full context). Use for: showing chunks within each parent object. 'flatten': Return all documents as a flat list (loses group structure). Use for: cases where all docs are needed but grouping metadata is not. Default: 'all'. Performance: 'first' is fastest (smallest response), 'all' preserves grouping, 'flatten' is lightest (no group metadata). | [optional] [default to 'all']
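
To make the dot-notation lookup and the three `output_mode` values more concrete, here is a minimal sketch in plain Python. The helpers `get_by_path` and `shape_output`, the `group_key`/`documents` keys, and the sample documents are assumptions made for illustration; the response format actually returned by the API is not specified in this page and may differ.

```python
def get_by_path(doc, path):
    """Resolve a dot-notation path such as 'metadata.category'.

    Returns None when any segment is missing (an assumption of this sketch).
    """
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def shape_output(groups, output_mode="all"):
    """Shape already grouped, score-sorted documents for each output mode."""
    if output_mode == "first":
        # Only the top document of every group.
        return [members[0] for members in groups.values() if members]
    if output_mode == "flatten":
        # Every document in one flat list; group structure is lost.
        return [doc for members in groups.values() for doc in members]
    if output_mode == "all":
        # Hypothetical grouped layout: one entry per group.
        return [{"group_key": k, "documents": m} for k, m in groups.items()]
    raise ValueError(f"unknown output_mode: {output_mode!r}")


docs = [
    {"id": "a1", "metadata": {"category": "news"}, "score": 0.90},
    {"id": "a2", "metadata": {"category": "sports"}, "score": 0.70},
    {"id": "a3", "metadata": {"category": "news"}, "score": 0.95},
    {"id": "a4", "score": 0.99},  # no metadata.category, so it is skipped here
]

groups = {}
for doc in docs:
    key = get_by_path(doc, "metadata.category")
    if key is not None:
        groups.setdefault(key, []).append(doc)
for members in groups.values():
    members.sort(key=lambda d: d["score"], reverse=True)

first_ids = [d["id"] for d in shape_output(groups, "first")]
flat_ids = [d["id"] for d in shape_output(groups, "flatten")]
all_groups = shape_output(groups, "all")
print(first_ids, flat_ids, [g["group_key"] for g in all_groups])
```

In this sketch 'first' behaves like `max_per_group=1`, while 'flatten' returns the same documents as 'all' without the per-group wrapper.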

Example

from mixpeek.models.stage_params_group_by import StageParamsGroupBy

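# example payload using the documented defaults; adjust the values as needed
json_str = '{"group_by_field": "source_object_id", "max_per_group": 10, "output_mode": "all"}'
# create an instance of StageParamsGroupBy from a JSON string
stage_params_group_by_instance = StageParamsGroupBy.from_json(json_str)
# print the JSON string representation of the object (call to_json on the instance, not the class)
print(stage_params_group_by_instance.to_json())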

# convert the object into a dict
stage_params_group_by_dict = stage_params_group_by_instance.to_dict()
# create an instance of StageParamsGroupBy from a dict
stage_params_group_by_from_dict = StageParamsGroupBy.from_dict(stage_params_group_by_dict)
