StageParamsCluster

Configuration for clustering documents from previous stage results.

Stage Category: REDUCE

Transformation: N documents → K clusters (where K < N typically)

Purpose: Dynamically clusters documents from the pipeline by their embeddings. Unlike group_by which groups by a pre-existing field, cluster discovers natural groupings in the data based on vector similarity.

Performance: Calls clustering inference service. Fast for typical retriever result sets (10-500 documents). For larger datasets, consider using pre-computed clusters with group_by instead.

When to Use:
- Discover themes/topics in search results
- Group semantically similar documents without pre-existing labels
- Analyze patterns in retrieved content
- "Find the 3 main themes in these results"
- Auto-categorize search results

When NOT to Use:
- When documents already have cluster/category labels (use group_by)
- For very large result sets (>1000 docs) - use pre-computed clusters
- When you need exact groupings (clustering is approximate)

Output Modes:
- "clusters": Returns K cluster summary documents with member lists
- "labeled": Returns original N documents with cluster_label added
- "representatives": Returns K representative documents (one per cluster)

Common Pipeline Position: FILTER → cluster (this stage) → ENRICH (summarize clusters)

Examples:
- "Find 3 themes in 60 ads" → cluster with n_clusters=3
- "Group similar products" → cluster with algorithm=hdbscan (auto K)
- "Discover topics in articles" → cluster with representatives output
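To make the three output modes concrete, here is a minimal sketch on toy data. It does not call the clustering service: the assignments list, the doc_id field, and the three helper functions are hypothetical, and the field names (cluster_id, cluster_score, member_count, members) follow the output_mode entry in the Properties table. Reporting member_count before truncation is an assumption.

```python
from collections import defaultdict

# Toy cluster assignments: (doc_id, cluster_id, distance to centroid).
# In the real stage these would come from the clustering inference service.
assignments = [
    ("ad_1", 0, 0.10),
    ("ad_2", 0, 0.30),
    ("ad_3", 1, 0.05),
    ("ad_4", 1, 0.20),
    ("ad_5", 1, 0.40),
]


def labeled(rows):
    """'labeled': the original N documents, each with cluster fields added."""
    return [
        {"doc_id": d, "cluster_id": c, "cluster_score": s} for d, c, s in rows
    ]


def clusters(rows, max_members_per_cluster=50):
    """'clusters': K summary documents with member lists, closest first."""
    by_cluster = defaultdict(list)
    for doc in labeled(rows):
        by_cluster[doc["cluster_id"]].append(doc)
    out = []
    for cid in sorted(by_cluster):
        members = sorted(by_cluster[cid], key=lambda m: m["cluster_score"])
        out.append({
            "cluster_id": cid,
            # Assumption: the count reflects all members, before truncation.
            "member_count": len(members),
            "members": members[:max_members_per_cluster],
        })
    return out


def representatives(rows):
    """'representatives': the document closest to each cluster centroid."""
    return [c["members"][0] for c in clusters(rows)]


print(len(labeled(assignments)))    # N = 5 documents
print(len(clusters(assignments)))   # K = 2 cluster summaries
print([r["doc_id"] for r in representatives(assignments)])
```

With this toy input, "labeled" keeps all five documents, "clusters" collapses them into two summaries, and "representatives" returns one document per cluster.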

Properties

Name	Type	Description	Notes
algorithm	str	Clustering algorithm to use: - hdbscan: Auto-determines number of clusters, handles noise (DEFAULT, recommended) - kmeans: Fast, requires n_clusters, spherical clusters - dbscan: Density-based, handles noise, requires eps tuning - agglomerative: Hierarchical, good for nested structures - spectral: Graph-based, good for non-convex clusters - gaussian_mixture: Probabilistic, soft cluster assignments Recommendation: Use 'hdbscan' for exploratory analysis, 'kmeans' when you know K.	[optional] [default to 'hdbscan']
n_clusters	int	Number of clusters to create. Required for kmeans, spectral, agglomerative, gaussian_mixture. Ignored for hdbscan and dbscan (auto-determined). If not specified for algorithms that need it, auto-calculated as min(8, N/10). Typical values: 3-5 for theme discovery, 5-10 for topic modeling, 10-20 for fine-grained categorization.	[optional] [default to null]
min_cluster_size	int	Minimum number of documents to form a cluster (HDBSCAN/DBSCAN only). Lower values = more clusters, may include noise. Higher values = fewer, denser clusters. Auto-adjusted for small datasets: min(min_cluster_size, N/3). Typical values: 3-5 for small results, 10-20 for large results.	[optional] [default to 5]
feature_uri	str	Feature URI specifying which embedding to cluster on. OPTIONAL - if not provided, auto-detects from the upstream feature_search stage. When a feature_search stage runs before cluster, its feature_uri is automatically tracked in the pipeline state and used for clustering. Use the mixpeek:// URI format: mixpeek://{extractor}@{version}/{output} Examples: - 'mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding' - 'mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1' - 'mixpeek://clip_extractor@v1/image_embedding' The feature_uri is resolved to the actual embedding field name on the documents (e.g., 'multimodal_extractor_v1_multimodal_embedding'). Only specify explicitly if you want to cluster on a different embedding than the one used in the feature_search stage.	[optional] [default to 'null']
output_mode	str	How to format the output: - 'clusters': Returns K cluster documents, each containing: - cluster_id: Cluster identifier - member_count: Number of documents in cluster - members: List of member documents - centroid: Cluster center vector Use for: Theme analysis, cluster summaries - 'labeled': Returns original N documents with added fields: - cluster_id: Assigned cluster - cluster_score: Distance to centroid (lower = closer) Use for: Downstream processing with cluster context - 'representatives': Returns K documents (one per cluster): - The document closest to each cluster centroid Use for: Quick sampling, representative examples	[optional] [default to 'clusters']
include_centroids	bool	Whether to include centroid vectors in output. Useful for downstream similarity comparisons or visualization. Set to False to reduce response size.	[optional] [default to True]
max_members_per_cluster	int	Maximum members to include per cluster in 'clusters' output mode. Documents are sorted by distance to centroid (closest first). Use to limit response size for large result sets.	[optional] [default to 50]
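
The table states two auto-adjustment formulas and one URI format. The sketch below restates them in plain Python so the arithmetic is easy to check. The function names are hypothetical, the use of integer division is an assumption (the table does not say how N/10 and N/3 are rounded), and behavior for very small N is not specified in the table.

```python
import re


def effective_n_clusters(n_docs, n_clusters=None):
    """Documented fallback when n_clusters is omitted: min(8, N/10)."""
    if n_clusters is not None:
        return n_clusters
    # Assumption: integer division; the rounding rule is not documented.
    return min(8, n_docs // 10)


def effective_min_cluster_size(n_docs, min_cluster_size=5):
    """Documented small-dataset adjustment: min(min_cluster_size, N/3)."""
    # Assumption: integer division, as above.
    return min(min_cluster_size, n_docs // 3)


# mixpeek://{extractor}@{version}/{output}
URI_PATTERN = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


def parse_feature_uri(uri):
    """Split a feature URI into extractor, version, and output parts."""
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a mixpeek:// feature URI: {uri!r}")
    return match.groupdict()


print(effective_n_clusters(60))        # min(8, 6)  -> 6
print(effective_n_clusters(500))       # min(8, 50) -> 8
print(effective_min_cluster_size(9))   # min(5, 3)  -> 3
print(parse_feature_uri("mixpeek://clip_extractor@v1/image_embedding"))
```

Under these assumptions, a 60-document result set with no n_clusters would be split into 6 clusters, and the default min_cluster_size of 5 only shrinks when fewer than 15 documents are present.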

Example

from mixpeek.models.stage_params_cluster import StageParamsCluster

# TODO update the JSON string below
json = "{}"
# create an instance of StageParamsCluster from a JSON string
stage_params_cluster_instance = StageParamsCluster.from_json(json)
# print the JSON string representation of the object
print(stage_params_cluster_instance.to_json())

# convert the object into a dict
stage_params_cluster_dict = stage_params_cluster_instance.to_dict()
# create an instance of StageParamsCluster from a dict
stage_params_cluster_from_dict = StageParamsCluster.from_dict(stage_params_cluster_dict)
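
The "{}" placeholder above can be replaced with a populated JSON string. The following sketch builds one payload for each of the three examples in the description, using only the standard json module so it runs without the SDK. Field names and allowed values come from the Properties table; the pairing of kmeans with n_clusters=3, the choice of output modes, and the variable names are illustrative assumptions, and whether the generated model accepts each payload has not been verified here.

```python
import json

# "Find 3 themes in 60 ads": n_clusters is ignored by hdbscan and dbscan,
# so an algorithm that honors it (kmeans here) is chosen for illustration.
themes_in_ads = {
    "algorithm": "kmeans",
    "n_clusters": 3,
    "output_mode": "clusters",
}

# "Group similar products": hdbscan determines the number of clusters itself.
similar_products = {
    "algorithm": "hdbscan",
    "min_cluster_size": 5,
    "output_mode": "labeled",
}

# "Discover topics in articles": one representative document per cluster,
# with centroid vectors dropped to keep the response small.
article_topics = {
    "algorithm": "hdbscan",
    "output_mode": "representatives",
    "include_centroids": False,
}

for params in (themes_in_ads, similar_products, article_topics):
    payload = json.dumps(params)
    # Sanity check that the string round-trips to the same parameters.
    assert json.loads(payload) == params
    print(payload)
```

Each printed string has the shape expected by a from_json call. Note that the example above binds the name json to a string, so a script that also imports the json module would need a different variable name, such as payload.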

