StageParamsSummarize

Configuration for multi-document summarization.

Stage Category: REDUCE

Transformation: N documents → 1 document (or N → M with group_by)

Purpose: Condense multiple documents into a single summary using an LLM. Unlike llm_enrich, which processes each document independently, Summarize provides all documents to the LLM in a single call, enabling cross-document synthesis and comparison.

When to Use:
- Generate a single answer from search results (RAG output)
- Create executive summaries from multiple sources
- Synthesize information that spans multiple documents
- Reduce a result set to key findings

When NOT to Use:
- Adding fields to each document (use llm_enrich instead)
- Simple filtering based on content (use llm_filter instead)
- When you need to preserve individual documents

Common Pipeline Position: feature_search → rerank → summarize

Template Variables:
- {{DOCUMENTS}}: Formatted list of all documents (required in prompt)
- {{DOC_COUNT}}: Number of documents being summarized
- {{GROUP_VALUE}}: Current group value (when using group_by)
- {{INPUT.*}}: Access query inputs
- {{CONTEXT.*}}: Access execution context

Examples:

Basic summarization:

    {
      "prompt": "Summarize these {{DOC_COUNT}} search results:\n\n{{DOCUMENTS}}",
      "provider": "google",
      "model_name": "gemini-2.0-flash"
    }

Question-answering from search results:

    {
      "prompt": "Answer this question: {{INPUT.question}}\n\nBased on these documents:\n{{DOCUMENTS}}",
      "provider": "openai",
      "model_name": "gpt-4o",
      "include_sources": true
    }

Per-category summarization:

    {
      "prompt": "Summarize documents about {{GROUP_VALUE}}:\n\n{{DOCUMENTS}}",
      "provider": "openai",
      "model_name": "gpt-4o-mini",
      "group_by": "metadata.category"
    }
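To make the placeholder substitution concrete, the following is a minimal sketch of how a prompt could be assembled from a prompt template and a per-document template. It is not Mixpeek's implementation: `resolve`, `render_template`, and `build_prompt` are hypothetical helpers, and rendering unknown placeholders as empty text is an assumption.

```python
import re
from typing import Any, Dict, List, Optional

# Matches {{NAME}} and dotted forms such as {{DOC.metadata.title}}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(path: str, scope: Dict[str, Any]) -> str:
    """Look up a dotted path such as 'DOC.metadata.title' in nested dicts."""
    value: Any = scope
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return ""  # assumption: unknown placeholders render as empty text
        value = value[key]
    return str(value)


def render_template(template: str, scope: Dict[str, Any]) -> str:
    """Replace every placeholder in the template with its value from scope."""
    return PLACEHOLDER.sub(lambda m: resolve(m.group(1), scope), template)


def build_prompt(
    prompt: str,
    documents: List[Dict[str, Any]],
    document_template: str = "[{{INDEX}}] {{DOC.content}}\n",
    inputs: Optional[Dict[str, Any]] = None,
) -> str:
    """Format each document, then fill the prompt-level placeholders."""
    formatted = "".join(
        render_template(document_template, {"INDEX": index, "DOC": doc})
        for index, doc in enumerate(documents, start=1)  # 1-based index
    )
    scope = {
        "DOCUMENTS": formatted,
        "DOC_COUNT": len(documents),
        "INPUT": inputs or {},
    }
    return render_template(prompt, scope)


if __name__ == "__main__":
    docs = [{"content": "Alpha"}, {"content": "Beta"}]
    print(build_prompt("Summarize these {{DOC_COUNT}} search results:\n\n{{DOCUMENTS}}", docs))
```

Because the substitution is a single pass, placeholder-like text inside a document's content is left as-is rather than expanded again.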

Properties

Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
prompt | str | REQUIRED. Prompt template for the LLM. Must include the {{DOCUMENTS}} placeholder. Available placeholders: {{DOCUMENTS}} (formatted list of all documents); {{DOC_COUNT}} (number of documents); {{GROUP_VALUE}} (current group value, when using group_by); {{INPUT.*}} (query input values); {{CONTEXT.*}} (execution context). | [optional] [default to 'Summarize the following {{DOC_COUNT}} documents concisely:\n\n{{DOCUMENTS}}']
provider | StageDefsLLMProvider | LLM provider to use. Supported providers: openai (GPT models: GPT-4o, GPT-4o-mini); google (Gemini models: Gemini 2.0 Flash); anthropic (Claude models: Claude 3.5 Sonnet/Haiku). If not specified, defaults to 'google'. Can be auto-inferred from model_name. | [optional]
model_name | str | Specific LLM model to use. If not specified, uses the provider default. Examples: gemini-2.0-flash, gpt-4o-mini, gpt-4o. | [optional] [default to 'null']
inference_name | str | DEPRECATED: Use 'provider' and 'model_name' instead. Legacy format: 'provider:model' (e.g., 'gemini:gemini-2.0-flash'). Kept for backward compatibility only. | [optional]
document_template | str | OPTIONAL. Template for formatting each document in {{DOCUMENTS}}. Available placeholders: {{INDEX}} (1-based document index); {{DOC.*}} (any document field, e.g., {{DOC.content}}, {{DOC.metadata.title}}). | [optional] [default to '[{{INDEX}}] {{DOC.content}}\n']
content_field | str | OPTIONAL. Primary field to extract content from each document. Used when {{DOC.content}} is referenced in document_template. Supports dot notation for nested fields. | [optional] [default to 'content']
group_by | str | OPTIONAL. Field to group documents by before summarization. When set, creates one summary per unique group value (N→M transformation). When not set, creates one summary for all documents (N→1 transformation). Use cases: 'metadata.category' (one summary per category); 'metadata.source' (one summary per source); 'metadata.date' (one summary per date). | [optional] [default to 'null']
output_field | str | OPTIONAL. Field name for the summary in the output document. | [optional] [default to 'summary']
include_sources | bool | OPTIONAL. Include source document IDs in output. When true, adds a 'source_document_ids' field to the output. Useful for citation and attribution. | [optional] [default to True]
include_metadata | bool | OPTIONAL. Include metadata about the summarization in output. Adds 'document_count', 'tokens_used', etc. | [optional] [default to True]
max_input_tokens | int | OPTIONAL. Maximum tokens to use for input documents. Documents exceeding this limit are truncated using truncation_strategy. The default of 8000 is safe for most models. | [optional] [default to 8000]
truncation_strategy | str | OPTIONAL. How to handle documents exceeding max_input_tokens. Strategies: 'drop_last' (include documents in order until the limit, drop the remaining ones); 'truncate_each' (give each document an equal token budget, truncate individually); 'smart' (prioritize by relevance score, truncate lower-scored documents first). | [optional] [default to 'drop_last']
temperature | float | OPTIONAL. LLM temperature for summary generation. Lower values (0.1-0.3) produce more focused, deterministic summaries. Higher values (0.7-1.0) produce more creative, varied summaries. The default of 0.3 suits factual summarization. | [optional] [default to 0.3]
max_output_tokens | int | OPTIONAL. Maximum tokens for the summary output. | [optional] [default to 1024]
output_schema | Dict[str, object] | OPTIONAL. JSON schema for structured output. When provided, LLM output is parsed as JSON matching this schema. | [optional]
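The difference between the 'drop_last' and 'truncate_each' strategies can be illustrated with a small sketch. This is an illustration of the behavior described in the table, not the service's code: the function names are hypothetical, and `count_tokens` is a stand-in that counts whitespace-separated words rather than real model tokens. The 'smart' strategy is omitted because it depends on relevance scores that are not specified here.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words (assumption)."""
    return len(text.split())


def drop_last(docs: List[str], max_input_tokens: int) -> List[str]:
    """Keep documents in order until the budget is reached, drop the rest."""
    kept: List[str] = []
    used = 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > max_input_tokens:
            break  # this document and everything after it is dropped
        kept.append(doc)
        used += cost
    return kept


def truncate_each(docs: List[str], max_input_tokens: int) -> List[str]:
    """Give every document an equal share of the budget and cut it to fit."""
    if not docs:
        return []
    per_doc = max_input_tokens // len(docs)
    return [" ".join(doc.split()[:per_doc]) for doc in docs]


if __name__ == "__main__":
    docs = ["a b c", "d e f g", "h i"]
    print(drop_last(docs, 6))
    print(truncate_each(docs, 6))
```

With 'drop_last', later documents disappear entirely, which suits ranked results where the tail matters least; 'truncate_each' keeps every document represented at the cost of cutting each one short.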

Example

from mixpeek.models.stage_params_summarize import StageParamsSummarize

# TODO update the JSON string below
json = "{}"
# create an instance of StageParamsSummarize from a JSON string
stage_params_summarize_instance = StageParamsSummarize.from_json(json)
# print the JSON string representation of the object
print(stage_params_summarize_instance.to_json())

# convert the object into a dict
stage_params_summarize_dict = stage_params_summarize_instance.to_dict()
# create an instance of StageParamsSummarize from a dict
stage_params_summarize_from_dict = StageParamsSummarize.from_dict(stage_params_summarize_dict)
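The `"{}"` placeholder in the example can be replaced with a real configuration. The sketch below builds such a JSON string with the standard library only, using the question-answering values from the examples earlier in this document. The variable names are illustrative, and the call to `StageParamsSummarize.from_json` is left out so the snippet runs without the client installed.

```python
import json as json_lib

# Values copied from the question-answering example in this document.
params = {
    "prompt": "Answer this question: {{INPUT.question}}\n\nBased on these documents:\n{{DOCUMENTS}}",
    "provider": "openai",
    "model_name": "gpt-4o",
    "include_sources": True,
}

# The prompt must contain the {{DOCUMENTS}} placeholder, so check before serializing.
if "{{DOCUMENTS}}" not in params["prompt"]:
    raise ValueError("prompt must include the {{DOCUMENTS}} placeholder")

# Serialize to a JSON string suitable for passing to from_json.
params_json = json_lib.dumps(params)
print(params_json)
```

The module is imported under the alias `json_lib` because the example above already uses `json` as a variable name.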
