The system configuration is driven by a YAML file. The default config file is `config/system.yaml`.
| Setting (core.) | Default | Description |
|---|---|---|
| max_threads | 50 | Maximum number of threads; set to null for the Python default. 50 is OK on a typical laptop; check your Judge-LLM service for max requests per minute. |
| fail_on_invalid_data | true | If false, don't fail on invalid conversations (e.g., a missing context field for some metrics). |
| skip_on_failure | false | If true, skip remaining turns and conversation metrics when a turn evaluation fails (FAIL or ERROR). Can be overridden per conversation in the input data YAML file. |
```yaml
core:
  max_threads: 50
  fail_on_invalid_data: true
  skip_on_failure: false  # Set to true to stop evaluation on first failure
```

Define a centralized pool of LLM configurations for the Judge Panel feature.
Note: The `llm` config below will be deprecated. New deployments should use `llm_pool` + `judge_panel`.
| Setting | Description |
|---|---|
| `llm_pool.defaults.cache_dir` | Cache directory (default: .caches/llm_cache) |
| `llm_pool.defaults.timeout` | Request timeout in seconds (default: 300) |
| `llm_pool.defaults.num_retries` | Retry attempts (default: 3) |
| `llm_pool.defaults.parameters.temperature` | Sampling temperature |
| `llm_pool.defaults.parameters.max_completion_tokens` | Max tokens in response |
| `llm_pool.defaults.parameters.*` | Any additional provider-supported LLM parameter (e.g., top_p, frequency_penalty) |
| `llm_pool.models.<id>.provider` | LLM provider (required) |
| `llm_pool.models.<id>.model` | Model name |
| `llm_pool.models.<id>.parameters.*` | Model-specific parameter overrides (merged with defaults; model takes priority) |
Dynamic Parameters: The `parameters` dict accepts any key-value pair supported by the LLM provider. Known parameters: `temperature`, `max_completion_tokens`. Unsupported parameters are silently dropped by the provider.

Removing Defaults: To remove an inherited default parameter, set it to null in the model config:
```yaml
models:
  no-temp-model:
    provider: openai
    parameters:
      temperature: null  # Removes default temperature, won't be sent to API
```

```yaml
llm_pool:
  defaults:
    cache_dir: ".caches/llm_cache"
    num_retries: 2
    parameters:
      temperature: 0.0
      max_completion_tokens: 512
  models:
    judge-4o-mini:
      provider: openai
      model: gpt-4o-mini
    judge-4.1-mini:
      provider: openai
      model: gpt-4.1-mini
      parameters:
        temperature: 0.3  # Overrides default
```

Use multiple LLMs as judges to reduce bias and improve evaluation accuracy.
| Setting | Description |
|---|---|
| `judge_panel.judges` | List of model IDs from `llm_pool.models` (required) |
| `judge_panel.enabled_metrics` | Metrics using the full panel (if unset, all LLM metrics use the panel) |
| `judge_panel.aggregation_strategy` | How to combine judge scores: `max`, `average`, or `majority_vote` (see below) |
Aggregation strategies (multiple judges only; errored judges are excluded):

| Strategy | Score | Pass / fail |
|---|---|---|
| `max` (default) | Highest score | Score vs. metric threshold |
| `average` | Mean of scores | Mean vs. metric threshold |
| `majority_vote` | Mean of scores | A strict majority of judges must individually meet the metric threshold -- more than half must pass (ties fail). |
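For instance, here is how the three strategies treat a hypothetical two-judge panel (the judge IDs reuse the pool example above; the scores and the 0.8 threshold are made up for illustration):

```yaml
# Hypothetical: two judges score a metric whose threshold is 0.8.
#   judge-4o-mini  -> 0.9  (meets threshold)
#   judge-4.1-mini -> 0.6  (below threshold)
#
# max:           score = 0.9  -> PASS (0.9 >= 0.8)
# average:       score = 0.75 -> FAIL (0.75 < 0.8)
# majority_vote: score = 0.75; only 1 of 2 judges passes -> FAIL (a tie is not a strict majority)
judge_panel:
  judges:
    - judge-4o-mini
    - judge-4.1-mini
  aggregation_strategy: majority_vote
```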
```yaml
judge_panel:
  judges:
    - judge-4o-mini
    - judge-4.1-mini
  aggregation_strategy: max
  # enabled_metrics: ["ragas:faithfulness"]  # Optional: limit to specific metrics
```

Output: Includes the aggregated score and a `judge_scores` JSON array with individual results.
Deprecation notice: The `llm` section will be replaced by `llm_pool` + `judge_panel`.

This section configures the LLM. It is used when `judge_panel` is not configured (even if `llm_pool` exists).
| Setting (llm.) | Default | Description |
|---|---|---|
| provider | "openai" | LLM provider: openai, hosted_vllm, watsonx, azure, gemini |
| model | "gpt-4o-mini" | Model name for the provider |
| ssl_verify | true | Verify SSL certificates for the specified provider |
| ssl_cert_file | null | Path to a custom CA certificate file (PEM format, merged with certifi defaults) |
| temperature | 0.0 | Generation temperature |
| max_tokens | 512 | Maximum tokens in response |
| timeout | 300 | Request timeout in seconds |
| num_retries | 3 | Maximum retry attempts |
| cache_dir | ".caches/llm_cache" | Directory with cached LLM responses |
| cache_enabled | true | Is the LLM cache enabled? |
Dynamic LLM parameters are only supported through the `llm_pool` config. To use dynamic parameters, migrate to `llm_pool`.
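As a rough migration sketch, the legacy `llm` defaults above map onto an `llm_pool` entry like the following; the model ID `judge-default` and the `top_p` value are illustrative, not shipped defaults:

```yaml
llm_pool:
  defaults:
    cache_dir: ".caches/llm_cache"
    timeout: 300
    num_retries: 3
    parameters:
      temperature: 0.0
      max_completion_tokens: 512
      top_p: 0.9  # dynamic parameter -- not expressible in the legacy `llm` section
  models:
    judge-default:  # illustrative model ID
      provider: openai
      model: gpt-4o-mini
judge_panel:
  judges:
    - judge-default
```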
Note: For RHAIIS, models.corp, or other vLLM-based inference servers, use the `hosted_vllm` provider configuration. models.corp additionally requires certificate setup via the `ssl_cert_file` configuration option.
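A minimal sketch for such a deployment (the model name and certificate path are placeholders; how the server endpoint is supplied -- e.g., via a provider environment variable such as LiteLLM's `HOSTED_VLLM_API_BASE` -- depends on your setup):

```yaml
llm:
  provider: hosted_vllm
  model: "meta-llama/Llama-3.1-8B-Instruct"        # placeholder model name
  ssl_verify: true
  ssl_cert_file: "/etc/pki/tls/certs/corp-ca.pem"  # placeholder path; needed for models.corp
```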
Some Ragas metrics use embeddings to compute similarity between generated answers (or variants).
| Setting (embedding.) | Default | Description |
|---|---|---|
| provider | "openai" | Supported providers: "openai", "gemini", or "huggingface". "huggingface" downloads the model to the local machine and runs inference locally (requires optional dependencies). |
| model | "text-embedding-3-small" | Model name for the provider |
| provider_kwargs | {} | Optional arguments for the model |
| cache_dir | ".caches/embedding_cache" | Directory with cached embeddings |
| cache_enabled | true | Is the embeddings cache enabled? |
By default, lightspeed-evaluation uses remote embedding providers (OpenAI, Gemini) which require no additional dependencies and are lightweight to install.
Local embedding models (HuggingFace/sentence-transformers) are optional and require additional packages including PyTorch (~6GB). This is to avoid long download times and wasted disk space for users who only need remote embeddings.
To use local HuggingFace embeddings, install the optional dependencies:
```bash
# Using pip
pip install 'lightspeed-evaluation[local-embeddings]'

# Using uv
uv sync --extra local-embeddings
```

```yaml
llm:
  provider: openai
  model: gpt-4o-mini
  ssl_verify: true
  ssl_cert_file: null
  temperature: 0.0
  max_tokens: 512
  timeout: 300
  num_retries: 3
  cache_dir: ".caches/llm_cache"
  cache_enabled: true

embedding:
  provider: "openai"
  model: "text-embedding-3-small"
  provider_kwargs: {}
  cache_dir: ".caches/embedding_cache"
  cache_enabled: true
```

```yaml
llm:
  provider: "gemini"  # Judge-LLM: Google Gemini
  model: "gemini-1.5-pro"
  temperature: 0.0
  max_tokens: 512
  timeout: 120
  num_retries: 3

embedding:
  provider: "huggingface"
  model: "sentence-transformers/all-mpnet-base-v2"
  provider_kwargs:
    # cache_folder: <path_with_pre_downloaded_model>
    model_kwargs:
      device: "cpu"  # Use "gpu" for NVIDIA-accelerated inference
```

This section configures the inference API used to generate the responses. It can be any Lightspeed-Core-compatible API, and it can be integrated with other APIs with minimal changes.
| Setting (api.) | Default | Description |
|---|---|---|
| enabled | true | Enable/disable API calls |
| api_base | "http://localhost:8080" | Base API URL |
| endpoint_type | "streaming" | `streaming` or `query` endpoint |
| timeout | 300 | API request timeout in seconds |
| provider | "openai" | LLM provider for API queries (optional) |
| model | "gpt-4o-mini" | Model to use for API queries (optional) |
| no_tools | null | Whether to bypass tools (optional) |
| system_prompt | null | Custom system prompt (optional) |
| cache_dir | ".caches/api_cache" | Directory with cached API responses |
| cache_enabled | true | Is the API cache enabled? |
| mcp_headers | null | MCP headers configuration for authentication (see below) |
| num_retries | 3 | Maximum number of retry attempts for API calls on 429 errors |
The framework supports two methods for MCP server authentication:
The mcp_headers configuration provides a flexible way to configure authentication for individual MCP servers:
| Setting (api.mcp_headers.) | Default | Description |
|---|---|---|
| enabled | true | Enable/disable MCP headers functionality |
| servers | {} | Dictionary of MCP server configurations |
For each server in servers, you can configure:
| Setting (api.mcp_headers.servers.<server_name>.) | Default | Description |
|---|---|---|
| env_var | required | Environment variable containing the authentication token |
| header_name | "Authorization" | Custom header name (optional) |
When `mcp_headers.enabled` is false, the system falls back to using the `API_KEY` environment variable for all MCP server authentication.
With the API enabled:

- Real-time data generation: Queries are sent to the external API
- Dynamic responses: `response` and `tool_calls` fields are populated by the API
- Conversation context: Maintained across turns
- Authentication: Use the `API_KEY` environment variable
- Data persistence: Saves the amended `response` and `tool_calls` to the output data file in the output directory so the data can be reused with the API option disabled

With the API disabled:

- Static data mode: Uses pre-filled `response` and `tool_calls` from the input data
- Faster execution: No external API calls (LLM-as-a-judge calls still happen)
- Reproducible results: The same response data is used across runs
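For example, switching a previous run to static data mode is just a matter of flipping the flag (a minimal sketch; the pre-filled data comes from the saved output of an earlier run with the API enabled):

```yaml
api:
  enabled: false  # Static data mode: reuse pre-filled response/tool_calls from the input data
```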
```yaml
api:
  enabled: true
  api_base: http://localhost:8080
  endpoint_type: streaming
  timeout: 300
  provider: openai
  model: gpt-4o-mini
  no_tools: null
  system_prompt: null
  cache_dir: ".caches/api_cache"
  cache_enabled: true
  num_retries: 3
  # MCP Server Authentication Configuration
  mcp_headers:
    enabled: true  # Enable MCP headers functionality
    servers:  # MCP server configurations
      filesystem-tools:
        env_var: API_KEY  # Environment variable containing the token/key
      another-mcp-server:
        env_var: ANOTHER_API_KEY  # Use a different environment variable
```

Important note for lightspeed-stack API users: To use the MCP headers functionality with the lightspeed-stack API, you need to modify the `llama_stack_api/openai_responses.py` file in your lightspeed-stack installation:
In the `OpenAIResponsesToolMCP` class, change the `authorization` parameter's `exclude` field from `True` to `False`:
```python
# In llama_stack_api/openai_responses.py
class OpenAIResponsesToolMCP:
    authorization: Optional[str] = Field(
        default=None,
        exclude=False  # Change this from True to False
    )
```

This change allows the authorization headers to be properly passed through to MCP servers.
Metrics are enabled globally (as described below) or within the input data for each individual conversation or individual turn (question/answer pair). To enable a metric globally, set its `default` metadata attribute to true.
Metric metadata are optional attributes for a given metric. Typically they contain the following:

- `default` -- `true` or `false`. Is this metric applied by default when no `turn_metrics` are specified?
- `threshold` -- numerical value; if the returned metric value is greater than or equal to the threshold, the metric is marked PASS in the results. If the returned metric value is lower, it is marked FAIL. In case of an error it is marked ERROR.
- `description` -- description of the metric.
For GEval metrics (`geval:...`), you can also set:

- `criteria` (required): Natural-language description of what to evaluate. GEval uses this to generate evaluation steps when `evaluation_steps` is not provided.
- `evaluation_params`: List of field names to include (e.g., `query`, `response`, `expected_response`). GEval auto-detect is not supported.
- `evaluation_steps` (optional): List of step-by-step instructions the LLM judge follows. If omitted, GEval generates steps from `criteria`. When provided together with `rubrics`, both are used: steps define how to evaluate, rubrics define score-range boundaries; neither overrides the other.
- `rubrics` (optional): List of `{ score_range: [min, max], expected_outcome: "..." }`. Score range is 0–10 inclusive; DeepEval expects non-overlapping ranges and validates them. Rubrics confine the judge's output to these ranges. The final score is normalized to a 0–1 range.
GEval returns a score in [0, 1].
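As a sketch, a GEval entry might look like the following; the metric name `geval:answer_quality`, the threshold, the criteria text, and the rubric wording are all illustrative, not shipped defaults:

```yaml
metrics_metadata:
  turn_level:
    "geval:answer_quality":  # illustrative metric name
      threshold: 0.7
      default: false
      criteria: "Judge whether the response answers the query accurately and concisely."
      evaluation_params:
        - query
        - response
        - expected_response
      rubrics:
        - score_range: [0, 3]
          expected_outcome: "Response is off-topic or contradicts the expected response."
        - score_range: [4, 7]
          expected_outcome: "Response is partially correct but incomplete."
        - score_range: [8, 10]
          expected_outcome: "Response is accurate, complete, and concise."
```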
By default, no metrics are computed (`default` is set to false).
| Setting (metrics_metadata.) | Description |
|---|---|
| turn_level | Turn level metrics metadata |
| conversation_level | Conversation level metrics metadata |
For a complete example with all metrics, see the default config file `config/system.yaml`.
```yaml
# Metrics Configuration with thresholds and defaults
metrics_metadata:
  turn_level:
    "ragas:response_relevancy":
      threshold: 0.8
      description: "How relevant the response is to the question"
      default: true  # Used by default when turn_metrics is null
    "ragas:faithfulness":
      threshold: 0.8
      description: "How faithful the response is to the provided context"
      default: false  # Only used when explicitly specified in the input data
  conversation_level:
    "deepeval:conversation_completeness":
      threshold: 0.8
      description: "How completely the conversation addresses user intentions"
```

Lightspeed Evaluation can persist results to files and/or databases. The storage section configures one or more storage backends.
The file backend generates CSV, JSON, and TXT reports.
| Setting (storage[type="file"].) | Default | Description |
|---|---|---|
| type | "file" | Backend type (required) |
| output_dir | "./eval_output" | Directory for output files |
| base_filename | "evaluation" | Prefix for output filenames |
| enabled_outputs | ["csv", "json", "txt"] | Output types to generate |
| csv_columns | all listed below | Columns to include in the CSV |
| summary_config_sections | ["llm", "embedding", "api"] | Config sections in the summary |
Save results to a database for querying and analysis. Supports SQLite, PostgreSQL, and MySQL.
| Setting (storage[type="sqlite/postgres/mysql"].) | Default | Description |
|---|---|---|
| type | required | "sqlite", "postgres", or "mysql" |
| database | required | Database name or file path (SQLite) |
| table_name | "evaluation_results" | Table name for results |
| host | required* | Database host (*required for postgres/mysql) |
| port | default per type | Database port (5432 for postgres, 3306 for mysql) |
| user | required* | Database user (*required for postgres/mysql) |
| password | required* | Database password (*required for postgres/mysql) |
Note: Database storage is incremental - results are saved as each conversation completes. Storage failures are logged as warnings but don't stop the evaluation.
| Output type (in enabled_outputs) | Description |
|---|---|
| csv | Detailed results CSV |
| json | Summary JSON with statistics |
| txt | Human-readable summary |
| CSV column name | Description |
|---|---|
| conversation_group_id | Conversation group id |
| tag | Tag for grouping eval conversations |
| turn_id | Turn id |
| metric_identifier | Metric name |
| result | Result -- PASS/FAIL/ERROR/SKIPPED |
| score | Score returned by the metric |
| threshold | Threshold from the setup |
| reason | Human readable description of the result |
| query | Original input query |
| response | Original input response (could be generated by Lightspeed Core API) |
| execution_time | Total time for processing the metric |
| api_input_tokens | Number of input tokens used in API call (see below) |
| api_output_tokens | Number of output tokens from API call (see below) |
| judge_llm_input_tokens | Number of input tokens used by Judge LLM |
| judge_llm_output_tokens | Number of output tokens from Judge LLM |
| embedding_tokens | Number of tokens used for embedding calls |
| tool_calls | Tool calls made during the turn (JSON format) |
| contexts | Context documents used for evaluation |
| expected_response | Expected response for comparison |
| expected_intent | Expected intent for intent evaluation |
| expected_keywords | Expected keywords for keyword matching |
| expected_tool_calls | Expected tool calls for tool evaluation |
| metric_metadata | Metric level metadata (excluding threshold & metric_identifier) |
For turn-level metrics, api_input_tokens and api_output_tokens reflect the token usage for that specific turn’s API call. For conversation-level metrics, these values are the sum of token usage across all turns in the conversation.
Note: The `api_input_tokens` and `api_output_tokens` columns repeat the same value for every metric row belonging to the same turn (or conversation). Summing these columns across all CSV rows will over-count. For accurate API token totals, use the summary statistics in the JSON or TXT reports.
```yaml
storage:
  - type: "file"
    output_dir: "./eval_output"
    base_filename: "evaluation"
    enabled_outputs:
      - "csv"
      - "json"
      - "txt"
    csv_columns:
      - "conversation_group_id"
      - "turn_id"
      - "metric_identifier"
      - "result"
      - "score"
      - "reason"
```

```yaml
storage:
  - type: "file"
    output_dir: "./eval_output"
    base_filename: "evaluation"
    enabled_outputs:
      - csv
      - json
  - type: "sqlite"
    database: "./eval_results.db"
    table_name: "evaluation_results"
```

```yaml
storage:
  - type: "file"
    output_dir: "./eval_output"
  - type: "postgres"
    database: "evaluations"
    host: "localhost"
    port: 5432
    user: "admin"
    password: "secret"
```

Several output graphs summarizing the evaluation results are provided. The graphs are generated by matplotlib.
Note: in some cases generation is slowed down by matplotlib connecting to the local X server. To work around this, set the `DISPLAY` environment variable to a non-existent value.
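One way to do this from the config itself is the `environment` section described below, since those variables are set before imports (a sketch; any non-existent display value works, as does exporting the variable in your shell before launching):

```yaml
environment:
  DISPLAY: "none"  # Non-existent display: matplotlib won't connect to a local X server
```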
| Setting (visualization.) | Default | Description |
|---|---|---|
| figsize | [12, 8] | Graph size (width, height) in inches |
| dpi | 300 | The resolution of the figure in dots per inch |
| enabled_graphs | all listed in the table below | List of the graphs to generate |
| Graph name | Description |
|---|---|
| pass_rates | Pass rate bar chart |
| score_distribution | Score distribution box plot |
| conversation_heatmap | Heatmap of conversation performance |
| status_breakdown | Pie chart for pass/fail/error breakdown |
```yaml
visualization:
  figsize: [12, 8]
  dpi: 300
  enabled_graphs:
    - "pass_rates"
    - "score_distribution"
    - "conversation_heatmap"
    - "status_breakdown"
```

It is possible to set values of environment variables. The variables are set before imports, affecting certain libraries/packages. See the example below.

```yaml
environment:
  DEEPEVAL_TELEMETRY_OPT_OUT: "YES"  # Disable DeepEval telemetry
  DEEPEVAL_DISABLE_PROGRESS_BAR: "YES"  # Disable DeepEval progress bars
  LITELLM_LOG: ERROR  # Suppress LiteLLM verbose logging
```

Logging is highly configurable, even for specific Python packages. Possible logging levels are:
- `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`
| Setting (logging.) | Default | Description |
|---|---|---|
| source_level | INFO | Source code logging level |
| package_level | ERROR | Package logging level (imported libraries) |
| log_format | "%(asctime)s - %(name)s - %(levelname)s - %(message)s" | Log format and display options |
| show_timestamps | true | Show timestamps in logging messages? |
| package_overrides | none | Per-package log levels (override package_level for specific libraries) |
```yaml
logging:
  source_level: INFO
  package_level: ERROR
  log_format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  show_timestamps: true
  package_overrides:
    httpx: ERROR
    urllib3: ERROR
    requests: ERROR
    matplotlib: ERROR
    LiteLLM: WARNING
    DeepEval: WARNING
    ragas: WARNING
```