Releases · lightspeed-core/lightspeed-evaluation
LightSpeed Evaluation v0.6.0
Key Changes
- Panel of Judges: Use multiple LLMs as judges with configurable aggregation strategies (max, average, majority_vote)
- LLM Pool & Dynamic Parameters: Centralized configuration for multiple LLMs with flexible parameter management and inheritance
- Token Tracking Improvement: Per-judge and conversation-level token usage tracking
- MCP Server Headers: Support for MCP Server specific headers with backward compatibility
- Bug Fixes: Resolved DeepEval cache issues and metric metadata override behavior
- Web Dashboard: Interactive visualization for exploring and comparing evaluation results (PoC - No Official Support)
Maintenance:
- Package/Dependency Upgrades: Version bumps for dependencies (key change: RAGAS 0.4.0)
- Requirements Files: requirements*.txt files with pinned versions
- CPU-Only PyTorch Support: Reduced package size for local embedding model usage
Deprecation Notice:
- The single `llm:` configuration is deprecated. Migrate to the `llm_pool` + `judge_panel` configuration (illustrative sketch below). See docs/configuration.md for details.
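A minimal migration sketch is shown below. The key names and values (judge ids, provider/model fields, the `aggregation` key) are illustrative assumptions rather than the tool's confirmed schema; docs/configuration.md is the authoritative reference. The aggregation strategies listed come from the release notes above.

```yaml
# Deprecated style: a single judge LLM.
# llm:
#   provider: openai
#   model: gpt-4o-mini

# Replacement sketch (key names are assumptions; see docs/configuration.md):
llm_pool:
  - id: judge_a
    provider: openai
    model: gpt-4o-mini
    temperature: 0.0            # per-judge parameters can be set or inherited here
  - id: judge_b
    provider: watsonx
    model: granite-3-8b-instruct

judge_panel:
  judges: [judge_a, judge_b]    # panel members referenced from llm_pool
  aggregation: majority_vote    # release notes mention max, average, majority_vote
```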
Pull Requests
- bump up eval tool version v0.6.0 by @asamal4 in #182
- [LEADS-218] Improve partial tool match reporting with detailed extra tool information by @bsatapat-jpg in #181
- chore: refactor e2e tests by @VladimirKadlec in #183
- [LEADS-199] Add CPU-only PyTorch support to reduce package size by @bsatapat-jpg in #185
- add e2e lcore Makefile target by @VladimirKadlec in #186
- [LEADS-192] feat: integrate panel of judges by @asamal4 in #184
- [LEADS-276] chore: consistent config exception for config data model val by @asamal4 in #187
- [LEADS-253] Deepevals cache miss fixed by @xmican10 in #188
- chore: update docs for judge panel feature by @asamal4 in #191
- feat: add web dashboard by @rioloc in #172
- [LEADS-267] fix: metric metadata override only changes given key by @asamal4 in #192
- [LEADS-292] chore: Add Claude skill for PR review by @asamal4 in #195
- [LEADS-25] Support new MCP header by @Anxhela21 in #175
- [LEADS-232] Bump up dependencies for Python 3.11, 3.12, 3.13 support by @bsatapat-jpg in #189
- fix: set mcp header to false in example config by @asamal4 in #196
- [LEADS-252] Conversation-level metrics API token usage logging in CSV by @xmican10 in #194
- [LEADS-195] feat: Add majority voting & avg to panel aggregation by @asamal4 in #198
- [LEADS-256] SQL Storage Backend Foundation by @bsatapat-jpg in #197
- [LEADS-211] feat: ability to add/remove model parameters from config by @asamal4 in #200
- Add uv.lock and requirements.txt for lsc_agent_eval by @eranco74 in #201
- [LEADS-232] Embedding caching handling with ragas 0.4 by @bsatapat-jpg in #199
- chore: update sample system yaml with new llm pool and judge panel by @asamal4 in #204
- chore: pin upper bound in pyproject, add req text files & add make targets by @asamal4 in #203
- chore: readme - Add missing new features & cleanup by @asamal4 in #207
New Contributors
Full Changelog: v0.5.0...v0.6.0
LightSpeed Evaluation v0.5.0
Key Changes
- Programmatic Integration: Use Lightspeed Evaluation as a Python library in your applications
- Tool Results Evaluation: Evaluate tool execution results via regex matching
- User-Defined Metrics with Rubrics: Support for rubrics in custom GEval metrics (illustrative sketch after this list)
- API Retry Mechanism: Automatic retries for API calls on rate-limit errors
- Token Tracking Fixes: Accurate token counting for JudgeLLM in multi-threaded evaluations and cached re-runs
- SSL Certificate Verification: Added missing SSL verification for the DeepEval integration
- Output Fixes: Added missing metric_metadata values in the CSV report
- Error Handling Improvements: Skip a metric when API-populated fields are missing and set its status to Error
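As a rough illustration of the rubrics feature, a user-defined GEval metric might be described along the following lines. Every key name and value here is a hypothetical placeholder, not the documented schema; see the project docs for the actual format.

```yaml
# Hypothetical sketch of a user-defined GEval metric with rubrics.
# Field names are illustrative assumptions, not the documented schema.
metrics:
  - name: custom_answer_quality
    type: geval
    criteria: "Is the response technically accurate and actionable?"
    rubrics:
      - score: 0
        description: "Incorrect or irrelevant response."
      - score: 5
        description: "Partially correct but misses key steps."
      - score: 10
        description: "Accurate, complete, and actionable."
```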
Pull Requests
- bump up eval tool version by @asamal4 in #152
- Add ability to evaluate the tool results by @bparees in #151
- Parse new LCORE tool_call format by @bparees in #150
- remove unittest mock from streaming parser tests by @asamal4 in #154
- Include toolcall result data in output by @bparees in #155
- Add LiteLLM drop_params Support for Multi-Provider Compatibility by @arin-deloatch in #153
- Add `setup_ssl_verify()` to `DeepEvalLLMManager` by @arin-deloatch in #158
- [LEADS-217] Fix token count preservation on evaluation errors by @bsatapat-jpg in #157
- add llm pool & judge panel config by @asamal4 in #156
- [LEADS-236] chore: add exact folders to docstyle make target by @asamal4 in #162
- [LEADS-235] chore: add repo approvers by @asamal4 in #161
- [LEADS-230] fix: missing metric_metadata value in csv by @asamal4 in #163
- [LEADS-191] feat: handle judge panel config in manager by @asamal4 in #164
- [LEADS-208] Fix TokenTracker double-counting in multi-thread evaluation by @bsatapat-jpg in #159
- chore: provide exact directories for black/ruff check by @asamal4 in #167
- [LEADS-171] feat: add rubrics support for user-defined GEval metrics by @asamal4 in #166
- [LEADS-242] chore: add data model for user-defined metrics by @asamal4 in #168
- Replace identical data test with precise delta validation test by @asamal4 in #171
- [LEADS-231] fix: set result status as Error when API populated field is missing by @asamal4 in #169
- Leads 233 retry api call in case of error by @xmican10 in #173
- feat: add e2e test with LSC api by @VladimirKadlec in #170
- LEADS-240: Token usage should be 0 for a re-run with successful cache by @xmican10 in #176
- [LEADS-226] chore: enforce pytest usage instead of unittest by @asamal4 in #178
- feat: add programmatic API for library integration by @narmaku in #177
- fix: add back pull_request_target trigger by @VladimirKadlec in #179
- [LEADS-241] judge llm token counter for deepeval by @xmican10 in #180
New Contributors
Full Changelog: v0.4.0...v0.5.0
LightSpeed Evaluation v0.4.0
What's Changed
Key Changes
- Flexible Tool Evaluation: Configurable ordered/unordered and full/partial match modes for tool call validation (sketch after this list)
- Classical Evaluation Metrics: Support for traditional evaluation metrics (BLEU, ROUGE, distance metrics)
- Alternate Expected Response: Ability to set alternate ground-truth responses for static evaluation metrics
- Eval Configuration Tracking: Evaluation configuration details now included in generated reports for better reproducibility
- API Latency Metrics: Latency tracking and reporting for API performance analysis (for API streaming endpoint)
- Data Grouping: Tag-based grouping of evaluation conversations for better organization
- Data Filtering: Filter evaluation datasets by tags and conversation IDs (CLI arguments) for targeted testing
- Cache Warmup: New optional CLI argument to pre-warm (clear) caches before evaluation runs
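A hedged sketch of how the ordered/unordered and full/partial tool-matching modes might be expressed in system.yaml; the section and key names below are assumptions for illustration (a `full_match` flag is mentioned in #145, ordered matching in #136).

```yaml
# Illustrative only: key names are assumptions, not the confirmed system.yaml schema.
tool_eval:
  ordered: false     # unordered mode: expected tool calls may appear in any sequence
  full_match: false  # partial mode: expected calls only need to be a subset of the actual calls
```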
Pull Requests
- bump eval to v0.4.0 by @asamal4 in #128
- fix: azure env variable names for judgeLLM by @asamal4 in #129
- [LEADS-141] Add Latency Metrics to Evaluation Reports by @bsatapat-jpg in #127
- chore: consolidate test_data models by @asamal4 in #131
- chore: refactor generator & statistics module by @asamal4 in #132
- Add optional property tag to group eval conversations by @asamal4 in #134
- add git hooks by @VladimirKadlec in #133
- [LEADS-172] Support classical evaluation metrics by @bsatapat-jpg in #130
- fix: align docs for updated make targets by @asamal4 in #135
- [LEADS-153] Adding the ordered matching logic in tool eval by @bsatapat-jpg in #136
- [LEADS-153] Implement match logic (full/partial) by @bsatapat-jpg in #137
- Remove duplicate data validation in pipeline by @asamal4 in #141
- chore: refactor evaluation runner by @asamal4 in #140
- feat: add data filter by tags & conv_ids by @asamal4 in #143
- [LEADS-153] Wiring the configuration and adding the config in system.yaml by @bsatapat-jpg in #139
- [LEADS-182] - Add eval config data to the report by @arin-deloatch in #142
- Leads 6 set expected responses by @xmican10 in #138
- map max_tokens to max_completion_tokens internally by @asamal4 in #144
- fix: Do subset matching for full_match=false by @saswatamcode in #145
- Enhance test quality by @xmican10 in #146
- use .model_dump instead of .dict by @asamal4 in #147
- add cache-warmup flag by @VladimirKadlec in #149
- Leads 212 remove unittest mocking by @xmican10 in #148
New Contributors
- @saswatamcode made their first contribution in #145
Full Changelog: v0.3.0...v0.4.0
LightSpeed Evaluation v0.3.0
What's Changed
Key Changes
- Token Usage Statistics: Track and report token consumption during evaluations (both API and JudgeLLM usage)
- Certificate Support for JudgeLLM: Configure custom certificates when connecting to Judge LLM endpoints
- Skip on Failure: Optional config to skip the remaining evaluations in a conversation group when any evaluation criterion fails (sketch after this list)
- Optional Packages: torch and nvidia-* packages are now optional, significantly reducing install size for use cases that don't require them
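A rough sketch of what the skip-on-failure and judge-LLM certificate options could look like in the system configuration; the option names are illustrative assumptions, not confirmed keys.

```yaml
# Illustrative sketch; option names are assumptions, see the configuration docs.
evaluation:
  skip_on_failure: true                   # skip remaining evaluations in a conversation group after a failure
judge_llm:
  provider: openai
  model: gpt-4o-mini
  ca_cert_path: /etc/certs/corp-ca.pem    # custom CA certificate for the judge endpoint
```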
Pull Requests
- bump eval version to 0.3.0 by @asamal4 in #113
- docs: reorganize docs, add configuration docs by @VladimirKadlec in #111
- Configuration base url update by @yangcao77 in #110
- [LEADS-40]: Get statistics about the token usage for lightspeed-evaluation by @bsatapat-jpg in #112
- LEADS-160: Adding python 3.13 compatibility by @bsatapat-jpg in #115
- add additional fields to output for non-error scenarios by @asamal4 in #114
- remove dynamic all by @asamal4 in #116
- make agents.md more concise by @asamal4 in #117
- add bandit to make target by @asamal4 in #118
- chore: refactor processor & errors.py by @asamal4 in #119
- [LEADS-119] code scanning found multiple security problems by @bsatapat-jpg in #122
- Skip rest of the eval for a metric failure within a conversation group by @asamal4 in #121
- Leads 44 certificates for judge llm by @xmican10 in #120
- [LEADS-140] lightspeed-evaluation has dependency on torch and nvidia* packages that are not required for all usecases by @bsatapat-jpg in #123
- doc: note for rhaiis, models.corp judgellm by @asamal4 in #124
- chore: update docs/key features by @asamal4 in #125
- doc: Add troubleshooting for known issues by @asamal4 in #126
New Contributors
- @yangcao77 made their first contribution in #110
- @xmican10 made their first contribution in #120
Full Changelog: v0.2.0...v0.3.0
LightSpeed Evaluation v0.2.0
What's Changed
- bump lightspeed evaluation version by @asamal4 in #78
- LCORE-723: Added statistical comparison between two evaluation result files by @bsatapat-jpg in #74
- remove unused LightspeedStackClient module by @asamal4 in #81
- add agents.md by @asamal4 in #82
- LCORE-417 Convert unittest mocking to pytest mocking by @max-svistunov in #84
- Concurrent eval by @VladimirKadlec in #85
- LCORE-834: Added script to run evaluation across multiple providers and models by @bsatapat-jpg in #83
- add .caches/ folder to gitignore by @asamal4 in #87
- LCORE-899: created the evaluation methodology by @bsatapat-jpg in #88
- remove archived OLS eval tool by @asamal4 in #86
- add CLAUDE.md by @asamal4 in #89
- add agent-eval deprecation note by @asamal4 in #91
- LCORE-900: Added the parallel execution for multi-modal evaluation in… by @bsatapat-jpg in #92
- Ability to set alternate tool calls for eval by @asamal4 in #90
- LCORE-748: Added unit test cases coverage for the evaluation framework by @bsatapat-jpg in #95
- LEADS-113: Added support for gemini embedding models by @bsatapat-jpg in #99
- LEADS-2: Fix Path Object Serialization in Amended YAML Files by @bsatapat-jpg in #100
- handle no tool call alternative by @asamal4 in #101
- LCORE-916: configuration for CodeRabbitAI by @tisnik in #103
- GEval Integration by @arin-deloatch in #97
- Add keyword eval metric by @asamal4 in #93
- fix: run turn evaluation immediately after api call by @asamal4 in #105
- LCORE-664: Section about AI tools by @tisnik in #107
- LCORE-974: fixed issues found by Pyright by @tisnik in #108
- LEADS-8: Lazy imports for eval tool by @bsatapat-jpg in #106
- add support for fail_on_invalid_data option by @VladimirKadlec in #94
- LEADS-26: Increased Unit test cases coverage by @bsatapat-jpg in #109
New Contributors
- @max-svistunov made their first contribution in #84
- @arin-deloatch made their first contribution in #97
Full Changelog: v0.1.0...v0.2.0
LightSpeed Evaluation v0.1.0
What's Changed
- initial copy of OLS eval by @asamal4 in #1
- merge ols and road-core, first working version by @VladimirKadlec in #2
- delete old scripts/evaluation, add README by @VladimirKadlec in #3
- add evaluation datasets by @VladimirKadlec in #4
- LCORE-162: Setup all CI all linters/checkers by @matysek in #5
- Add some type hints into rag_eval.py by @tisnik in #6
- Fixed docstrings by @tisnik in #7
- Added type hints for functions without return value by @tisnik in #8
- LCORE-276: Pin HTTPX version for now by @tisnik in #9
- add generate answers tool by @VladimirKadlec in #10
- Update dependencies by @tisnik in #12
- Fix error: missing argument by @tisnik in #13
- Check provider models by @tisnik in #14
- fix readme reference post migration by @asamal4 in #11
- LCORE: 210 Added Contribution Guide by @jrobertboos in #15
- fix empty question, change retry strategy by @VladimirKadlec in #17
- fix few lint issues by @asamal4 in #18
- feat: add agent e2e eval by @asamal4 in #19
- agent eval: verbose print and fixes by @asamal4 in #20
- temp-fix: fix/suppress pyright issues by @asamal4 in #21
- agent eval: multi-turn & refactoring by @asamal4 in #22
- agent-eval: py version by @asamal4 in #23
- Agent eval: add tool call comparison by @asamal4 in #24
- update dependencies by @VladimirKadlec in #25
- fix: streaming error handling by @asamal4 in #26
- Generic eval tool by @asamal4 in #28
- fix runner by @asamal4 in #31
- use uv instead of pdm by @Anxhela21 in #30
- Fix Bandit checker on CI by @tisnik in #32
- archive old eval and make lsc eval as primary by @asamal4 in #35
- switch to regex check for tool arg value by @asamal4 in #41
- docs: Add input data to generate answers documentation by @are-ces in #36
- fix rule for black & pydocstyle by @asamal4 in #45
- Added Unit test cases as well as integration test cases by @bsatapat-jpg in #42
- Add client for query endpoint by @Anxhela21 in #43
- Feature: Add response_eval:intent evaluation type for LLM response intent assessment by @ItzikEzra-rh in #46
- API integration & refactoring by @asamal4 in #47
- [nit] Clean up evaluation_data.yaml by @lpiwowar in #52
- fix: use uv pip instead of pip by @are-ces in #50
- [LCORE-646] Disable default tracking in RAGAS by @lpiwowar in #49
- [LCORE-648] Fix processing of `float('NaN')` values when OutputParserException by @lpiwowar in #48
- allow none llm for LS API by @asamal4 in #53
- feat: Added parallelism for answer generation by @are-ces in #39
- update readme by @asamal4 in #54
- fix: propagate arg output dir by @asamal4 in #57
- Turn metric override by @asamal4 in #55
- feat: add support for custom embedding model by @VladimirKadlec in #56
- keep original input file intact by @asamal4 in #59
- docs: add links to metrics docs by @VladimirKadlec in #60
- Retrieved RAG context from lightspeed-stack API by @bsatapat-jpg in #58
- Setting the execution bit only if it's not set by @andrej1991 in #61
- provider vertex support for judge llm by @andrej1991 in #29
- update tool call property by @asamal4 in #64
- add vertex to main eval & refactor by @asamal4 in #63
- Env setup/cleanup ability and verify through script by @asamal4 in #62
- add example & check for vLLM hosted inference server by @asamal4 in #66
- fix sample data by @asamal4 in #69
- use absolute imports by @asamal4 in #68
- fix: propagate api error message by @asamal4 in #72
- add common custom llm by @asamal4 in #70
- LCORE-723: Compute correct confidence interval by @bsatapat-jpg in #71
- Simplify custom prompt handling & re-organize by @asamal4 in #73
- add support for caching llm and api responses by @VladimirKadlec in #75
- standardize file name as per framework name in metric by @asamal4 in #76
- add intent eval by @asamal4 in #77
New Contributors
- @asamal4 made their first contribution in #1
- @VladimirKadlec made their first contribution in #2
- @matysek made their first contribution in #5
- @tisnik made their first contribution in #6
- @jrobertboos made their first contribution in #15
- @Anxhela21 made their first contribution in #30
- @are-ces made their first contribution in #36
- @bsatapat-jpg made their first contribution in #42
- @ItzikEzra-rh made their first contribution in #46
- @lpiwowar made their first contribution in #52
- @andrej1991 made their first contribution in #61
Full Changelog: https://github.com/lightspeed-core/lightspeed-evaluation/commits/v0.1.0