Releases · lightspeed-core/lightspeed-evaluation
LightSpeed Evaluation v0.6.0
Key Changes
- Panel of Judges: Use multiple LLMs as judges with configurable aggregation strategies (max, average, majority_vote)
- LLM Pool & Dynamic Parameters: Centralized configuration for multiple LLMs with flexible parameter management and inheritance
- Token Tracking Improvement: Per-judge and conversation-level token usage tracking
- MCP Server Headers: Support for MCP Server specific headers with backward compatibility
- Bug Fixes: Resolved DeepEval cache issues and metric metadata override behavior
- Web Dashboard: Interactive visualization for exploring and comparing evaluation results (PoC - No Official Support)
Maintenance:
- Package/Dependency Upgrades: Version bumps for dependencies (key change: RAGAS 0.4.0)
- Requirements Files: requirements*.txt files with pinned versions
- CPU-Only PyTorch Support: Reduced package size for local embedding model usage
Deprecation Notice:
- The single `llm:` configuration is deprecated. Migrate to the `llm_pool` + `judge_panel` configuration (illustrative sketch below). See docs/configuration.md for details.
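A minimal migration sketch is shown below. The key names and values (judge ids, provider/model fields, the `aggregation` key) are illustrative assumptions rather than the tool's confirmed schema; docs/configuration.md is the authoritative reference. The aggregation strategies listed come from the release notes above.

```yaml
# Deprecated style: a single judge LLM.
# llm:
#   provider: openai
#   model: gpt-4o-mini

# Replacement sketch (key names are assumptions; see docs/configuration.md):
llm_pool:
  - id: judge_a
    provider: openai
    model: gpt-4o-mini
    temperature: 0.0            # per-judge parameters can be set or inherited here
  - id: judge_b
    provider: watsonx
    model: granite-3-8b-instruct

judge_panel:
  judges: [judge_a, judge_b]    # panel members referenced from llm_pool
  aggregation: majority_vote    # release notes mention max, average, majority_vote
```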
Pull Requests
- bump up eval tool version v0.6.0 by @asamal4 in #182
- [LEADS-218] Improve partial tool match reporting with detailed extra tool information by @bsatapat-jpg in #181
- chore: refactor e2e tests by @VladimirKadlec in #183
- [LEADS-199] Add CPU-only PyTorch support to reduce package size by @bsatapat-jpg in #185
- add e2e lcore Makefile target by @VladimirKadlec in #186
- [LEADS-192] feat: integrate panel of judges by @asamal4 in #184
- [LEADS-276] chore: consistent config exception for config data model val by @asamal4 in #187
- [LEADS-253] Deepevals cache miss fixed by @xmican10 in #188
- chore: update docs for judge panel feature by @asamal4 in #191
- feat: add web dashboard by @rioloc in #172
- [LEADS-267] fix: metric metadata override only changes given key by @asamal4 in #192
- [LEADS-292] chore: Add Claude skill for PR review by @asamal4 in #195
- [LEADS-25] Support new MCP header by @Anxhela21 in #175
- [LEADS-232] Bump up dependencies for Python 3.11, 3.12, 3.13 support by @bsatapat-jpg in #189
- fix: set mcp header to false in example config by @asamal4 in #196
- [LEADS-252] Conversation-level metrics API token usage logging in CSV by @xmican10 in #194
- [LEADS-195] feat: Add majority voting & avg to panel aggregation by @asamal4 in #198
- [LEADS-256] SQL Storage Backend Foundation by @bsatapat-jpg in #197
- [LEADS-211] feat: ability to add/remove model parameters from config by @asamal4 in #200
- Add uv.lock and requirements.txt for lsc_agent_eval by @eranco74 in #201
- [LEADS-232] Embedding caching handling with ragas 0.4 by @bsatapat-jpg in #199
- chore: update sample system yaml with new llm pool and judge panel by @asamal4 in #204
- chore: pin upper bound in pyproject, add req text files & add make targets by @asamal4 in #203
- chore: readme - Add missing new features & cleanup by @asamal4 in #207
New Contributors
Full Changelog: v0.5.0...v0.6.0
LightSpeed Evaluation v0.5.0
Key Changes
- Programmatic Integration: Use Lightspeed Evaluation as a Python library in your applications
- Tool Results Evaluation: Evaluate tool execution results via regex matching
- User-Defined Metrics with Rubrics: Support for rubrics in custom GEval metrics (illustrative sketch after this list)
- API Retry Mechanism: Automatic retries for API calls on rate-limit errors
- Token Tracking Fixes: Accurate token counting for JudgeLLM in multi-threaded evaluations and cached re-runs
- SSL Certificate Verification: Added missing SSL verification for the DeepEval integration
- Output Fixes: Added missing metric_metadata values in the CSV report
- Error Handling Improvements: Skip a metric when API-populated fields are missing and set its status to Error
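As a rough illustration of the rubrics feature, a user-defined GEval metric might be described along the following lines. Every key name and value here is a hypothetical placeholder, not the documented schema; see the project docs for the actual format.

```yaml
# Hypothetical sketch of a user-defined GEval metric with rubrics.
# Field names are illustrative assumptions, not the documented schema.
metrics:
  - name: custom_answer_quality
    type: geval
    criteria: "Is the response technically accurate and actionable?"
    rubrics:
      - score: 0
        description: "Incorrect or irrelevant response."
      - score: 5
        description: "Partially correct but misses key steps."
      - score: 10
        description: "Accurate, complete, and actionable."
```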
Pull Requests
- bump up eval tool version by @asamal4 in #152
- Add ability to evaluate the tool results by @bparees in #151
- Parse new LCORE tool_call format by @bparees in #150
- remove unittest mock from streaming parser tests by @asamal4 in #154
- Include toolcall result data in output by @bparees in #155
- Add LiteLLM drop_params Support for Multi-Provider Compatibility by @arin-deloatch in #153
- Add `setup_ssl_verify()` to `DeepEvalLLMManager` by @arin-deloatch in #158
- [LEADS-217] Fix token count preservation on evaluation errors by @bsatapat-jpg in #157
- add llm pool & judge panel config by @asamal4 in #156
- [LEADS-236] chore: add exact folders to docstyle make target by @asamal4 in #162
- [LEADS-235] chore: add repo approvers by @asamal4 in #161
- [LEADS-230] fix: missing metric_metadata value in csv by @asamal4 in #163
- [LEADS-191] feat: handle judge panel config in manager by @asamal4 in #164
- [LEADS-208] Fix TokenTracker double-counting in multi-thread evaluation by @bsatapat-jpg in #159
- chore: provide exact directories for black/ruff check by @asamal4 in #167
- [LEADS-171] feat: add rubrics support for user-defined GEval metrics by @asamal4 in #166
- [LEADS-242] chore: add data model for user-defined metrics by @asamal4 in #168
- Replace identical data test with precise delta validation test by @asamal4 in #171
- [LEADS-231] fix: set result status as Error when API populated field is missing by @asamal4 in #169
- Leads 233 retry api call in case of error by @xmican10 in #173
- feat: add e2e test with LSC api by @VladimirKadlec in #170
- LEADS-240: Token usage should be 0 for a re-run with successful cache by @xmican10 in #176
- [LEADS-226] chore: enforce pytest usage instead of unittest by @asamal4 in #178
- feat: add programmatic API for library integration by @narmaku in #177
- fix: add back pull_request_target trigger by @VladimirKadlec in #179
- [LEADS-241] judge llm token counter for deepeval by @xmican10 in #180
New Contributors
Full Changelog: v0.4.0...v0.5.0
LightSpeed Evaluation v0.4.0
What's Changed
Key Changes
- Flexible Tool Evaluation: Configurable ordered/unordered and full/partial match modes for tool call validation (sketch after this list)
- Classical Evaluation Metrics: Support for traditional evaluation metrics (BLEU, ROUGE, distance metrics)
- Alternate Expected Response: Ability to set alternate ground-truth responses for static evaluation metrics
- Eval Configuration Tracking: Evaluation configuration details now included in generated reports for better reproducibility
- API Latency Metrics: Latency tracking and reporting for API performance analysis (for API streaming endpoint)
- Data Grouping: Tag-based grouping of evaluation conversations for better organization
- Data Filtering: Filter evaluation datasets by tags and conversation IDs (CLI arguments) for targeted testing
- Cache Warmup: New optional CLI argument to pre-warm (clear) caches before evaluation runs
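A hedged sketch of how the ordered/unordered and full/partial tool-matching modes might be expressed in system.yaml; the section and key names below are assumptions for illustration (a `full_match` flag is mentioned in #145, ordered matching in #136).

```yaml
# Illustrative only: key names are assumptions, not the confirmed system.yaml schema.
tool_eval:
  ordered: false     # unordered mode: expected tool calls may appear in any sequence
  full_match: false  # partial mode: expected calls only need to be a subset of the actual calls
```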
Pull Requests
- bump eval to v0.4.0 by @asamal4 in #128
- fix: azure env variable names for judgeLLM by @asamal4 in #129
- [LEADS-141] Add Latency Metrics to Evaluation Reports by @bsatapat-jpg in #127
- chore: consolidate test_data models by @asamal4 in #131
- chore: refactor generator & statistics module by @asamal4 in #132
- Add optional property tag to group eval conversations by @asamal4 in #134
- add git hooks by @VladimirKadlec in #133
- [LEADS-172] Support classical evaluation metrics by @bsatapat-jpg in #130
- fix: align docs for updated make targets by @asamal4 in #135
- [LEADS-153] Adding the ordered matching logic in tool eval by @bsatapat-jpg in #136
- [LEADS-153] Implement match logic (full/partial) by @bsatapat-jpg in #137
- Remove duplicate data validation in pipeline by @asamal4 in #141
- chore: refactor evaluation runner by @asamal4 in #140
- feat: add data filter by tags & conv_ids by @asamal4 in #143
- [LEADS-153] Wiring the configuration and adding the config in system.yaml by @bsatapat-jpg in #139
- [LEADS-182] - Add eval config data to the report by @arin-deloatch in #142
- Leads 6 set expected responses by @xmican10 in #138
- map max_tokens to max_completion_tokens internally by @asamal4 in #144
- fix: Do subset matching for full_match=false by @saswatamcode in #145
- Enhance test quality by @xmican10 in #146
- use .model_dump instead of .dict by @asamal4 in #147
- add cache-warmup flag by @VladimirKadlec in #149
- Leads 212 remove unittest mocking by @xmican10 in #148
New Contributors
- @saswatamcode made their first contribution in #145
Full Changelog: v0.3.0...v0.4.0
LightSpeed Evaluation v0.3.0
What's Changed
Key Changes
- Token Usage Statistics: Track and report token consumption during evaluations (both API and JudgeLLM usage)
- Certificate Support for JudgeLLM: Configure custom certificates when connecting to Judge LLM endpoints
- Skip on Failure: Optional config to skip the remaining evaluations in a conversation group when any evaluation criterion fails (sketch after this list)
- Optional Packages: torch and nvidia-* packages are now optional, significantly reducing install size for use cases that don't require them
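A rough sketch of what the skip-on-failure and judge-LLM certificate options could look like in the system configuration; the option names are illustrative assumptions, not confirmed keys.

```yaml
# Illustrative sketch; option names are assumptions, see the configuration docs.
evaluation:
  skip_on_failure: true                   # skip remaining evaluations in a conversation group after a failure
judge_llm:
  provider: openai
  model: gpt-4o-mini
  ca_cert_path: /etc/certs/corp-ca.pem    # custom CA certificate for the judge endpoint
```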
Pull Requests
- bump eval version to 0.3.0 by @asamal4 in #113
- docs: reorganize docs, add configuration docs by @VladimirKadlec in #111
- Configuration base url update by @yangcao77 in #110
- [LEADS-40]: Get statistics about the token usage for lightspeed-evaluation by @bsatapat-jpg in #112
- LEADS-160: Adding python 3.13 compatibility by @bsatapat-jpg in #115
- add additional fields to output for non-error scenarios by @asamal4 in #114
- remove dynamic all by @asamal4 in #116
- make agents.md more concise by @asamal4 in #117
- add bandit to make target by @asamal4 in #118
- chore: refactor processor & errors.py by @asamal4 in #119
- [LEADS-119] code scanning found multiple security problems by @bsatapat-jpg in #122
- Skip rest of the eval for a metric failure within a conversation group by @asamal4 in #121
- Leads 44 certificates for judge llm by @xmican10 in #120
- [LEADS-140] lightspeed-evaluation has dependency on torch and nvidia* packages that are not required for all usecases by @bsatapat-jpg in #123
- doc: note for rhaiis, models.corp judgellm by @asamal4 in #124
- chore: update docs/key features by @asamal4 in #125
- doc: Add troubleshooting for known issues by @asamal4 in #126
New Contributors
- @yangcao77 made their first contribution in #110
- @xmican10 made their first contribution in #120
Full Changelog: v0.2.0...v0.3.0
LightSpeed Evaluation v0.2.0
What's Changed
- bump lightspeed evaluation version by @asamal4 in #78
- LCORE-723: Added statistical comparison between two evaluation result files by @bsatapat-jpg in #74
- remove unused LightspeedStackClient module by @asamal4 in #81
- add agents.md by @asamal4 in #82
- LCORE-417 Convert unittest mocking to pytest mocking by @max-svistunov in #84
- Concurrent eval by @VladimirKadlec in #85
- LCORE-834: Added script to run evaluation across multiple providers and models by @bsatapat-jpg in #83
- add .caches/ folder to gitignore by @asamal4 in #87
- LCORE-899: created the evaluation methodology by @bsatapat-jpg in #88
- remove archived OLS eval tool by @asamal4 in #86
- add CLAUDE.md by @asamal4 in #89
- add agent-eval deprecation note by @asamal4 in #91
- LCORE-900: Added the parallel execution for multi-modal evaluation in… by @bsatapat-jpg in #92
- Ability to set alternate tool calls for eval by @asamal4 in #90
- LCORE-748: Added unit test cases coverage for the evaluation framework by @bsatapat-jpg in #95
- LEADS-113: Added support for gemini embedding models by @bsatapat-jpg in #99
- LEADS-2: Fix Path Object Serialization in Amended YAML Files by @bsatapat-jpg in #100
- handle no tool call alternative by @asamal4 in #101
- LCORE-916: configuration for CodeRabbitAI by @tisnik in #103
- GEval Integration by @arin-deloatch in #97
- Add keyword eval metric by @asamal4 in #93
- fix: run turn evaluation immediately after api call by @asamal4 in #105
- LCORE-664: Section about AI tools by @tisnik in #107
- LCORE-974: fixed issues found by Pyright by @tisnik in #108
- LEADS-8: Lazy imports for eval tool by @bsatapat-jpg in #106
- add support for fail_on_invalid_data option by @VladimirKadlec in #94
- LEADS-26: Increased Unit test cases coverage by @bsatapat-jpg in #109
New Contributors
- @max-svistunov made their first contribution in #84
- @arin-deloatch made their first contribution in #97
Full Changelog: v0.1.0...v0.2.0
LightSpeed Evaluation v0.1.0
What's Changed
- initial copy of OLS eval by @asamal4 in #1
- merge ols and road-core, first working version by @VladimirKadlec in #2
- delete old scripts/evaluation, add README by @VladimirKadlec in #3
- add evaluation datasets by @VladimirKadlec in #4
- LCORE-162: Setup all CI all linters/checkers by @matysek in #5
- Add some type hints into rag_eval.py by @tisnik in #6
- Fixed docstrings by @tisnik in #7
- Added type hints for functions without return value by @tisnik in #8
- LCORE-276: Pin HTTPX version for now by @tisnik in #9
- add generate answers tool by @VladimirKadlec in #10
- Update dependencies by @tisnik in #12
- Fix error: missing argument by @tisnik in #13
- Check provider models by @tisnik in #14
- fix readme reference post migration by @asamal4 in #11
- LCORE: 210 Added Contribution Guide by @jrobertboos in #15
- fix empty question, change retry strategy by @VladimirKadlec in #17
- fix few lint issues by @asamal4 in #18
- feat: add agent e2e eval by @asamal4 in #19
- agent eval: verbose print and fixes by @asamal4 in #20
- temp-fix: fix/suppress pyright issues by @asamal4 in #21
- agent eval: multi-turn & refactoring by @asamal4 in #22
- agent-eval: py version by @asamal4 in #23
- Agent eval: add tool call comparison by @asamal4 in #24
- update dependencies by @VladimirKadlec in #25
- fix: streaming error handling by @asamal4 in #26
- Generic eval tool by @asamal4 in #28
- fix runner by @asamal4 in #31
- use uv instead of pdm by @Anxhela21 in #30
- Fix Bandit checker on CI by @tisnik in #32
- archive old eval and make lsc eval as primary by @asamal4 in #35
- switch to regex check for tool arg value by @asamal4 in #41
- docs: Add input data to generate answers documentation by @are-ces in #36
- fix rule for black & pydocstyle by @asamal4 in #45
- Added Unit test cases as well as integration test cases by @bsatapat-jpg in #42
- Add client for query endpoint by @Anxhela21 in #43
- Feature: Add response_eval:intent evaluation type for LLM response intent assessment by @ItzikEzra-rh in #46
- API integration & refactoring by @asamal4 in #47
- [nit] Clean up evaluation_data.yaml by @lpiwowar in #52
- fix: use uv pip instead of pip by @are-ces in #50
- [LCORE-646] Disable default tracking in RAGAS by @lpiwowar in #49
- [LCORE-648] Fix processing of `float('NaN')` values when OutputParserException by @lpiwowar in #48
- allow none llm for LS API by @asamal4 in #53
- feat: Added parallelism for answer generation by @are-ces in #39
- update readme by @asamal4 in #54
- fix: propagate arg output dir by @asamal4 in #57
- Turn metric override by @asamal4 in #55
- feat: add support for custom embedding model by @VladimirKadlec in #56
- keep original input file intact by @asamal4 in #59
- docs: add links to metrics docs by @VladimirKadlec in #60
- Retrieved RAG context from lightspeed-stack API by @bsatapat-jpg in #58
- Setting the execution bit only if it's not set by @andrej1991 in #61
- provider vertex support for judge llm by @andrej1991 in #29
- update tool call property by @asamal4 in #64
- add vertex to main eval & refactor by @asamal4 in #63
- Env setup/cleanup ability and verify through script by @asamal4 in #62
- add example & check for vLLM hosted inference server by @asamal4 in #66
- fix sample data by @asamal4 in #69
- use absolute imports by @asamal4 in #68
- fix: propagate api error message by @asamal4 in #72
- add common custom llm by @asamal4 in #70
- LCORE-723: Compute correct confidence interval by @bsatapat-jpg in #71
- Simplify custom prompt handling & re-organize by @asamal4 in #73
- add support for caching llm and api responses by @VladimirKadlec in #75
- standardize file name as per framework name in metric by @asamal4 in #76
- add intent eval by @asamal4 in #77
New Contributors
- @asamal4 made their first contribution in #1
- @VladimirKadlec made their first contribution in #2
- @matysek made their first contribution in #5
- @tisnik made their first contribution in #6
- @jrobertboos made their first contribution in #15
- @Anxhela21 made their first contribution in #30
- @are-ces made their first contribution in #36
- @bsatapat-jpg made their first contribution in #42
- @ItzikEzra-rh made their first contribution in #46
- @lpiwowar made their first contribution in #52
- @andrej1991 made their first contribution in #61
Full Changelog: https://github.com/lightspeed-core/lightspeed-evaluation/commits/v0.1.0