Releases: lightspeed-core/lightspeed-evaluation

LightSpeed Evaluation v0.6.0

06 Apr 22:30
e439782

Key Changes

  • Panel of Judges: Use multiple LLMs as judges, with configurable aggregation strategies (max, average, majority_vote); see the sketch after this list
  • LLM Pool & Dynamic Parameters: Centralized configuration for multiple LLMs, with flexible parameter management and inheritance
  • Token Tracking Improvements: Per-judge and conversation-level token usage tracking
  • MCP Server Headers: Support for MCP Server-specific headers, with backward compatibility
  • Bug Fixes: Resolved DeepEval cache issues and metric metadata override behavior
  • Web Dashboard: Interactive visualization for exploring and comparing evaluation results (PoC; not officially supported)
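
The three aggregation strategies can be pictured as follows. This is a minimal illustrative sketch; the function name, signature, and the 0.5 pass threshold are assumptions, not the tool's actual API.

```python
# Illustrative sketch of the three panel aggregation strategies named
# above (max, average, majority_vote). The function name, signature,
# and the 0.5 pass threshold are assumptions, not the tool's API.
from statistics import mean


def aggregate(scores: list[float], strategy: str, threshold: float = 0.5) -> float:
    """Combine per-judge scores into a single panel score."""
    if strategy == "max":
        return max(scores)       # most lenient judge wins
    if strategy == "average":
        return mean(scores)      # smooth over judge disagreement
    if strategy == "majority_vote":
        # Each judge votes pass/fail against the threshold; the panel
        # passes only if more than half of the judges pass.
        passing = sum(score >= threshold for score in scores)
        return 1.0 if passing > len(scores) / 2 else 0.0
    raise ValueError(f"unknown aggregation strategy: {strategy}")


print(aggregate([0.9, 0.4, 0.7], "majority_vote"))  # 1.0: two of three judges pass
```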

Maintenance

  • Dependency Upgrades: Updated dependency versions (most notably RAGAS 0.4.0)
  • Requirements Files: Added requirements*.txt files with pinned versions
  • CPU-Only PyTorch Support: Reduced package size for local embedding model usage

Deprecation Notice

  • The single llm: configuration is deprecated. Migrate to the llm_pool + judge_panel configuration; see docs/configuration.md for details and the migration sketch below.
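
As a rough before/after picture of the migration: only the llm_pool and judge_panel keys and the aggregation strategy names come from these notes; every other field below is an assumed placeholder, and docs/configuration.md remains the authoritative schema.

```python
# Hypothetical migration sketch. Only llm_pool, judge_panel, and the
# aggregation strategy names are taken from the release notes; all
# other keys below are assumed placeholders for the real schema in
# docs/configuration.md.
import yaml  # requires PyYAML

deprecated = {
    "llm": {"provider": "openai", "model": "gpt-4o"},  # old single-LLM form
}

replacement = {
    "llm_pool": {  # central pool of LLM definitions (parameters can be inherited)
        "judge_a": {"provider": "openai", "model": "gpt-4o"},
        "judge_b": {"provider": "watsonx", "model": "granite-3-8b-instruct"},
    },
    "judge_panel": {  # which pool entries act as judges, and how scores combine
        "judges": ["judge_a", "judge_b"],
        "aggregation": "majority_vote",  # or "max" / "average"
    },
}

print(yaml.safe_dump(replacement, sort_keys=False))
```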

Pull Requests

  • bump up eval tool version v0.6.0 by @asamal4 in #182
  • [LEADS-218] Improve partial tool match reporting with detailed extra tool information by @bsatapat-jpg in #181
  • chore: refactor e2e tests by @VladimirKadlec in #183
  • [LEADS-199] Add CPU-only PyTorch support to reduce package size by @bsatapat-jpg in #185
  • add e2e lcore Makefile target by @VladimirKadlec in #186
  • [LEADS-192] feat: integrate panel of judges by @asamal4 in #184
  • [LEADS-276] chore: consistent config exception for config data model val by @asamal4 in #187
  • [LEADS-253] Deepevals cache miss fixed by @xmican10 in #188
  • chore: update docs for judge panel feature by @asamal4 in #191
  • feat: add web dashboard by @rioloc in #172
  • [LEADS-267] fix: metric metadata override only changes given key by @asamal4 in #192
  • [LEADS-292] chore: Add Claude skill for PR review by @asamal4 in #195
  • [LEADS-25] Support new MCP header by @Anxhela21 in #175
  • [LEADS-232] Bump up dependencies for Python 3.11, 3.12, 3.13 support by @bsatapat-jpg in #189
  • fix: set mcp header to false in example config by @asamal4 in #196
  • [LEADS-252] Conversation-level metrics API token usage logging in CSV by @xmican10 in #194
  • [LEADS-195] feat: Add majority voting & avg to panel aggregation by @asamal4 in #198
  • [LEADS-256] SQL Storage Backend Foundation by @bsatapat-jpg in #197
  • [LEADS-211] feat: ability to add/remove model parameters from config by @asamal4 in #200
  • Add uv.lock and requirements.txt for lsc_agent_eval by @eranco74 in #201
  • [LEADS-232] Embedding caching handling with ragas 0.4 by @bsatapat-jpg in #199
  • chore: update sample system yaml with new llm pool and judge panel by @asamal4 in #204
  • chore: pin upper bound in pyproject, add req text files & add make targets by @asamal4 in #203
  • chore: readme - Add missing new features & cleanup by @asamal4 in #207

Full Changelog: v0.5.0...v0.6.0

LightSpeed Evaluation v0.5.0

06 Mar 14:05
f6d5f8f

Key Changes

  • Programmatic Integration: Use Lightspeed Evaluation as a Python library in your applications; see the sketch after this list
  • Tool Results Evaluation: Evaluate tool execution results via regex matching
  • User-Defined Metrics with Rubrics: Rubric support for custom GEval metrics
  • API Retry Mechanism: Automatic retries for API calls on rate-limit errors
  • Token Tracking Fixes: Accurate token counting for JudgeLLM in multi-threaded evaluations and cached re-runs
  • SSL Certificate Verification: Added missing SSL verification to the DeepEval integration
  • Output Fixes: Added missing metric_metadata values to the CSV report
  • Error Handling Improvements: Skip a metric when API-populated fields are missing and set its status to Error
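
A purely illustrative sketch of what library-style use could look like. The class and method names below are hypothetical stand-ins; the real entry points were added in PR #177 and are described in the project docs.

```python
# Hypothetical sketch only: EvaluationRunner, its constructor
# arguments, and run() are stand-in names, not the documented API
# introduced in PR #177.
from lightspeed_evaluation import EvaluationRunner  # hypothetical import

runner = EvaluationRunner(
    system_config="system.yaml",             # LLM / judge configuration
    evaluation_data="evaluation_data.yaml",  # conversations to evaluate
)
for result in runner.run():
    print(result.metric, result.score, result.status)
```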

Pull Requests

  • bump up eval tool version by @asamal4 in #152
  • Add ability to evaluate the tool results by @bparees in #151
  • Parse new LCORE tool_call format by @bparees in #150
  • remove unittest mock from streaming parser tests by @asamal4 in #154
  • Include toolcall result data in output by @bparees in #155
  • Add LiteLLM drop_params Support for Multi-Provider Compatibility by @arin-deloatch in #153
  • Add setup_ssl_verify() to DeepEvalLLMManager by @arin-deloatch in #158
  • [LEADS-217] Fix token count preservation on evaluation errors by @bsatapat-jpg in #157
  • add llm pool & judge panel config by @asamal4 in #156
  • [LEADS-236] chore: add exact folders to docstyle make target by @asamal4 in #162
  • [LEADS-235] chore: add repo approvers by @asamal4 in #161
  • [LEADS-230] fix: missing metric_metadata value in csv by @asamal4 in #163
  • [LEADS-191] feat: handle judge panel config in manager by @asamal4 in #164
  • [LEADS-208] Fix TokenTracker double-counting in multi-thread evaluation by @bsatapat-jpg in #159
  • chore: provide exact directories for black/ruff check by @asamal4 in #167
  • [LEADS-171] feat: add rubrics support for user-defined GEval metrics by @asamal4 in #166
  • [LEADS-242] chore: add data model for user-defined metrics by @asamal4 in #168
  • Replace identical data test with precise delta validation test by @asamal4 in #171
  • [LEADS-231] fix: set result status as Error when API populated field is missing by @asamal4 in #169
  • Leads 233 retry api call incase of error by @xmican10 in #173
  • feat: add e2e test with LSC api by @VladimirKadlec in #170
  • LEADS-240: Token usage should be 0 for a re-run with successful cache by @xmican10 in #176
  • [LEADS-226] chore: enforce pytest usage instead of unittest by @asamal4 in #178
  • feat: add programmatic API for library integration by @narmaku in #177
  • fix: add back pull_request_target trigger by @VladimirKadlec in #179
  • [LEADS-241] judge llm token counter for deepeval by @xmican10 in #180

Full Changelog: v0.4.0...v0.5.0

LightSpeed Evaluation v0.4.0

03 Feb 10:16
89b8ea7

Key Changes

  • Flexible Tool Evaluation: Configurable ordered/unordered and full/partial match modes for tool call validation; see the sketch after this list
  • Classical Evaluation Metrics: Support for traditional evaluation metrics (bleu, rouge, distance metrics)
  • Alternate Expected Responses: Ability to set alternate ground-truth responses for static evaluation metrics
  • Eval Configuration Tracking: Evaluation configuration details are now included in generated reports for better reproducibility
  • API Latency Metrics: Latency tracking and reporting for API performance analysis (streaming endpoint)
  • Data Grouping: Tag-based grouping of evaluation conversations for better organization
  • Data Filtering: Filter evaluation datasets by tags and conversation IDs (CLI arguments) for targeted testing
  • Cache Warmup: New optional CLI argument to clear (pre-warm) caches before evaluation runs
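
The four match modes combine two independent choices: whether expected tool calls must appear in order, and whether extra calls are tolerated. A minimal sketch of the idea, comparing tool calls by name only; this simplification is mine, not the tool's actual validator.

```python
# Illustrative sketch of the ordered/unordered x full/partial match
# modes, comparing tool calls by name only; not the tool's validator.
from collections import Counter


def tools_match(expected: list[str], actual: list[str],
                ordered: bool, partial: bool) -> bool:
    if ordered and partial:
        # Expected calls must appear in order; extra calls are allowed.
        it = iter(actual)
        return all(name in it for name in expected)  # subsequence test
    if ordered:
        return expected == actual                    # exact sequence
    if partial:
        # Unordered: every expected call occurs at least as often.
        need, have = Counter(expected), Counter(actual)
        return all(have[name] >= count for name, count in need.items())
    return Counter(expected) == Counter(actual)      # same multiset


print(tools_match(["search", "fetch"], ["search", "log", "fetch"],
                  ordered=True, partial=True))    # True: order kept, extras OK
print(tools_match(["search", "fetch"], ["fetch", "search"],
                  ordered=False, partial=False))  # True: same calls, any order
```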

Full Changelog: v0.3.0...v0.4.0

LightSpeed Evaluation v0.3.0

30 Dec 18:28
0f8df44

Key Changes

  • Token Usage Statistics: Track and report token consumption during evaluations (both API and JudgeLLM usage)
  • Certificate Support for JudgeLLM: Configure custom certificates when connecting to Judge LLM endpoints
  • Skip on Failure: Optional config to skip the remaining evaluations in a conversation group when any evaluation criterion fails; see the sketch after this list
  • Optional Packages: torch and nvidia-* packages are now optional, significantly reducing install size for use cases that don't require them
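
Conceptually, skip-on-failure short-circuits a conversation group as below. This is an illustrative sketch; the function and result labels are assumptions, not the tool's output format.

```python
# Illustrative sketch of skip-on-failure semantics: once any evaluation
# in a conversation group fails, the rest of the group is skipped.
# Function and label names are assumptions, not the tool's output.
from typing import Callable


def run_group(evaluations: list[Callable[[], bool]],
              skip_on_failure: bool) -> list[str]:
    results, group_failed = [], False
    for evaluate in evaluations:
        if group_failed and skip_on_failure:
            results.append("SKIPPED")   # remaining checks never run
            continue
        passed = evaluate()             # run one evaluation criterion
        results.append("PASS" if passed else "FAIL")
        group_failed = group_failed or not passed
    return results


print(run_group([lambda: True, lambda: False, lambda: True],
                skip_on_failure=True))  # ['PASS', 'FAIL', 'SKIPPED']
```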

Full Changelog: v0.2.0...v0.3.0

LightSpeed Evaluation v0.2.0

02 Dec 14:00
7665def

Full Changelog: v0.1.0...v0.2.0

LightSpeed Evaluation v0.1.0

10 Oct 15:12
f92850a

Full Changelog: https://github.com/lightspeed-core/lightspeed-evaluation/commits/v0.1.0