Skip to content

Latest commit

 

History

History
607 lines (413 loc) · 48.8 KB

File metadata and controls

607 lines (413 loc) · 48.8 KB

Changelog

0.10.4 (2026-06-10)

Patch release: the ruby_llm_contract:optimize rake task now auto-loads in Rails apps. No behaviour change to the task itself.

Fixed

  • ruby_llm_contract:optimize no longer requires manual require "ruby_llm/contract/rake_task" in Rails apps. Pre-0.10.4 the docs claimed the task was "included" with RubyLLM::Contract::RakeTask, but the railtie did not load the file — adopters running bin/rails ruby_llm_contract:optimize got Unrecognized command until they added the require to Rakefile or lib/tasks/*.rake. The railtie now uses the standard rake_tasks { require "..." } idiom, lazy-loading the file only when rake is invoked (no boot cost).
  • Docs corrected in docs/guide/optimizing_retry_policy.md — the "rake task is included" line now explicitly states it auto-loads on 0.10.4+ in Rails, and that non-Rails / older-Rails setups still need the explicit require.

0.10.3 (2026-06-10)

Hot-fix release: the schema validator now correctly accepts nil on required-but-nullable fields. This unblocks OpenAI structured-output strict mode, where every property has to be in required and "nullable" is expressed as a null branch in anyOf/oneOf or as an array type — exactly the combination the prior validator wrongly rejected.

Fixed

  • SchemaValidator no longer rejects legal nil on required-nullable fields. Pre-0.10.3 (every published version 0.2.x–0.10.2) conflated "required" (must be present) with "non-nullable" (cannot be null) — JSON Schema treats those as orthogonal. The validator now honours all three idioms for nullability:

    • type: ["string", "null"] (array form)
    • type: "null" (degenerate scalar form)
    • anyOf: [{type: "string"}, {type: "null"}] and oneOf equivalents

    Adopter impact: if you bypassed the bug by setting required: false + disabling OpenAI strict: true on every nullable field, you can now restore strict: true and keep the field required (= OpenAI's standard nullable idiom). Non-nullable required fields still reject nil exactly as before; this is a strictly additive fix.

    Discovered via dogfooding in a production adopter using OpenAI strict structured output with 17 nullable fields ("set the rest to null" prompt). Smoking-gun mutation: remove nullable_schema? short-circuit from validate_nil_field — the three new spec cases in spec/ruby_llm/contract/contract/schema_validator_spec.rb fail.

0.10.2 (2026-06-10)

Patch release: ship the docs/guide/ directory inside the gem so adopters and LLM integration agents can read the manuals locally (via bundle show ruby_llm-contract or gem unpack) without an internet round-trip to GitHub. No code behavior change.

Added

  • docs/guide/* is now packaged with the gem (14 files, ~120 KB). Previously the README's "See also" links pointed at docs/guide/getting_started.md, docs/guide/optimizing_retry_policy.md, etc., but those files were stripped from the published gem - LLM integration agents (Cursor, Claude Code, Copilot) reported "no documentation in the gem" because the links 404'd locally. docs/ideas/ and doc/decisions/ remain excluded.

Fixed

  • models: keyword form documented + pinned for hash configs. retry_policy models: ["gpt-5-nano", { model: "gpt-5-mini", reasoning_effort: "high" }] is now covered by a spec and shown in getting_started.md. The block form (escalate(...)) and the keyword form share the same @configs storage; both forms accept config hashes.
  • reasoning_effort examples corrected to gpt-5 family. Pre-0.10.2 docs and specs paired reasoning_effort with gpt-4.1-* model names, which is incorrect: gpt-4.1 is not a reasoning model. Updated across docs/guide/getting_started.md, docs/guide/optimizing_retry_policy.md, CHANGELOG.md, and 7 spec files. Non-reasoning gpt-4.1 examples (model fallback chains without reasoning_effort) are unchanged.
  • Version mentions corrected in README and multimodal guide. README FAQ ("Upgraded to 0.9.0 - why?") now reads "Upgraded to 0.10.0 from 0.8.x"; docs/guide/multimodal_input.md references 0.10.0+ and 0.10.x instead of 0.9.0. 0.9.0 and 0.9.1 were tagged but never published to rubygems; adopters jump from 0.8.0 directly to 0.10.x.

0.10.1 (2026-06-01)

Patch release fixing gem packaging. 0.10.0 was yanked from rubygems.org due to the issue documented below; 0.10.1 is the recommended upgrade target. No code behavior change vs 0.10.0.

Fixed

  • Gem no longer ships internal tracker / dev configs. Excluded from spec.files: TODO.md, .rspec, .rubycritic.yml, .simplecov, and the .revive/ directory. Pre-0.10.1 the published gem contained these files; adopters who already extracted 0.10.0 can safely delete them.

0.10.0 (2026-06-01)

First published release since 0.8.0. Consolidates work originally tagged as 0.9.0 (multimodal input) and 0.9.1 (internal quality refactor), neither of which was pushed to rubygems. Adopters upgrading from 0.8.0 should read the Behavioural change and Breaking changes sections below before installing.

Breaking changes

  • validate(description, &block) and Definition#invariant(description, &block) now raise ArgumentError when description is nil or empty. Pre-0.10.0 the empty descriptor was silently accepted and produced "" entries in result.validation_errors, making debugging impossible. Codex audit found zero production use sites across lib/, examples/, README — only the regression-marker test certifying the bug.

Migration

Ensure every validate / invariant call has a non-empty descriptor (this is already how every README example writes them):

# Before (silently accepted, produced "" in validation_errors):
validate("") { |o| o[:score].between?(0, 100) }

# After (required):
validate("score in range 0-100") { |o| o[:score].between?(0, 100) }

Added

  • Multimodal input via context: { attachment: ... } — pass a file/IO/URL through Step.run(input, context: { attachment: path }); the adapter forwards it to RubyLLM::Chat#ask(content, with: attachment). RubyLLM normalises wire format per provider (Anthropic url/base64, OpenAI image_url/file, Gemini inline_data). Multi-attachment supported natively (with: [pdf1, pdf2] or with: { images: [...], pdfs: [...] }). See multimodal input guide and ADR-0022.
  • attachment_token_estimate(n) class macro — adopter-declared conservative estimate of attachment input tokens. Applied to BOTH runtime (limit_checker) and pre-flight (estimate_cost) — same source of truth, no estimate/runtime drift.
  • on_unknown_attachment_size(:refuse | :warn) class macro — mirrors on_unknown_pricing opt-out semantics. Defaults to :refuse. Never settable as global default — same invariant as max_cost fail-closed.

Behavioural change (READ BEFORE UPGRADING)

  • Contracts with max_cost or max_input AND context[:attachment] set AND no attachment_token_estimate declared → REFUSE with :limit_exceeded. This is fail-closed semantics: the gem cannot bound vision/PDF token cost without an adopter-declared estimate. Opt out per-step with on_unknown_attachment_size :warn. Text-only contracts and contracts without max_cost/max_input are unaffected.

Changed

  • run_eval (no args) return shape pinned to Hash<String, Report> keyed by eval name. Documents the existing contract used by RubyLLM::Contract::RakeTask#collect_host_reports and adopters. No runtime change vs 0.8.0 — only the spec assertion now locks the shape.
  • Parser.parse(text, strategy: :json) first-bracket-wins boundary documented. Extraction commits to the first balanced { or [ structure and does NOT retry on later candidates. Empty {} followed by real JSON parses as the empty Hash; non-JSON {braces} before real JSON raises ParseError. No runtime change — this codifies long-standing behavior with explicit boundary tests.

Fixed

  • with_retry_disabled no longer mutates the step class's singleton method. The optimizer now passes retry_policy_override: nil through context: to compare_models, which Step::Base#runtime_settings already honours. Removes a concurrency hazard where two parallel optimize_retry_policy calls on the same step would race on the singleton restore in ensure.
  • CostCalculator.find_model exposed as a public class method. Removes two CostCalculator.send(:find_model, ...) workarounds in Step::Base#estimate_cost. The estimated_cost_for helper is gone — estimate_cost now routes through the existing public CostCalculator.calculate(model_name:, usage:).
  • stub_step unified on a single storage path. Both block and non-block forms now write to RubyLLM::Contract.step_adapter_overrides (thread-local). The around(:each) hook in rspec.rb handles cleanup between examples. Removes the prior allow(step).to receive(:run) branch.

Internal

  • Anti-facade audit complete: 89/89 spec files under per-test 17-mode walk (Phase A: 26 specs, Phase C: 63 specs via parallel Codex fan-out). Net +30 strengthened tests against mutation-blind assertions, zero public API change beyond the breaking entry above.
  • Dead ObjectSpace.each_object(Class) fallback removed in concerns/eval_host.rb#register_subclasses. The gemspec requires Ruby >= 3.2.0, so Class#subclasses (Ruby ≥ 3.1) is always available; the legacy fallback was unreachable code that would have iterated all loaded classes O(n) and was not thread-safe.

Deferred (not in 0.10.x)

  • add_history multi-turn replay of prior attachments — single-turn multimodal supported; follow-up questions on the same document deferred to a later release.
  • Streaming + attachment — contract steps remain synchronous.
  • Provider-specific attachment size caps — surface only via attachment_token_estimate calibration; consult provider docs.

Tests

  • Suite: 1401 examples / 0 failures / 7 pending (was 1346/0/8 at 0.8.0).

0.8.0 (2026-04-26)

Narrative repositioning + small API additions. Internal architecture unchanged: no Step::Base refactor, no breaking changes to existing DSL.

Added

  • thinking(effort:, budget:) class macro on Step::Base — mirrors RubyLLM::Agent.thinking signature exactly. Stored as { effort:, budget: } hash; reader returns the hash; supports :default reset semantics; superclass inheritance like model/temperature. The convenience alias reasoning_effort(:low) is implemented as thinking(effort: :low) — single normalized state, not separate ivar.
  • Adapter wiring for with_thinking — when thinking is set on the Step class, OR when reasoning_effort: is passed through context, OR when an attempt config in retry_policy escalate(...) carries reasoning_effort:, the RubyLLM adapter resolves the effective { effort:, budget: } hash and forwards it via chat.with_thinking(**) — provider-agnostic (supports OpenAI reasoning_effort AND Anthropic extended-thinking budget). Precedence: per-attempt / context reasoning_effort overrides class-level thinking[:effort]; budget is taken from class-level thinking[:budget]. Behavioural change vs 0.7.x: reasoning_effort is now forwarded via with_thinking instead of with_params. Same wire-level OpenAI parameter; provider-agnostic Anthropic support is now automatic.

Dependencies

  • ruby_llm constraint bumped from ~> 1.0 to ~> 1.12Chat#with_thinking is the canonical path for reasoning effort + extended thinking; it shipped in RubyLLM 1.12. Adopters on ruby_llm < 1.12 need to bump RubyLLM before upgrading this gem to 0.8.0.

Changed

  • Tagline + README opening — repositioned around "Contracts + Evals for RubyLLM". New "Relation to RubyLLM::Agent" section explicitly frames Step as a sibling abstraction (same niche as Agent, wider contract), not an alternative or foundation. README does not claim "Step uses Agent under the hood" — current call path is Step → Runner → Adapters::RubyLLM → RubyLLM.chat directly.
  • TokenEstimator documented as heuristic — module docstring expanded with explicit "±30% accuracy" framing. Refusal messages from LimitChecker now include (heuristic ±30%) suffix so adopters know the pre-flight number is estimated, not measured. RubyLLM 1.14 also has no pre-flight tokenizer; RubyLLM::Tokens is post-hoc only.
  • CostCalculator repositioned in docs — module narrative reframed from "cost calculator" to "fine-tune pricing registry + lookup with fallback chain". Math methods (compute_cost, token_cost, etc.) were already private; this release makes the docs match. Public API surface unchanged: register_model, unregister_model, reset_custom_models!, calculate.
  • output_schema reframed in docs — described as "wrapper around RubyLLM::Schema + client-side validation step", not a standalone feature. The schema language is identical to what RubyLLM::Agent.schema accepts; the difference is what wraps it.
  • README retry framingretry_policy escalate(...) (model escalation on validation failure) is the marketed default. retry_policy attempts: N (same-model retry) stays in the API for backward compat and niche cases (subjective criteria, multi-step pipelines, weaker models) but is no longer marketed as a recommended default. Empirical basis: four small experiments across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation found no useful lift for nano-class models on tasks with clear correctness criteria.

Documentation

  • New disambiguation paragraphs in prompt_ast.md (Step.input_type vs RubyLLM::Agent.inputs; Prompt::Builder multi-role DSL vs Agent ERB single-string template loader), testing.md (Step.observe vs Chat#on_end_message / on_tool_call), output_schema.md (relation to Agent.schema), and optimizing_retry_policy.md (orthogonal model + thinking dimensions).
  • getting_started.md refusal message example updated to include the new (heuristic ±30%) suffix.

Issues closed

  • #11 (Optimizer is blind to same-model attempts) — closed after empirical experiments. attempts: N retry stays in API; not marketed as a default.
  • #6 (Production cost reporting) — already implemented in 0.7.x; close confirmed.

Not in this release (deferred)

  • output_schema Proc form for runtime-input-aware schemas (parity with Agent.schema Proc form). Additive, low-risk; deferred to 0.9 to keep 0.8 scope tight.
  • H4 (Step composing RubyLLM::Agent internally as config holder) — verified feasible but ROI insufficient for current adopter base; trigger-based revisit, no calendar commitment.

0.7.3 (2026-04-24)

Adoption-friction release. No runtime behavior changes — every delta is in docs/, examples/, or spec/integration/ (plus the version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.

Documentation

  • New guide: docs/guide/why.md — four production failure modes the gem exists for (schema-valid logically wrong, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
  • New guide: docs/guide/rails_integration.md — seven Rails-specific FAQs with runnable snippets: where step classes live (app/contracts/), initializer setup, background jobs, around_call observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
  • README adoption-friction pass — added a short "Do I need this?" block after Install, a reading-order hint (README → why.md → getting_started.md), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
  • TL;DR box at the top of every guide — single-sentence orientation for readers who land via search; "Skip if" clause added where real confusion exists (eval_first.md, testing.md, migration.md).
  • API coverage gaps closedestimate_cost / estimate_eval_cost, max_cost on_unknown_pricing: :warn, run_eval(..., concurrency:), around_call testing patterns now documented in getting_started.md, eval_first.md, testing.md.
  • Industry-standard terminologytemperature-lockedfixed-temperature, variance-inducedsampling variance, severity signalsseverity keywords, takeaway drifttone/takeaways mismatch.
  • docs/architecture.md refresh — diagram now reflects the current class layout: added Step::RetryPolicy, Pipeline::Result, Eval::AggregatedReport, Eval::BaselineDiff, Eval::PromptDiffComparator, Eval::EvalHistory, Eval::RetryOptimizer, OptimizeRakeTask. Replaced the outdated Eval::TraitEvaluator entry with Eval::ExpectationEvaluator.
  • Business framing added to guides — every guide opens with a concrete production scenario or "why it matters" hook before the API reference.

Examples — consolidated on SummarizeArticle, renumbered 00-06

The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's SummarizeArticle case.

# File Answers
00 00_basics.rb How do I start? (seven incremental layers + real-LLM pointer)
01 01_fallback_showcase.rb Show me the gem in 30 seconds (zero API keys)
02 02_real_llm_minimal.rb How do I plug in a real LLM? (~30 lines)
03 03_summarize_with_keywords.rb How does the contract evolve? (growing prompt)
04 04_summarize_and_translate.rb Pipeline composition + pipeline-level run_eval
05 05_eval_dataset.rb How do I stop silent prompt regressions?
06 06_retry_variants.rb attempts: 3, reasoning_effort escalation, cross-provider (Ollama → Anthropic → OpenAI)

Every file carries an "Expected output" block in its header so readers see the result without running the script. The docs/ideas/ directory is now fully untracked (already in .gitignore; one stray file removed from version control).

Examples — bug fixes carried along

  • Schema pitfall fixed in 5 filesarray :x do; string :y; ...; end silently produces items: string and drops every declaration after the first, matching the documented pitfall in spec/ruby_llm/contract/nested_schema_spec.rb:71. Every affected array block is now wrapped in object do...end.
  • examples/05_eval_dataset.rb (pre-renumber: 09_eval_dataset.rb) result[:passed]result.passed? — the previous code called [] on an Eval::CaseResult and raised NoMethodError at runtime.

Testing

  • New spec/integration/pipeline_eval_spec.rb — three cases guaranteeing pipeline-level run_eval stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate validate rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts step_status == :validation_failed and the validate's label in details, so a regression that short-circuits on schema instead of validate would fail loudly.

Deleted (private-project cleanup)

  • examples/01_classify_threads.rb, 02_generate_comment.rb, 03_target_audience.rb, 10_reddit_full_showcase.rb, spec/integration/reddit_pipeline_spec.rb — Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
  • examples/02_output_schema.rb — fully covered by docs/guide/output_schema.md; deleting avoids duplication.

0.7.2 (2026-04-22)

Changed

  • Terminal output labels renamed for consistency with README narrative. print_summary now prints Hardest eval (was Constraining eval), Suggested fallback list (was Suggested chain), and the production-mode table uses first-attempt / fallback % as column headers (was single-shot / escalation). Programmatic metric names unchanged: single_shot_cost, single_shot_latency_ms, escalation_rate. RetryOptimizer::Result exposes hardest_eval as an alias for constraining_eval.

Documentation

  • docs/guide/optimizing_retry_policy.md rewritten — 17.7k → 6.4k characters. Continues the SummarizeArticle narrative from README. Offline mode clearly positioned as wiring-check; real optimization runs via LIVE=1 RUNS=3. Output samples match actual print_summary format.
  • docs/guide/getting_started.md rewritten — 8.7k → 6.1k. Every example uses SummarizeArticle. Evals + CI gates section moved before Budget caps. Structured Prompts / Dynamic Prompts / "Already using ruby_llm?" / Reasoning effort sections removed; content delegated to prompt_ast.md and README.
  • docs/guide/eval_first.md refined — 6.3k → 5.0k. Switched to SummarizeArticle case. Team workflow section compressed with links back to getting_started.md for the matcher chain.
  • docs/guide/testing.md refined — 10.7k → 7.4k. Switched to SummarizeArticle case. Threshold gating / Rake task / baseline walkthrough / prompt A/B sections delegated back to getting_started.md and eval_first.md.
  • docs/guide/output_schema.md DSL bug fix — the Supported constraints table documented JSON Schema camelCase keys (minLength, minItems, additionalProperties) that are not valid DSL arguments. Every copy-paste from the previous table would have raised ArgumentError. Switched to snake_case (min_length, min_items, additional_properties) as the DSL actually expects; added a short note on the internal camelCase conversion.
  • docs/guide/best_practices.md, pipeline.md, migration.md sanity pass — terminology alignment (model escalation → model fallback where narrative; escalate DSL method unchanged) and SummarizeArticle case where the guide is not inherently multi-step.

0.7.1 (2026-04-22)

Changed (behavioral, follow-up to v0.7.0)

  • Step::Base#run_once no longer swallows adapter-phase ArgumentError as :input_error. The previous blanket rescue ArgumentError was there to convert DSL misconfiguration (e.g. missing prompt) into an :input_error Result. Side effect: programmer bugs in adapter code that raised ArgumentError (wrong arity, bad config argument) were silently coerced into :input_error and retried as if the user had given bad input. Now the rescue is narrowed to the Runner-construction phase only — DSL configuration errors still produce :input_error (the prompt has not been set case is regression-tested), but ArgumentError raised from adapter code during Runner#call propagates to the caller. Input-type validation failures continue to produce :input_error through InputValidator's own scoped rescue, unchanged.

0.7.0 (2026-04-21)

Breaking changes

  • :adapter_error removed from DEFAULT_RETRY_ON. New default: [:validation_failed, :parse_error]. ruby_llm already retries transport errors (RateLimitError, ServerError, ServiceUnavailableError, OverloadedError, timeouts) at the Faraday layer, so the previous default re-ran the same model on errors the HTTP middleware already retried with backoff. To restore pre-0.7 behavior: retry_on :validation_failed, :parse_error, :adapter_error. Recommended pattern: pair :adapter_error with escalate "model_a", "model_b" — a different model/provider can bypass what transport retry could not.
  • AdapterCaller narrows rescue from StandardError to RubyLLM::Error + Faraday::Error. Provider errors and transport errors that escape ruby_llm's Faraday retry middleware (Faraday::TimeoutError, Faraday::ConnectionFailed) still produce :adapter_error as before. Programmer errors that are neither (NoMethodError, adapter code bugs) now propagate instead of being silently converted to :adapter_error and retried. Known limitation: adapter code raising ArgumentError is still coerced into :input_error by Step::Base#run_once (which rescues ArgumentError for input-type validation). Disambiguating adapter-ArgumentError vs input-validation-ArgumentError requires a run_once refactor and is tracked as a follow-up.

Migration

If you rely on the old behavior, opt in explicitly:

retry_policy do
  attempts 3
  retry_on :validation_failed, :parse_error, :adapter_error
end

Or better, with a model fallback chain:

retry_policy do
  escalate "gpt-4.1-nano", "gpt-4.1-mini"
  retry_on :validation_failed, :parse_error, :adapter_error
end

0.6.4 (2026-04-20)

Features

  • production_mode: on compare_models and optimize_retry_policy — measures retry-aware, end-to-end cost per successful output. Pass production_mode: { fallback: "gpt-5-mini" } and each candidate runs with a runtime-injected [candidate, fallback] retry chain. The report exposes escalation_rate, single_shot_cost, and effective_cost so "the cheaper candidate" decision matches production cost rather than first-attempt cost.
  • New Report metricsescalation_rate, single_shot_cost, effective_cost, single_shot_latency_ms, effective_latency_ms, latency_percentiles (p50/p95/max). AggregatedReport averages all of them across runs:.
  • Extended ModelComparison#table — when production_mode: is set, renders a Chain column (candidate → fallback) with single-shot, escalation, effective cost, latency, score. Edge case candidate == fallback renders as a single model and in the escalation column, with retry injection skipped entirely so effective == single-shot by construction, not by coincidence.
  • context[:retry_policy_override] — new context key that nullifies or replaces class-level retry_policy for a single call. Used internally by production-mode injection; safe to use directly when you need a transient override that doesn't mutate the step class.

Scope

  • Single-fallback (2-tier) chains only. Multi-tier chains can be inspected post-hoc via trace.attempts but aren't summarized in the optimize table.
  • Costs with runs: 3 + production_mode: { fallback: "gpt-5-mini" } are ≈3× a single-shot eval plus the actual retry attempts — not 6×. Production-mode metrics come from a single pass.
  • Step-only. Calling compare_models with production_mode: on a Pipeline::Base subclass raises ArgumentError — retry injection is Step-level and pipeline-wide fallback semantics aren't defined yet. Benchmark individual steps.

Documentation

0.6.3 (2026-04-20)

Features

  • runs: parameter on compare_models and optimize_retry_policy — runs each candidate N times per eval and aggregates the mean score, mean cost per run, and mean latency. Reduces sampling variance in live mode where LLM outputs are non-deterministic (gpt-5 family enforces temperature=1.0 server-side, so a single unlucky sample can misclassify a viable candidate as "failing"). Default runs: 1 — backward compatible.
  • RUNS=N on rake ruby_llm_contract:optimize — CLI flag for variance-aware optimization.
  • Eval::AggregatedReport — duck-type Report exposing score (mean), score_min/score_max (spread), total_cost (mean per run), pass_rate (clean-pass count x/N), and clean_passes.
  • Guide: Reducing variance with runs: — when to use it and why.

0.6.2 (2026-04-18)

Features

  • Step.optimize_retry_policy — runs compare_models on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. Chain's last model always passes all evals (safe fallback).
  • rake ruby_llm_contract:optimize — one-command retry chain optimization. Prints score table, constraining eval, suggested chain, and copy-paste DSL.
  • Offline by defaultoptimize uses sample_response (zero API calls) unless LIVE=1 or PROVIDER= is set.
  • EVAL_DIRS= support — non-Rails setups can specify eval file directories.
  • Guide: Optimizing retry_policy — full procedure with prerequisites, troubleshooting, and real-world example.

Fixes

  • Chain semantics aligned with retry_executor — retry fires on validation_failed/parse_error, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns empty chain.
  • Removed ActiveSupport dependency from rake task (.presence.empty?).
  • Added require "set" for non-Rails environments.

0.6.1 (2026-04-17)

Features

  • Multi-provider operator tooling — rake tasks support PROVIDER=openai|anthropic|ollama, CANDIDATES=model@effort,..., and REASONING_EFFORT=low|medium|high.
  • rake ruby_llm_contract:recommend — wraps Step.recommend with CLI interface, prints best config, retry chain, DSL, rationale, and savings.
  • Ollama supportPROVIDER=ollama with configurable OLLAMA_API_BASE.

0.6.0 (2026-04-12)

"What should I do?" — model + configuration recommendation.

Features

  • Step.recommendClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95) runs eval on all candidates and returns a Recommendation with optimal model, retry chain, rationale, savings vs current config, and to_dsl code output.
  • Candidates as configurationscandidates: accepts { model:, reasoning_effort: } hashes, not just model name strings. gpt-5-mini with reasoning_effort: "low" is a different candidate than with "high".
  • compare_models extended — new candidates: parameter alongside existing models: (backward compatible). Candidate labels include reasoning effort in output table.
  • Per-attempt reasoning_effort in retry policiesescalate accepts config hashes: escalate({ model: "gpt-5-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" }). Each attempt gets its own reasoning_effort forwarded to the provider.
  • pass_rate_ratio — numeric float (0.0–1.0) on Report and ReportStats, complementing the string pass_rate ("3/5").
  • History entries enrichedsave_history! accepts reasoning_effort: and stores model, reasoning_effort, pass_rate_ratio in JSONL entries.

Game changer continuity

v0.2: "Which model?"          → compare_models (snapshot)
v0.3: "Did it change?"        → baseline regression (binary)
v0.4: "Show me the trend"     → eval history (time series)
v0.5: "Which prompt is better?" → compare_with (A/B testing)
v0.6: "What should I do?"     → recommend (actionable advice)

0.5.2 (2026-04-06)

Features

  • reasoning_effort forwarded to providercontext: { reasoning_effort: "low" } now passed through with_params to the LLM. Previously accepted as a known context key but silently ignored by the RubyLLM adapter.

0.5.0 (2026-03-25)

Data-Driven Prompt Engineering.

Features

  • observe DSL — soft observations that log but never fail. observe("scores differ") { |o| o[:a] != o[:b] }. Results in result.observations. Logged via Contract.logger when they fail. Runs only when validation passes.
  • compare_with — prompt A/B testing. StepV2.compare_with(StepV1, eval: "regression", model: "nano") returns PromptDiff with improvements, regressions, score_delta, safe_to_switch?. Reuses BaselineDiff internally.
  • RSpec compared_with chainexpect(StepV2).to pass_eval("x").compared_with(StepV1).without_regressions blocks merge if new prompt regresses any case.

Game changer continuity

v0.2: "Which model?"          → compare_models (snapshot)
v0.3: "Did it change?"        → baseline regression (binary)
v0.4: "Show me the trend"     → eval history (time series)
v0.5: "Which prompt is better?" → compare_with (A/B testing)

0.4.5 (2026-03-24)

Audit hardening — 18 bugs fixed across 4 audit rounds.

Fixes

  • RakeTask history before aborttrack_history now saves all reports (pass and fail) before gating, so failed runs appear in eval history.
  • RSpec/Minitest stub scoping — block form stub_step uses thread-local overrides with real cleanup. Non-block stub_all_steps auto-restored by RSpec around(:each) hook and Minitest setup/teardown.
  • StepAdapterOverride — handles context: nil and respects string key "adapter". Moved to contract.rb so both test frameworks share one mechanism.
  • max_cost fail closed output estimate — preflight uses 1x input tokens as output estimate when max_output not set, preventing cost bypass for output-expensive models.
  • reset_configuration! clears overridesstep_adapter_overrides now cleared on reset.
  • CostCalculator.register_model — validates Numeric, finite?, non-negative. Rejects NaN, Infinity, strings, nil.
  • Pipeline token_budget — rejects negative and zero values (parity with timeout_ms).
  • track_history model fallback — uses step DSL model, then default_model when context has no model. Handles string key "model".
  • estimate_cost / estimate_eval_cost — falls back to step DSL model when no explicit model arg given.
  • stub_steps string keys — both RSpec and Minitest normalize string-keyed options with transform_keys(:to_sym).
  • DSL :default resetmodel(:default), temperature(:default), max_cost(:default) reset inherited parent values.

0.4.4 (2026-03-24)

  • stub_steps (plural) — stub multiple steps with different responses in one block. No nesting needed. Works in RSpec and Minitest:
    stub_steps(
      ClassifyTicket => { response: { priority: "high" } },
      RouteToTeam => { response: { team: "billing" } }
    ) { TicketPipeline.run("test") }

0.4.3 (2026-03-24)

Production feedback release.

Features

  • stub_step block formstub_step(Step, response: x) { test } auto-resets adapter after block. Works in RSpec and Minitest. Eliminates leaked test state.
  • Minitest per-step routingstub_step(StepA, ...) now actually routes to StepA only (was setting global adapter, ignoring step class).
  • track_history in RakeTaskt.track_history = true auto-appends every eval run (pass and fail) to .eval_history/. Drift detection without manual save_history! calls.
  • max_cost fail closed — unknown model pricing now refuses the call instead of silently skipping. Set on_unknown_pricing: :warn for old behavior.
  • CostCalculator.register_model — register pricing for custom/fine-tuned models: register_model("ft:gpt-4o", input_per_1m: 3.0, output_per_1m: 6.0).

0.4.2 (2026-03-24)

  • RakeTask lazy contextt.context now accepts a Proc, resolved at task runtime (after :environment). Fixes adapter not being available at Rake load time in Rails apps.

0.4.1 (2026-03-24)

  • RakeTask :environment fix — uses defined?(::Rails) instead of Rake::Task.task_defined?(:environment). Works in Rails 8 without manual Rake::Task.enhance.
  • Concurrent eval deterministicclone_for_concurrency protocol, ContextHelpers extracted.
  • README — added eval history, concurrency, quality tracking examples.

0.4.0 (2026-03-24)

Observability & Scale — see what changed, run it fast, debug it easily.

Features

  • Structured loggingContract.configure { |c| c.logger = Rails.logger }. Auto-logs model, status, latency, tokens, cost on every step.run.
  • Batch eval concurrencyrun_eval("regression", concurrency: 4). Parallel case execution via Concurrent::Future. 4x faster CI for large eval suites.
  • Eval history & trendingreport.save_history! appends to JSONL. report.eval_history returns EvalHistory with score_trend, drift?, run-by-run scores.
  • Pipeline per-step evaladd_case(..., step_expectations: { classify: { priority: "high" } }). See which step in a pipeline regressed.
  • Minitest supportassert_satisfies_contract, assert_eval_passes, stub_step for Minitest users. require "ruby_llm/contract/minitest".

Game changer continuity

v0.2: "Which model?"          → compare_models (snapshot)
v0.3: "Did it change?"        → baseline regression (binary)
v0.4: "Show me the trend"     → eval history (time series)
      "Which step changed?"   → pipeline per-step eval
      "Run it fast"           → batch concurrency

0.3.7 (2026-03-24)

  • Trait missing key = errorexpected_traits: { title: 0..5 } on output {} now fails instead of silently passing.
  • nil input in dynamic promptsrun(nil) with prompt { |input| ... } correctly passes nil to block.
  • Defensive sample pre-validationsample_response uses the same parser as runtime (handles code fences, BOM, prose around JSON).
  • Baseline diff excludes skipped — self-compare with skipped cases no longer shows artificial score delta.
  • Zeitwerk eval/ ignoreeager_load_contract_dirs! ignores eval/ subdirs before eager load.

0.3.6 (2026-03-24)

  • Recursive array/object validation — nested arrays (array of array of string) validated recursively. Object items validated even without :properties (e.g. additionalProperties: false).
  • Deep symbolize in sample pre-validation — array samples with string keys ([{"name" => "Alice"}]) correctly symbolized before schema validation.

0.3.5 (2026-03-24)

  • String constraints in SchemaValidatorminLength/maxLength enforced for root and nested strings.
  • Array item validation — scalar items (string, integer) validated against items schema type and constraints.
  • Non-JSON sample_response fails fastsample_response("hello") with object schema raises ArgumentError at definition time instead of silently passing.
  • max_tokens in KNOWN_CONTEXT_KEYS — no more spurious "Unknown context keys" warning.
  • Duplicate models deduplicatedcompare_models(models: ["m", "m"]) runs model once.

0.3.4 (2026-03-24)

  • SchemaValidator validates non-object roots — boolean, integer, number, array root schemas now enforce type, min/max, enum, minItems/maxItems. Previously only object schemas were validated.
  • Removed passing cases = regressionregressed? returns true when baseline had passing cases that are now missing. Prevents gate bypass by deleting eval cases.
  • JSON string sample_response fixedsample_response('{"name":"Alice"}') correctly parsed for pre-validation instead of double-encoding.
  • context[:max_tokens] forwarded — overrides step's max_output for adapter call AND budget precheck.

0.3.3 (2026-03-23)

  • Skipped cases visible in regression diff — baseline PASS → current SKIP now detected as regression by without_regressions and fail_on_regression.
  • Skip only on missing adapter — eval runner no longer masks evaluator errors as SKIP. Only "No adapter configured" triggers skip.
  • Array/Hash sample pre-validationsample_response([{...}]) correctly validated against schema instead of silently skipping.
  • assume_model_exists: false forwarded — boolean false no longer dropped by truthiness check in adapter options.
  • Duplicate case names caught at definitionadd_case/verify with same name raises immediately, not at run time.

0.3.2 (2026-03-23)

  • Array response preservedAdapters::RubyLLM no longer stringifies Array content. Steps with output_type Array work correctly.
  • Falsy prompt inputrun(false) and build_messages(false) pass false to dynamic prompt blocks instead of falling back to instance_eval.
  • retry_on flattenretry_on([:a, :b]) no longer wraps in nested array.
  • Builder resetPrompt::Builder resets nodes on each build (no accumulation on reuse).
  • Pipeline false outputoutput: false no longer shows "(no output)" in pretty_print.

0.3.1 (2026-03-23)

Fixes from persona_tool production deployment (4 services migrated).

  • Proc/Lambda in expected_traitsexpected_traits: { score: ->(v) { v > 3 } } now works.
  • Zeitwerk eager-loadload_evals! eager-loads app/contracts/ and app/steps/ before loading eval files. Fixes uninitialized constant errors in Rake tasks.
  • Falsy valuesexpected: false, input: false, sample_response(nil) all handled correctly.
  • Context key forwardingprovider: and assume_model_exists: forwarded to adapter. schema: and max_tokens: are step-level only (no split-brain).
  • Deep-freeze immutability — constructors never mutate caller's data.

0.3.0 (2026-03-23)

Baseline regression detection — know when quality drops before users do.

Features

  • report.save_baseline! — serialize eval results to .eval_baselines/ (JSON, git-tracked)
  • report.compare_with_baseline — returns BaselineDiff with regressions, improvements, score_delta, new/removed cases
  • diff.regressed? — true when any previously-passing case now fails
  • without_regressions RSpec chainexpect(Step).to pass_eval("x").without_regressions
  • RakeTask fail_on_regression — blocks CI when regressions detected
  • RakeTask save_baseline — auto-save after successful run
  • Migration guidedocs/guide/migration.md with 7 patterns for adopting the gem in existing Rails apps

Stats

  • 1086 tests, 0 failures

0.2.3 (2026-03-23)

Production hardening from senior Rails review panel.

  • around_call propagates exceptions — no longer silently swallows DB errors, timeouts, etc. User who wants swallowing can rescue in their block.
  • Nil section content skippedsection "X", nil no longer renders "null" to the LLM. Section is omitted entirely.
  • Range support in expected:expected: { score: 1..5 } works in add_case. Previously only Regexp was supported.
  • Trace#digtrace.dig(:usage, :input_tokens) works on both Step and Pipeline traces.

0.2.2 (2026-03-23)

Fixes from first real-world integration (persona_tool).

  • around_call fires per-run — not per-attempt. With retry_policy, callback fires once with final result. Signature: around_call { |step, input, result| ... }
  • Result#trace always Trace object — never bare Hash. result.trace.model works on success AND failure.
  • around_call exception safe — warns and returns result instead of crashing.
  • model DSLmodel "gpt-4o-mini" per-step. Priority: context > step DSL > global config.
  • Test adapter raw_output always String — Hash/Array normalized to .to_json.
  • Trace#digtrace.dig(:usage, :input_tokens) works.

0.2.1 (2026-03-23)

Production DX improvements from first real-world integration (persona_tool).

Features

  • temperature DSLtemperature 0.3 in step definition, overridable via context: { temperature: 0.7 }. RubyLLM handles per-model normalization natively.
  • around_call hook — callback for logging, metrics, observability. Replaces need for custom middleware.
  • build_messages public — inspect rendered prompt without running the step.
  • stub_step RSpec helperstub_step(MyStep, response: { ... }) reduces test boilerplate. Auto-included via require "ruby_llm/contract/rspec".
  • estimate_cost / estimate_eval_cost — predict spend before API calls.

Fixes

  • Reload lifecycleload_evals! clears definitions before re-loading. Railtie hooks config.to_prepare for development reload. define_eval warns on duplicate name (suppressed during reload).
  • Pipeline eval cost — uses Pipeline::Trace#total_cost (all steps), not just last step.
  • Adapter isolationcompare_models and run_all_own_evals deep-dup context per run.
  • Offline mode — cases without adapter return :skipped instead of crashing. Skipped cases excluded from score.
  • expected_traits reachable from define_eval DSL via add_case.
  • verify raises when both positional and expect: keyword provided.
  • best_for excludes zero-score models from recommendation.
  • print_summary replaces pretty_print (avoids Kernel#pretty_print shadow).
  • CaseResult#to_h round-trips correctly (name: key).

Docs

  • All 5 guides updated for v0.2 API
  • Symbol keys documented
  • Retry model priority documented
  • Test adapter format documented

Stats

  • 1077 tests, 0 failures
  • 3 architecture review rounds, 32 findings fixed

0.2.0 (2026-03-23)

Contracts for LLM quality. Know which model to use, what it costs, and when accuracy drops.

Breaking changes

  • report.results returns CaseResult objects instead of hashes. Use result.name, result.passed?, result.score instead of result[:case_name], result[:passed]. CaseResult#to_h for backward compat.
  • report.print_summary replaces report.pretty_print (avoids shadowing Kernel#pretty_print).

Features

  • add_case in define_evaladd_case "billing", input: "...", expected: { priority: "high" } with partial matching. Supports expected_traits: for regex/range matching.
  • CaseResult value objectsresult.name, result.passed?, result.output, result.expected, result.mismatches (structured diff), result.cost, result.duration_ms.
  • report.failures — returns only failed cases. report.skipped counts skipped (offline) cases.
  • Model comparisonStep.compare_models("eval", models: %w[nano mini full]) runs same eval across models. Returns table with score/cost/latency per model. comparison.best_for(min_score: 0.95) returns cheapest model meeting threshold.
  • Cost trackingreport.total_cost, report.avg_latency_ms, per-case result.cost. Pipeline eval uses total pipeline cost, not just last step.
  • Cost predictionStep.estimate_cost(input:, model:) and Step.estimate_eval_cost("eval", models: [...]) predict spend before API calls.
  • CI gatingpass_eval("regression").with_minimum_score(0.8).with_maximum_cost(0.01). RakeTask with suite-level minimum_score and maximum_cost.
  • RubyLLM::Contract.run_all_evals — discovers all Steps/Pipelines with evals, runs them all. Includes inherited evals.
  • RubyLLM::Contract::RakeTaskrake ruby_llm_contract:eval with minimum_score, maximum_cost, fail_on_empty, eval_dirs.
  • Rails Railtie — auto-loads eval files via config.after_initialize + config.to_prepare (supports development reload).
  • Offline mode — cases without adapter return :skipped instead of crashing. Skipped cases excluded from score/passed.
  • Safe define_eval — warns on duplicate name; suppressed during reload.

Fixes

  • P1: Eval files not autoloaded by Rails — Railtie uses load (not Zeitwerk). Hooks into reloader for dev.
  • P2: report.results returns raw Hashes — now returns CaseResult objects.
  • P3: No way to run all evals at onceContract.run_all_evals + Rake task.
  • P4: String vs symbol key mismatch — warns when validate or verify proc returns nil.
  • Pipeline eval cost — uses Pipeline::Trace#total_cost (all steps), not just last step.
  • Reload lifecycleload_evals! clears definitions before re-loading. Registry filters stale hosts.
  • Adapter isolationcompare_models and run_all_own_evals deep-dup context per run.

Verified with real API

Model                      Score       Cost  Avg Latency
---------------------------------------------------------
gpt-4.1-nano                0.67    $0.000032      687ms
gpt-4.1-mini                1.00    $0.000102     1070ms

Stats

  • 1077 tests, 0 failures
  • 3 architecture review rounds, 32 findings fixed
  • Verified with real OpenAI API (gpt-4.1-nano, gpt-4.1-mini)

0.1.0 (2026-03-20)

Initial release.

Features

  • Step abstractionRubyLLM::Contract::Step::Base with prompt DSL, typed input/output
  • Output schema — declarative structure via ruby_llm-schema, sent to provider for enforcement
  • Validate — business logic checks (1-arity and 2-arity with input cross-validation)
  • Retry with model escalation — start cheap, auto-escalate on contract failure or network error
  • Preflight limitsmax_input, max_cost, max_output refuse before calling the LLM
  • Pipeline — multi-step composition with fail-fast, timeout, token budget
  • Eval — offline contract verification with define_eval, run_eval, zero-verify auto-case
  • Adapters — RubyLLM (production), Test (deterministic specs)
  • RSpec matcherssatisfy_contract, pass_eval
  • Structured trace — model, latency, tokens, cost, attempt log per step

Robustness

  • 1005 tests, 0 failures
  • 42 bugs found and fixed via 10 rounds of adversarial testing
  • 0 RuboCop offenses
  • Parser handles: markdown code fences, UTF-8 BOM, JSON extraction from prose
  • SchemaValidator: full nested validation, additionalProperties, minItems/maxItems, minLength/maxLength
  • Deep-frozen parsed_output prevents mutation via shared references