Patch release: the ruby_llm_contract:optimize rake task now auto-loads in Rails apps. No behaviour change to the task itself.
- `ruby_llm_contract:optimize` no longer requires a manual `require "ruby_llm/contract/rake_task"` in Rails apps. Pre-0.10.4 the docs claimed the task was "included" with `RubyLLM::Contract::RakeTask`, but the railtie did not load the file, so adopters running `bin/rails ruby_llm_contract:optimize` got `Unrecognized command` until they added the require to `Rakefile` or `lib/tasks/*.rake`. The railtie now uses the standard `rake_tasks { require "..." }` idiom, lazy-loading the file only when `rake` is invoked (no boot cost).
- Docs corrected in `docs/guide/optimizing_retry_policy.md`: the "rake task is included" line now explicitly states that it auto-loads on 0.10.4+ in Rails, and that non-Rails / older-Rails setups still need the explicit `require`.
Hot-fix release: the schema validator now correctly accepts nil on required-but-nullable fields. This unblocks OpenAI structured-output strict mode, where every property has to be in required and "nullable" is expressed as a null branch in anyOf/oneOf or as an array type — exactly the combination the prior validator wrongly rejected.
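The distinction between "required" and "nullable" can be sketched in a few lines of plain Ruby. This is an illustration under stated assumptions, not the gem's `SchemaValidator`: the helper `nil_field_errors` is hypothetical, `nullable_schema?` only borrows the name mentioned in these notes, and schemas are assumed to use symbol keys.

```ruby
# Sketch only: method names are illustrative, not the gem's internals.

# True when a JSON Schema fragment (symbol keys assumed) admits null.
def nullable_schema?(schema)
  type = schema[:type]
  return true if type == "null"                             # degenerate scalar form
  return true if type.is_a?(Array) && type.include?("null") # array form

  # anyOf / oneOf: nullable when any branch admits null.
  %i[anyOf oneOf].any? do |key|
    Array(schema[key]).any? { |branch| nullable_schema?(branch) }
  end
end

# "required" is about presence; nullability is checked separately.
def nil_field_errors(name, object, schema, required:)
  return (required ? ["#{name} is required"] : []) unless object.key?(name)
  return [] unless object[name].nil?

  nullable_schema?(schema) ? [] : ["#{name} must not be null"]
end

nullable = { anyOf: [{ type: "string" }, { type: "null" }] }
strict   = { type: "string" }

nil_field_errors(:note, { note: nil }, nullable, required: true) # => []
nil_field_errors(:note, { note: nil }, strict, required: true)   # => ["note must not be null"]
nil_field_errors(:note, {}, nullable, required: true)            # => ["note is required"]
```

Keeping the presence check and the null check as two separate questions is what lets a field be both required and nullable.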
- `SchemaValidator` no longer rejects legal `nil` on required-nullable fields. Pre-0.10.3 (every published version 0.2.x–0.10.2) conflated "required" (must be present) with "non-nullable" (cannot be null); JSON Schema treats those as orthogonal. The validator now honours all three idioms for nullability:
  - `type: ["string", "null"]` (array form)
  - `type: "null"` (degenerate scalar form)
  - `anyOf: [{type: "string"}, {type: "null"}]` and `oneOf` equivalents

Adopter impact: if you bypassed the bug by setting `required: false` and disabling OpenAI `strict: true` on every nullable field, you can now restore `strict: true` and keep the field required (OpenAI's standard nullable idiom). Non-nullable required fields still reject `nil` exactly as before; this is a strictly additive fix.

Discovered via dogfooding in a production adopter using OpenAI strict structured output with 17 nullable fields (a "set the rest to null" prompt). Smoking-gun mutation: remove the `nullable_schema?` short-circuit from `validate_nil_field` and the three new spec cases in `spec/ruby_llm/contract/contract/schema_validator_spec.rb` fail.
Patch release: ship the docs/guide/ directory inside the gem so adopters and LLM integration agents can read the manuals locally (via bundle show ruby_llm-contract or gem unpack) without an internet round-trip to GitHub. No code behavior change.
- `docs/guide/*` is now packaged with the gem (14 files, ~120 KB). Previously the README's "See also" links pointed at `docs/guide/getting_started.md`, `docs/guide/optimizing_retry_policy.md`, etc., but those files were stripped from the published gem. LLM integration agents (Cursor, Claude Code, Copilot) reported "no documentation in the gem" because the links 404'd locally. `docs/ideas/` and `doc/decisions/` remain excluded.
- `models:` keyword form documented and pinned for hash configs. `retry_policy models: ["gpt-5-nano", { model: "gpt-5-mini", reasoning_effort: "high" }]` is now covered by a spec and shown in `getting_started.md`. The block form (`escalate(...)`) and the keyword form share the same `@configs` storage; both forms accept config hashes.
- `reasoning_effort` examples corrected to the gpt-5 family. Pre-0.10.2 docs and specs paired `reasoning_effort` with `gpt-4.1-*` model names, which is incorrect: `gpt-4.1` is not a reasoning model. Updated across `docs/guide/getting_started.md`, `docs/guide/optimizing_retry_policy.md`, `CHANGELOG.md`, and 7 spec files. Non-reasoning `gpt-4.1` examples (model fallback chains without `reasoning_effort`) are unchanged.
- Version mentions corrected in README and multimodal guide. README FAQ ("Upgraded to 0.9.0 - why?") now reads "Upgraded to 0.10.0 from 0.8.x"; `docs/guide/multimodal_input.md` references `0.10.0+` and `0.10.x` instead of `0.9.0`. 0.9.0 and 0.9.1 were tagged but never published to rubygems; adopters jump from 0.8.0 directly to 0.10.x.
Patch release fixing gem packaging. 0.10.0 was yanked from rubygems.org due to the issue documented below; 0.10.1 is the recommended upgrade target. No code behavior change vs 0.10.0.
- Gem no longer ships internal tracker / dev configs. Excluded from `spec.files`: `TODO.md`, `.rspec`, `.rubycritic.yml`, `.simplecov`, and the `.revive/` directory. Pre-0.10.1 the published gem contained these files; adopters who already extracted 0.10.0 can safely delete them.
First published release since 0.8.0. Consolidates work originally tagged as 0.9.0 (multimodal input) and 0.9.1 (internal quality refactor), neither of which was pushed to rubygems. Adopters upgrading from 0.8.0 should read the Behavioural change and Breaking changes sections below before installing.
- `validate(description, &block)` and `Definition#invariant(description, &block)` now raise `ArgumentError` when `description` is `nil` or empty. Pre-0.10.0 the empty descriptor was silently accepted and produced `""` entries in `result.validation_errors`, making debugging impossible. A Codex audit found zero production use sites across `lib/`, `examples/`, `README`; the only use was the regression-marker test certifying the bug.
Ensure every validate / invariant call has a non-empty descriptor (this is already how every README example writes them):
```ruby
# Before (silently accepted, produced "" in validation_errors):
validate("") { |o| o[:score].between?(0, 100) }
# After (required):
validate("score in range 0-100") { |o| o[:score].between?(0, 100) }
```

- Multimodal input via `context: { attachment: ... }`: pass a file/IO/URL through `Step.run(input, context: { attachment: path })`; the adapter forwards it to `RubyLLM::Chat#ask(content, with: attachment)`. RubyLLM normalises wire format per provider (Anthropic url/base64, OpenAI `image_url`/`file`, Gemini `inline_data`). Multi-attachment supported natively (`with: [pdf1, pdf2]` or `with: { images: [...], pdfs: [...] }`). See the multimodal input guide and ADR-0022.
- `attachment_token_estimate(n)` class macro: adopter-declared conservative estimate of attachment input tokens. Applied to BOTH runtime (`limit_checker`) and pre-flight (`estimate_cost`), so there is one source of truth and no estimate/runtime drift.
- `on_unknown_attachment_size(:refuse | :warn)` class macro: mirrors `on_unknown_pricing` opt-out semantics. Defaults to `:refuse`. Never settable as a global default (same invariant as `max_cost` fail-closed).
- Contracts with `max_cost` or `max_input` AND `context[:attachment]` set AND no `attachment_token_estimate` declared → REFUSE with `:limit_exceeded`. This is fail-closed semantics: the gem cannot bound vision/PDF token cost without an adopter-declared estimate. Opt out per-step with `on_unknown_attachment_size :warn`. Text-only contracts and contracts without `max_cost`/`max_input` are unaffected.
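The fail-closed rule above is a three-way conjunction, which is easy to misread. A minimal sketch of the decision follows; `attachment_preflight`, its keyword names, and the returned symbols other than `:limit_exceeded` are hypothetical and chosen only for illustration.

```ruby
# Hypothetical decision helper; not the gem's LimitChecker.
def attachment_preflight(limits:, attachment:, estimate:, on_unknown: :refuse)
  bounded = limits.key?(:max_cost) || limits.key?(:max_input)

  # Text-only contracts and contracts without a limit are unaffected.
  return :proceed unless bounded && attachment
  # An adopter-declared estimate makes the attachment cost boundable.
  return :proceed if estimate

  # No estimate: refuse by default, warn only on explicit opt-out.
  on_unknown == :warn ? :warn_and_proceed : :limit_exceeded
end

attachment_preflight(limits: { max_cost: 0.05 }, attachment: "report.pdf", estimate: nil)
# => :limit_exceeded
attachment_preflight(limits: { max_cost: 0.05 }, attachment: "report.pdf", estimate: 4_000)
# => :proceed
```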
- `run_eval` (no args) return shape pinned to `Hash<String, Report>` keyed by eval name. Documents the existing contract used by `RubyLLM::Contract::RakeTask#collect_host_reports` and adopters. No runtime change vs 0.8.0; only the spec assertion now locks the shape.
- `Parser.parse(text, strategy: :json)` first-bracket-wins boundary documented. Extraction commits to the first balanced `{` or `[` structure and does NOT retry on later candidates. Empty `{}` followed by real JSON parses as the empty Hash; non-JSON `{braces}` before real JSON raises `ParseError`. No runtime change: this codifies long-standing behavior with explicit boundary tests.
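The first-bracket-wins boundary can be illustrated with a small standalone sketch. This is not the gem's `Parser`: `extract_first_json` and the local `ParseError` class are hypothetical stand-ins, and the only library used is Ruby's standard `json`.

```ruby
require "json"

# Local stand-in for the gem's ParseError (assumption for this sketch).
class ParseError < StandardError; end

# Commits to the first balanced {...} or [...] span and never looks further.
def extract_first_json(text)
  start = text.index(/[{\[]/)
  raise ParseError, "no JSON structure found" unless start

  depth = 0
  in_string = false
  escaped = false

  text[start..].each_char.with_index do |char, offset|
    if in_string
      # Inside a string literal, brackets do not count toward nesting.
      if escaped
        escaped = false
      elsif char == "\\"
        escaped = true
      elsif char == '"'
        in_string = false
      end
      next
    end

    case char
    when '"' then in_string = true
    when "{", "[" then depth += 1
    when "}", "]"
      depth -= 1
      next unless depth.zero?

      # First balanced candidate found: parse it or fail, with no retry.
      begin
        return JSON.parse(text[start, offset + 1])
      rescue JSON::ParserError => e
        raise ParseError, "first candidate is not JSON: #{e.message}"
      end
    end
  end

  raise ParseError, "unbalanced structure"
end

extract_first_json('note {} then {"a": 1}') # => {} (the empty Hash wins)
```

With `'see {braces} then {"a": 1}'` the same function raises `ParseError`, because the first balanced candidate is not valid JSON and later candidates are never tried.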
- `with_retry_disabled` no longer mutates the step class's singleton method. The optimizer now passes `retry_policy_override: nil` through `context:` to `compare_models`, which `Step::Base#runtime_settings` already honours. Removes a concurrency hazard where two parallel `optimize_retry_policy` calls on the same step would race on the singleton restore in `ensure`.
- `CostCalculator.find_model` exposed as a public class method. Removes two `CostCalculator.send(:find_model, ...)` workarounds in `Step::Base#estimate_cost`. The `estimated_cost_for` helper is gone; `estimate_cost` now routes through the existing public `CostCalculator.calculate(model_name:, usage:)`.
- `stub_step` unified on a single storage path. Both block and non-block forms now write to `RubyLLM::Contract.step_adapter_overrides` (thread-local). The `around(:each)` hook in `rspec.rb` handles cleanup between examples. Removes the prior `allow(step).to receive(:run)` branch.
- Anti-facade audit complete: 89/89 spec files under per-test 17-mode walk (Phase A: 26 specs, Phase C: 63 specs via parallel Codex fan-out). Net +30 strengthened tests against mutation-blind assertions, zero public API change beyond the breaking entry above.
- Dead `ObjectSpace.each_object(Class)` fallback removed in `concerns/eval_host.rb#register_subclasses`. The gemspec requires Ruby `>= 3.2.0`, so `Class#subclasses` (Ruby ≥ 3.1) is always available; the legacy fallback was unreachable code that would have iterated all loaded classes O(n) and was not thread-safe.
- `add_history` multi-turn replay of prior attachments: single-turn multimodal is supported; follow-up questions on the same document are deferred to a later release.
- Streaming + attachment: contract steps remain synchronous.
- Provider-specific attachment size caps: surfaced only via `attachment_token_estimate` calibration; consult provider docs.
- Suite: 1401 examples / 0 failures / 7 pending (was 1346/0/8 at 0.8.0).
Narrative repositioning + small API additions. Internal architecture unchanged: no Step::Base refactor, no breaking changes to existing DSL.
- `thinking(effort:, budget:)` class macro on `Step::Base`: mirrors the `RubyLLM::Agent.thinking` signature exactly. Stored as a `{ effort:, budget: }` hash; the reader returns the hash; supports `:default` reset semantics; superclass inheritance like `model`/`temperature`. The convenience alias `reasoning_effort(:low)` is implemented as `thinking(effort: :low)`, so there is a single normalized state rather than a separate ivar.
- Adapter wiring for `with_thinking`: when `thinking` is set on the Step class, OR when `reasoning_effort:` is passed through context, OR when an attempt config in `retry_policy escalate(...)` carries `reasoning_effort:`, the RubyLLM adapter resolves the effective `{ effort:, budget: }` hash and forwards it via `chat.with_thinking(**)`. This is provider-agnostic (supports OpenAI `reasoning_effort` AND Anthropic extended-thinking budget). Precedence: per-attempt / context `reasoning_effort` overrides class-level `thinking[:effort]`; budget is taken from class-level `thinking[:budget]`. Behavioural change vs 0.7.x: `reasoning_effort` is now forwarded via `with_thinking` instead of `with_params`. Same wire-level OpenAI parameter; provider-agnostic Anthropic support is now automatic.
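The precedence rule above can be written down as a small pure function. This is a sketch, not the adapter's code: `resolve_thinking` and its keyword names are hypothetical, and the relative order of per-attempt versus context effort (attempt first) is an assumption, since the notes only say that both override the class-level value.

```ruby
# Hypothetical resolver illustrating the documented precedence.
def resolve_thinking(class_thinking: nil, context_effort: nil, attempt_effort: nil)
  class_thinking ||= {}

  # Assumption: per-attempt effort is consulted before context effort.
  effort = attempt_effort || context_effort || class_thinking[:effort]

  # Budget only ever comes from the class-level declaration.
  { effort: effort, budget: class_thinking[:budget] }.compact
end

resolve_thinking(class_thinking: { effort: :low, budget: 1024 }, attempt_effort: "high")
# => { effort: "high", budget: 1024 }
resolve_thinking(context_effort: "medium")
# => { effort: "medium" }
```

An empty result would mean nothing needs to be forwarded to `with_thinking` at all.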
- `ruby_llm` constraint bumped from `~> 1.0` to `~> 1.12`. `Chat#with_thinking` is the canonical path for reasoning effort + extended thinking; it shipped in RubyLLM 1.12. Adopters on `ruby_llm < 1.12` need to bump RubyLLM before upgrading this gem to 0.8.0.
- Tagline + README opening: repositioned around "Contracts + Evals for RubyLLM". New "Relation to RubyLLM::Agent" section explicitly frames Step as a sibling abstraction (same niche as Agent, wider contract), not an alternative or foundation. README does not claim "Step uses Agent under the hood"; the current call path is `Step → Runner → Adapters::RubyLLM → RubyLLM.chat` directly.
- `TokenEstimator` documented as heuristic: module docstring expanded with explicit "±30% accuracy" framing. Refusal messages from `LimitChecker` now include a `(heuristic ±30%)` suffix so adopters know the pre-flight number is estimated, not measured. RubyLLM 1.14 also has no pre-flight tokenizer; `RubyLLM::Tokens` is post-hoc only.
- `CostCalculator` repositioned in docs: module narrative reframed from "cost calculator" to "fine-tune pricing registry + lookup with fallback chain". Math methods (`compute_cost`, `token_cost`, etc.) were already private; this release makes the docs match. Public API surface unchanged: `register_model`, `unregister_model`, `reset_custom_models!`, `calculate`.
- `output_schema` reframed in docs: described as "wrapper around `RubyLLM::Schema` + client-side validation step", not a standalone feature. The schema language is identical to what `RubyLLM::Agent.schema` accepts; the difference is what wraps it.
- README retry framing: `retry_policy escalate(...)` (model escalation on validation failure) is the marketed default. `retry_policy attempts: N` (same-model retry) stays in the API for backward compat and niche cases (subjective criteria, multi-step pipelines, weaker models) but is no longer marketed as a recommended default. Empirical basis: four small experiments across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation found no useful lift for nano-class models on tasks with clear correctness criteria.
- New disambiguation paragraphs in `prompt_ast.md` (`Step.input_type` vs `RubyLLM::Agent.inputs`; `Prompt::Builder` multi-role DSL vs Agent ERB single-string template loader), `testing.md` (`Step.observe` vs `Chat#on_end_message`/`on_tool_call`), `output_schema.md` (relation to `Agent.schema`), and `optimizing_retry_policy.md` (orthogonal model + thinking dimensions).
- `getting_started.md` refusal message example updated to include the new `(heuristic ±30%)` suffix.
- #11 (Optimizer is blind to same-model attempts): closed after empirical experiments. `attempts: N` retry stays in the API; not marketed as a default.
- #6 (Production cost reporting): already implemented in 0.7.x; close confirmed.
- `output_schema` Proc form for runtime-input-aware schemas (parity with `Agent.schema` Proc form). Additive, low-risk; deferred to 0.9 to keep 0.8 scope tight.
- H4 (Step composing `RubyLLM::Agent` internally as config holder): verified feasible but ROI insufficient for current adopter base; trigger-based revisit, no calendar commitment.
Adoption-friction release. No runtime behavior changes — every delta is in docs/, examples/, or spec/integration/ (plus the version.rb / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.
- New guide: `docs/guide/why.md`. Four production failure modes the gem exists for (schema-valid logically wrong, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
- New guide: `docs/guide/rails_integration.md`. Seven Rails-specific FAQs with runnable snippets: where step classes live (`app/contracts/`), initializer setup, background jobs, `around_call` observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
- README adoption-friction pass: added a short "Do I need this?" block after Install, a reading-order hint (`README → why.md → getting_started.md`), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
- TL;DR box at the top of every guide: single-sentence orientation for readers who land via search; "Skip if" clause added where real confusion exists (`eval_first.md`, `testing.md`, `migration.md`).
- API coverage gaps closed: `estimate_cost` / `estimate_eval_cost`, `max_cost on_unknown_pricing: :warn`, `run_eval(..., concurrency:)`, and `around_call` testing patterns are now documented in `getting_started.md`, `eval_first.md`, `testing.md`.
- Industry-standard terminology: `temperature-locked` → `fixed-temperature`, `variance-induced` → `sampling variance`, `severity signals` → `severity keywords`, `takeaway drift` → `tone/takeaways mismatch`.
- `docs/architecture.md` refresh: diagram now reflects the current class layout. Added `Step::RetryPolicy`, `Pipeline::Result`, `Eval::AggregatedReport`, `Eval::BaselineDiff`, `Eval::PromptDiffComparator`, `Eval::EvalHistory`, `Eval::RetryOptimizer`, `OptimizeRakeTask`. Replaced the outdated `Eval::TraitEvaluator` entry with `Eval::ExpectationEvaluator`.
- Business framing added to guides: every guide opens with a concrete production scenario or "why it matters" hook before the API reference.
The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's SummarizeArticle case.
| # | File | Answers |
|---|---|---|
| 00 | `00_basics.rb` | How do I start? (seven incremental layers + real-LLM pointer) |
| 01 | `01_fallback_showcase.rb` | Show me the gem in 30 seconds (zero API keys) |
| 02 | `02_real_llm_minimal.rb` | How do I plug in a real LLM? (~30 lines) |
| 03 | `03_summarize_with_keywords.rb` | How does the contract evolve? (growing prompt) |
| 04 | `04_summarize_and_translate.rb` | Pipeline composition + pipeline-level `run_eval` |
| 05 | `05_eval_dataset.rb` | How do I stop silent prompt regressions? |
| 06 | `06_retry_variants.rb` | `attempts: 3`, `reasoning_effort` escalation, cross-provider (Ollama → Anthropic → OpenAI) |
Every file carries an "Expected output" block in its header so readers see the result without running the script. The docs/ideas/ directory is now fully untracked (already in .gitignore; one stray file removed from version control).
- Schema pitfall fixed in 5 files: `array :x do; string :y; ...; end` silently produces `items: string` and drops every declaration after the first, matching the documented pitfall in `spec/ruby_llm/contract/nested_schema_spec.rb:71`. Every affected array block is now wrapped in `object do...end`.
- `examples/05_eval_dataset.rb` (pre-renumber: `09_eval_dataset.rb`): `result[:passed]` → `result.passed?`. The previous code called `[]` on an `Eval::CaseResult` and raised `NoMethodError` at runtime.
- New `spec/integration/pipeline_eval_spec.rb`: three cases guaranteeing pipeline-level `run_eval` stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate `validate` rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts `step_status == :validation_failed` and the validate's label in `details`, so a regression that short-circuits on schema instead of validate would fail loudly.
- `examples/01_classify_threads.rb`, `02_generate_comment.rb`, `03_target_audience.rb`, `10_reddit_full_showcase.rb`, `spec/integration/reddit_pipeline_spec.rb`: Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
- `examples/02_output_schema.rb`: fully covered by `docs/guide/output_schema.md`; deleting avoids duplication.
- Terminal output labels renamed for consistency with README narrative. `print_summary` now prints `Hardest eval` (was `Constraining eval`), `Suggested fallback list` (was `Suggested chain`), and the production-mode table uses `first-attempt` / `fallback %` as column headers (was `single-shot` / `escalation`). Programmatic metric names unchanged: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. `RetryOptimizer::Result` exposes `hardest_eval` as an alias for `constraining_eval`.
- `docs/guide/optimizing_retry_policy.md` rewritten: 17.7k → 6.4k characters. Continues the `SummarizeArticle` narrative from README. Offline mode clearly positioned as wiring-check; real optimization runs via `LIVE=1 RUNS=3`. Output samples match the actual `print_summary` format.
- `docs/guide/getting_started.md` rewritten: 8.7k → 6.1k. Every example uses `SummarizeArticle`. Evals + CI gates section moved before Budget caps. Structured Prompts / Dynamic Prompts / "Already using ruby_llm?" / Reasoning effort sections removed; content delegated to `prompt_ast.md` and README.
- `docs/guide/eval_first.md` refined: 6.3k → 5.0k. Switched to the `SummarizeArticle` case. Team workflow section compressed with links back to `getting_started.md` for the matcher chain.
- `docs/guide/testing.md` refined: 10.7k → 7.4k. Switched to the `SummarizeArticle` case. Threshold gating / Rake task / baseline walkthrough / prompt A/B sections delegated back to `getting_started.md` and `eval_first.md`.
- `docs/guide/output_schema.md` DSL bug fix: the Supported constraints table documented JSON Schema camelCase keys (`minLength`, `minItems`, `additionalProperties`) that are not valid DSL arguments. Every copy-paste from the previous table would have raised `ArgumentError`. Switched to snake_case (`min_length`, `min_items`, `additional_properties`) as the DSL actually expects; added a short note on the internal camelCase conversion.
- `docs/guide/best_practices.md`, `pipeline.md`, `migration.md` sanity pass: terminology alignment (model escalation → model fallback where narrative; `escalate` DSL method unchanged) and the `SummarizeArticle` case where the guide is not inherently multi-step.
- `Step::Base#run_once` no longer swallows adapter-phase `ArgumentError` as `:input_error`. The previous blanket `rescue ArgumentError` was there to convert DSL misconfiguration (e.g. missing `prompt`) into an `:input_error` Result. Side effect: programmer bugs in adapter code that raised `ArgumentError` (wrong arity, bad config argument) were silently coerced into `:input_error` and retried as if the user had given bad input. Now the rescue is narrowed to the Runner-construction phase only: DSL configuration errors still produce `:input_error` (the `prompt has not been set` case is regression-tested), but `ArgumentError` raised from adapter code during `Runner#call` propagates to the caller. Input-type validation failures continue to produce `:input_error` through `InputValidator`'s own scoped rescue, unchanged.
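The difference between a blanket rescue and a rescue scoped to the construction phase is easier to see in code. The following is a self-contained sketch of the pattern, not the gem's `Step::Base#run_once`: `SketchRunner`, `SketchResult`, and the lambda adapter are hypothetical stand-ins.

```ruby
SketchResult = Struct.new(:status, :error, :value, keyword_init: true)

# Stand-in runner: construction validates configuration, call invokes the adapter.
class SketchRunner
  def initialize(prompt:, adapter:)
    raise ArgumentError, "prompt has not been set" if prompt.nil?

    @prompt = prompt
    @adapter = adapter
  end

  def call(input)
    SketchResult.new(status: :ok, value: @adapter.call(@prompt, input))
  end
end

def run_once(input, prompt:, adapter:)
  runner =
    begin
      SketchRunner.new(prompt: prompt, adapter: adapter)
    rescue ArgumentError => e
      # Configuration mistakes surface as :input_error, as before.
      return SketchResult.new(status: :input_error, error: e.message)
    end

  # Outside the begin/rescue: an ArgumentError raised here propagates.
  runner.call(input)
end

run_once("hi", prompt: nil, adapter: ->(_p, _i) { "unused" }).status # => :input_error
```

An adapter that raises `ArgumentError` inside `call` now reaches the caller as an exception instead of being reported as bad input and retried.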
- `:adapter_error` removed from `DEFAULT_RETRY_ON`. New default: `[:validation_failed, :parse_error]`. `ruby_llm` already retries transport errors (`RateLimitError`, `ServerError`, `ServiceUnavailableError`, `OverloadedError`, timeouts) at the Faraday layer, so the previous default re-ran the same model on errors the HTTP middleware already retried with backoff. To restore pre-0.7 behavior: `retry_on :validation_failed, :parse_error, :adapter_error`. Recommended pattern: pair `:adapter_error` with `escalate "model_a", "model_b"`, since a different model/provider can bypass what transport retry could not.
- `AdapterCaller` narrows `rescue` from `StandardError` to `RubyLLM::Error` + `Faraday::Error`. Provider errors and transport errors that escape ruby_llm's Faraday retry middleware (`Faraday::TimeoutError`, `Faraday::ConnectionFailed`) still produce `:adapter_error` as before. Programmer errors that are neither (`NoMethodError`, adapter code bugs) now propagate instead of being silently converted to `:adapter_error` and retried. Known limitation: adapter code raising `ArgumentError` is still coerced into `:input_error` by `Step::Base#run_once` (which rescues `ArgumentError` for input-type validation). Disambiguating adapter-ArgumentError vs input-validation-ArgumentError requires a `run_once` refactor and is tracked as a follow-up.
If you rely on the old behavior, opt in explicitly:

```ruby
retry_policy do
  attempts 3
  retry_on :validation_failed, :parse_error, :adapter_error
end
```

Or better, with a model fallback chain:

```ruby
retry_policy do
  escalate "gpt-4.1-nano", "gpt-4.1-mini"
  retry_on :validation_failed, :parse_error, :adapter_error
end
```

- `production_mode:` on `compare_models` and `optimize_retry_policy`: measures retry-aware, end-to-end cost per successful output. Pass `production_mode: { fallback: "gpt-5-mini" }` and each candidate runs with a runtime-injected `[candidate, fallback]` retry chain. The report exposes `escalation_rate`, `single_shot_cost`, and `effective_cost` so "the cheaper candidate" decision matches production cost rather than first-attempt cost.
- New Report metrics: `escalation_rate`, `single_shot_cost`, `effective_cost`, `single_shot_latency_ms`, `effective_latency_ms`, `latency_percentiles` (p50/p95/max). `AggregatedReport` averages all of them across `runs:`.
- Extended `ModelComparison#table`: when `production_mode:` is set, renders a `Chain` column (`candidate → fallback`) with `single-shot`, `escalation`, `effective cost`, `latency`, `score`. Edge case `candidate == fallback` renders as a single model and `—` in the escalation column, with retry injection skipped entirely so `effective == single-shot` by construction, not by coincidence.
- `context[:retry_policy_override]`: new context key that nullifies or replaces class-level `retry_policy` for a single call. Used internally by production-mode injection; safe to use directly when you need a transient override that doesn't mutate the step class.
- Single-fallback (2-tier) chains only. Multi-tier chains can be inspected post-hoc via `trace.attempts` but aren't summarized in the optimize table.
- Costs with `runs: 3` + `production_mode: { fallback: "gpt-5-mini" }` are ≈3× a single-shot eval plus the actual retry attempts, not 6×. Production-mode metrics come from a single pass.
- Step-only. Calling `compare_models` with `production_mode:` on a `Pipeline::Base` subclass raises `ArgumentError`: retry injection is Step-level and pipeline-wide fallback semantics aren't defined yet. Benchmark individual steps.
- Guide: Production-mode cost measurement — API, metric interpretation, 2-tier scope note.
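To make the relationship between the production-mode metrics concrete, here is a rough sketch of how such numbers could be derived from per-case attempt costs. The formulas are assumptions made for illustration (the notes above name the metrics but do not define them), and `CaseRun` and `production_metrics` are hypothetical. Costs are integers in micro-dollars to keep the arithmetic exact.

```ruby
# One eval case: cost of the candidate attempt, plus the fallback attempt's
# cost when the candidate failed and the chain escalated (nil otherwise).
CaseRun = Struct.new(:first_cost, :fallback_cost, keyword_init: true)

# Assumes a non-empty list of case runs.
def production_metrics(case_runs)
  escalated   = case_runs.count { |run| run.fallback_cost }
  single_shot = case_runs.sum(&:first_cost)
  effective   = single_shot + case_runs.sum { |run| run.fallback_cost || 0 }

  {
    escalation_rate: escalated.fdiv(case_runs.size), # share of cases needing the fallback
    single_shot_cost: single_shot,                   # first attempts only
    effective_cost: effective                        # first attempts plus actual retries
  }
end

runs = [
  CaseRun.new(first_cost: 100, fallback_cost: nil),
  CaseRun.new(first_cost: 100, fallback_cost: 900),
  CaseRun.new(first_cost: 100, fallback_cost: nil),
  CaseRun.new(first_cost: 100, fallback_cost: nil)
]

production_metrics(runs)
# => { escalation_rate: 0.25, single_shot_cost: 400, effective_cost: 1300 }
```

In this made-up example the candidate looks cheap on first-attempt cost (400) but costs more than three times that once the single escalation is counted, which is the gap production mode is meant to expose.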
- `runs:` parameter on `compare_models` and `optimize_retry_policy`: runs each candidate N times per eval and aggregates the mean score, mean cost per run, and mean latency. Reduces sampling variance in live mode where LLM outputs are non-deterministic (the gpt-5 family enforces `temperature=1.0` server-side, so a single unlucky sample can misclassify a viable candidate as "failing"). Default `runs: 1` is backward compatible.
- `RUNS=N` on `rake ruby_llm_contract:optimize`: CLI flag for variance-aware optimization.
- `Eval::AggregatedReport`: duck-type `Report` exposing `score` (mean), `score_min` / `score_max` (spread), `total_cost` (mean per run), `pass_rate` (clean-pass count `x/N`), and `clean_passes`.
- Guide: Reducing variance with `runs:` (when to use it and why).
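The aggregation described above amounts to a handful of reductions over per-run reports. A minimal sketch follows, assuming each run exposes a score, a total cost, and a pass flag; `RunReport` and `aggregate_runs` are hypothetical names and this is not `Eval::AggregatedReport` itself.

```ruby
# Hypothetical per-run summary; "passed" is assumed to mean every case passed.
RunReport = Struct.new(:score, :total_cost, :passed, keyword_init: true)

def aggregate_runs(reports)
  scores = reports.map(&:score)
  clean  = reports.count(&:passed)

  {
    score: scores.sum / scores.size.to_f,                        # mean score
    score_min: scores.min,                                       # spread, lower bound
    score_max: scores.max,                                       # spread, upper bound
    total_cost: reports.sum(&:total_cost) / reports.size.to_f,   # mean cost per run
    clean_passes: clean,
    pass_rate: "#{clean}/#{reports.size}"                        # clean-pass count x/N
  }
end

reports = [
  RunReport.new(score: 1.0,  total_cost: 0.5, passed: true),
  RunReport.new(score: 0.5,  total_cost: 0.5, passed: false),
  RunReport.new(score: 0.75, total_cost: 1.0, passed: false),
  RunReport.new(score: 0.75, total_cost: 1.0, passed: false)
]

aggregate_runs(reports)[:pass_rate] # => "1/4"
```

Reporting the min/max spread alongside the mean is what makes a single unlucky sample visible rather than decisive.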
- `Step.optimize_retry_policy`: runs `compare_models` on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. The chain's last model always passes all evals (safe fallback).
- `rake ruby_llm_contract:optimize`: one-command retry chain optimization. Prints score table, constraining eval, suggested chain, and copy-paste DSL.
- Offline by default: `optimize` uses `sample_response` (zero API calls) unless `LIVE=1` or `PROVIDER=` is set.
- `EVAL_DIRS=` support: non-Rails setups can specify eval file directories.
- Guide: Optimizing retry_policy (full procedure with prerequisites, troubleshooting, and real-world example).
- Chain semantics aligned with `retry_executor`: retry fires on `validation_failed`/`parse_error`, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns an empty chain.
- Removed ActiveSupport dependency from rake task (`.presence` → `.empty?`).
- Added `require "set"` for non-Rails environments.
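The two chain properties stated above (the last model passes every eval, and disjoint coverage yields an empty chain) can be shown with a deliberately simplified sketch. This is not the optimizer's algorithm: `suggest_chain` and `constraining_eval` are hypothetical, the matrix is reduced to pass/fail booleans, and models are assumed to be listed cheapest first.

```ruby
# matrix: { model => { eval_name => passed? } }, cheapest model first (assumption).
def suggest_chain(matrix)
  models = matrix.keys

  # The safe fallback is the cheapest model that passes every eval.
  fallback = models.find { |model| matrix[model].values.all? }
  return [] unless fallback # disjoint coverage: no safe chain exists

  cheapest = models.first
  cheapest == fallback ? [fallback] : [cheapest, fallback]
end

# The eval passed by the fewest models is the one that constrains the chain.
def constraining_eval(matrix)
  evals = matrix.values.flat_map(&:keys).uniq
  evals.min_by { |name| matrix.count { |_model, row| row[name] } }
end

matrix = {
  "nano" => { "e1" => true, "e2" => false },
  "mini" => { "e1" => true, "e2" => true }
}

suggest_chain(matrix)     # => ["nano", "mini"]
constraining_eval(matrix) # => "e2"
```

With `{ "a" => { "e1" => true, "e2" => false }, "b" => { "e1" => false, "e2" => true } }` the sketch returns `[]`, mirroring the disjoint-coverage case.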
- Multi-provider operator tooling: rake tasks support `PROVIDER=openai|anthropic|ollama`, `CANDIDATES=model@effort,...`, and `REASONING_EFFORT=low|medium|high`.
- `rake ruby_llm_contract:recommend`: wraps `Step.recommend` with a CLI interface; prints best config, retry chain, DSL, rationale, and savings.
- Ollama support: `PROVIDER=ollama` with configurable `OLLAMA_API_BASE`.
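For readers wiring similar flags into their own tasks, the `CANDIDATES=model@effort,...` format maps naturally onto the `{ model:, reasoning_effort: }` candidate hashes. The parser below is a sketch under assumptions: `parse_candidates` is a hypothetical helper, and details such as whitespace trimming and using a default effort for entries without `@` are guesses, not documented behaviour of the rake task.

```ruby
# Turns "gpt-5-nano, gpt-5-mini@high" into candidate config hashes.
def parse_candidates(raw, default_effort: nil)
  raw.to_s.split(",").map(&:strip).reject(&:empty?).map do |entry|
    model, effort = entry.split("@", 2)
    effort ||= default_effort # e.g. a REASONING_EFFORT value (assumption)

    config = { model: model }
    config[:reasoning_effort] = effort if effort
    config
  end
end

parse_candidates("gpt-5-nano, gpt-5-mini@high")
# => [{ model: "gpt-5-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" }]
```

In a rake task the input would typically come from `ENV["CANDIDATES"]`; `raw.to_s` makes an unset variable yield an empty list.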
"What should I do?" — model + configuration recommendation.
- `Step.recommend`: `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs the eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
- Candidates as configurations: `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
- `compare_models` extended: new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in the output table.
- Per-attempt `reasoning_effort` in retry policies: `escalate` accepts config hashes: `escalate({ model: "gpt-5-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own `reasoning_effort` forwarded to the provider.
- `pass_rate_ratio`: numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
- History entries enriched: `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
v0.2: "Which model?" → compare_models (snapshot)
v0.3: "Did it change?" → baseline regression (binary)
v0.4: "Show me the trend" → eval history (time series)
v0.5: "Which prompt is better?" → compare_with (A/B testing)
v0.6: "What should I do?" → recommend (actionable advice)
- `reasoning_effort` forwarded to provider: `context: { reasoning_effort: "low" }` is now passed through `with_params` to the LLM. Previously it was accepted as a known context key but silently ignored by the RubyLLM adapter.
Data-Driven Prompt Engineering.
- `observe` DSL: soft observations that log but never fail. `observe("scores differ") { |o| o[:a] != o[:b] }`. Results in `result.observations`. Logged via `Contract.logger` when they fail. Runs only when validation passes.
- `compare_with`: prompt A/B testing. `StepV2.compare_with(StepV1, eval: "regression", model: "nano")` returns `PromptDiff` with `improvements`, `regressions`, `score_delta`, `safe_to_switch?`. Reuses `BaselineDiff` internally.
- RSpec `compared_with` chain: `expect(StepV2).to pass_eval("x").compared_with(StepV1).without_regressions` blocks merge if the new prompt regresses any case.
v0.2: "Which model?" → compare_models (snapshot)
v0.3: "Did it change?" → baseline regression (binary)
v0.4: "Show me the trend" → eval history (time series)
v0.5: "Which prompt is better?" → compare_with (A/B testing)
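The shape of a prompt A/B comparison can be sketched without the gem. The function below is an illustration only: `prompt_diff` is hypothetical, results are reduced to pass/fail per case, and treating "no regressions" as the safe-to-switch criterion is an assumption consistent with the `without_regressions` matcher described above.

```ruby
# candidate / baseline: { case_name => passed? } for the same eval.
def prompt_diff(candidate, baseline)
  shared = candidate.keys & baseline.keys
  score  = ->(results) { results.values.count(true).fdiv(results.size) }

  improvements = shared.select { |name| candidate[name] && !baseline[name] }
  regressions  = shared.select { |name| !candidate[name] && baseline[name] }

  {
    improvements: improvements,
    regressions: regressions,
    score_delta: score.call(candidate) - score.call(baseline),
    safe_to_switch: regressions.empty? # assumption: any regression blocks the switch
  }
end

baseline  = { a: true, b: false, c: true,  d: true }
candidate = { a: true, b: true,  c: false, d: true }

prompt_diff(candidate, baseline)
# => { improvements: [:b], regressions: [:c], score_delta: 0.0, safe_to_switch: false }
```

The example shows why a per-case diff matters: the aggregate score is unchanged, yet one case regressed, so the switch is not safe.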
Audit hardening — 18 bugs fixed across 4 audit rounds.
- RakeTask history before abort: `track_history` now saves all reports (pass and fail) before gating, so failed runs appear in eval history.
- RSpec/Minitest stub scoping: block form `stub_step` uses thread-local overrides with real cleanup. Non-block `stub_all_steps` is auto-restored by the RSpec `around(:each)` hook and Minitest `setup`/`teardown`.
- StepAdapterOverride: handles `context: nil` and respects string key `"adapter"`. Moved to `contract.rb` so both test frameworks share one mechanism.
- max_cost fail closed output estimate: preflight uses 1x input tokens as the output estimate when `max_output` is not set, preventing cost bypass for output-expensive models.
- reset_configuration! clears overrides: `step_adapter_overrides` is now cleared on reset.
- CostCalculator.register_model: validates `Numeric`, `finite?`, non-negative. Rejects NaN, Infinity, strings, nil.
- Pipeline token_budget: rejects negative and zero values (parity with `timeout_ms`).
- track_history model fallback: uses step DSL `model`, then `default_model` when context has no model. Handles string key `"model"`.
- estimate_cost / estimate_eval_cost: falls back to step DSL model when no explicit model arg is given.
- stub_steps string keys: both RSpec and Minitest normalize string-keyed options with `transform_keys(:to_sym)`.
- DSL `:default` reset: `model(:default)`, `temperature(:default)`, `max_cost(:default)` reset inherited parent values.
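Two of the hardening items above (price validation in `register_model`, and the 1x-input output estimate in the `max_cost` preflight) combine into a small fail-closed check. The sketch below is standalone and hypothetical: `SKETCH_PRICING`, `register_price`, and `preflight` are illustrative names, the formula is the ordinary per-million-token price calculation, and none of this is the gem's `CostCalculator`.

```ruby
SKETCH_PRICING = {} # model name => { input_per_1m:, output_per_1m: }

def register_price(name, input_per_1m:, output_per_1m:)
  [input_per_1m, output_per_1m].each do |price|
    # Rejects NaN, Infinity, strings, nil, and negative numbers.
    valid = price.is_a?(Numeric) && price.finite? && price >= 0
    raise ArgumentError, "price must be a finite, non-negative number" unless valid
  end
  SKETCH_PRICING[name] = { input_per_1m: input_per_1m, output_per_1m: output_per_1m }
end

def preflight(model:, input_tokens:, max_cost:, max_output: nil, on_unknown_pricing: :refuse)
  price = SKETCH_PRICING[model]
  # Fail closed: unknown pricing refuses unless the caller opted into :warn.
  return (on_unknown_pricing == :warn ? :proceed_with_warning : :refused) if price.nil?

  # Without max_output, assume the output is as large as the input (1x).
  output_tokens = max_output || input_tokens
  estimate = (input_tokens * price[:input_per_1m] +
              output_tokens * price[:output_per_1m]) / 1_000_000.0

  estimate <= max_cost ? :proceed : :refused
end

register_price("ft:gpt-4o", input_per_1m: 3.0, output_per_1m: 6.0)
preflight(model: "ft:gpt-4o", input_tokens: 1_000, max_cost: 0.005) # => :refused
preflight(model: "ft:gpt-4o", input_tokens: 1_000, max_cost: 0.005, max_output: 100) # => :proceed
```

The first call is refused because the assumed 1,000 output tokens push the estimate to 0.009; declaring a tighter `max_output` brings it under the cap.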
- `stub_steps` (plural): stub multiple steps with different responses in one block. No nesting needed. Works in RSpec and Minitest:

```ruby
stub_steps(
  ClassifyTicket => { response: { priority: "high" } },
  RouteToTeam => { response: { team: "billing" } }
) { TicketPipeline.run("test") }
```
Production feedback release.
- `stub_step` block form: `stub_step(Step, response: x) { test }` auto-resets the adapter after the block. Works in RSpec and Minitest. Eliminates leaked test state.
- Minitest per-step routing: `stub_step(StepA, ...)` now actually routes to StepA only (it was setting a global adapter, ignoring the step class).
- `track_history` in RakeTask: `t.track_history = true` auto-appends every eval run (pass and fail) to `.eval_history/`. Drift detection without manual `save_history!` calls.
- `max_cost` fail closed: unknown model pricing now refuses the call instead of silently skipping. Set `on_unknown_pricing: :warn` for the old behavior.
- `CostCalculator.register_model`: register pricing for custom/fine-tuned models: `register_model("ft:gpt-4o", input_per_1m: 3.0, output_per_1m: 6.0)`.
- RakeTask lazy context: `t.context` now accepts a Proc, resolved at task runtime (after `:environment`). Fixes the adapter not being available at Rake load time in Rails apps.
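The block form's "auto-reset" guarantee comes down to restoring the previous overrides in an `ensure`. Below is a standalone sketch of that pattern; `StubSketch` is a hypothetical module, not the gem's test helper. Note that `Thread.current[...]` storage in Ruby is scoped to the current fiber of the current thread.

```ruby
module StubSketch
  # Per-thread (strictly, per-fiber) override table: step => canned response.
  def self.overrides
    Thread.current[:stub_sketch_overrides] ||= {}
  end

  def self.stub_step(step, response:)
    previous = overrides.dup
    overrides[step] = response
    return unless block_given? # non-block form: cleanup is the caller's job

    begin
      yield
    ensure
      # Restore even when the block raises, so no state leaks between tests.
      Thread.current[:stub_sketch_overrides] = previous
    end
  end

  # Returns the stub for this step only; other steps take the real path.
  def self.run(step, input)
    overrides.fetch(step) { "real call for #{input}" }
  end
end

StubSketch.stub_step(:classify, response: { priority: "high" }) do
  StubSketch.run(:classify, "ticket") # => { priority: "high" }
end
StubSketch.run(:classify, "ticket")   # => "real call for ticket"
```

Keying the table by step is also what gives per-step routing: stubbing one step leaves every other step untouched.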
- RakeTask `:environment` fix: uses `defined?(::Rails)` instead of `Rake::Task.task_defined?(:environment)`. Works in Rails 8 without manual `Rake::Task.enhance`.
- Concurrent eval deterministic: `clone_for_concurrency` protocol, `ContextHelpers` extracted.
- README: added eval history, concurrency, and quality tracking examples.
Observability & Scale — see what changed, run it fast, debug it easily.
- Structured logging — `Contract.configure { |c| c.logger = Rails.logger }`. Auto-logs model, status, latency, tokens, cost on every `step.run`.
- Batch eval concurrency — `run_eval("regression", concurrency: 4)`. Parallel case execution via `Concurrent::Future`. 4x faster CI for large eval suites.
- Eval history & trending — `report.save_history!` appends to JSONL. `report.eval_history` returns `EvalHistory` with `score_trend`, `drift?`, run-by-run scores.
- Pipeline per-step eval — `add_case(..., step_expectations: { classify: { priority: "high" } })`. See which step in a pipeline regressed.
- Minitest support — `assert_satisfies_contract`, `assert_eval_passes`, `stub_step` for Minitest users. `require "ruby_llm/contract/minitest"`.
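The eval history idea (append each run to a JSONL file, then look at the score series) can be sketched in a few lines of plain Ruby. `HistoryLog` is a hypothetical class, not the gem's `EvalHistory`, and the drift rule used here (latest score more than a threshold below the mean of earlier runs) is an assumption chosen for illustration.

```ruby
require "json"
require "tempfile"

# Hypothetical sketch: HistoryLog is illustrative, not the gem's EvalHistory.
class HistoryLog
  def initialize(path)
    @path = path
  end

  # Append one run as a single JSON line.
  def append(score:, at: Time.now.utc.to_s)
    File.open(@path, "a") { |f| f.puts(JSON.generate({ score: score, at: at })) }
  end

  # Scores in the order they were recorded.
  def scores
    return [] unless File.exist?(@path)

    File.readlines(@path, chomp: true).reject(&:empty?).map { |line| JSON.parse(line)["score"] }
  end

  # Assumed drift rule: the latest score is more than `threshold` below the
  # mean of all earlier runs.
  def drift?(threshold: 0.1)
    *earlier, latest = scores
    return false if earlier.empty?

    latest < (earlier.sum / earlier.size.to_f) - threshold
  end
end

file = Tempfile.new(["eval_history", ".jsonl"])
log = HistoryLog.new(file.path)
[1.0, 0.9, 0.6].each { |s| log.append(score: s) }

log.scores # => [1.0, 0.9, 0.6]
log.drift? # => true (0.6 is well below the earlier mean of 0.95)
```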
v0.2: "Which model?" → compare_models (snapshot)
v0.3: "Did it change?" → baseline regression (binary)
v0.4: "Show me the trend" → eval history (time series)
"Which step changed?" → pipeline per-step eval
"Run it fast" → batch concurrency
- Trait missing key = error — `expected_traits: { title: 0..5 }` on output `{}` now fails instead of silently passing.
- nil input in dynamic prompts — `run(nil)` with `prompt { |input| ... }` correctly passes nil to block.
- Defensive sample pre-validation — `sample_response` uses the same parser as runtime (handles code fences, BOM, prose around JSON).
- Baseline diff excludes skipped — self-compare with skipped cases no longer shows artificial score delta.
- Zeitwerk eval/ ignore — `eager_load_contract_dirs!` ignores `eval/` subdirs before eager load.
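The "missing key = error" rule for traits is easy to get wrong, because matching `nil` against a matcher can quietly return a result instead of flagging the absence. The sketch below shows one way to treat a missing key as an explicit mismatch. `trait_mismatches` is a hypothetical helper, not the gem's matcher; it relies on Ruby's case equality (`===`), which Range, Regexp and Proc all implement.

```ruby
# Hypothetical sketch: trait_mismatches is illustrative, not the gem's matcher.
# Returns a hash of key => reason for every trait that does not hold.
def trait_mismatches(output, traits)
  traits.each_with_object({}) do |(key, matcher), mismatches|
    unless output.key?(key)
      mismatches[key] = "missing key"
      next
    end
    value = output[key]
    # Range, Regexp, Proc and plain values all respond to ===.
    mismatches[key] = "got #{value.inspect}" unless matcher === value
  end
end

trait_mismatches({}, { title: 0..5 })                      # => { title: "missing key" }
trait_mismatches({ score: 4 }, { score: ->(v) { v > 3 } }) # => {}
trait_mismatches({ score: 9 }, { score: 1..5 })            # => { score: "got 9" }
```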
- Recursive array/object validation — nested arrays (array of array of string) validated recursively. Object items validated even without `:properties` (e.g. `additionalProperties: false`).
- Deep symbolize in sample pre-validation — array samples with string keys (`[{"name" => "Alice"}]`) correctly symbolized before schema validation.
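As an illustration of recursive validation for nested arrays, here is a deliberately small validator. It is a hypothetical sketch, not the gem's `SchemaValidator`: it supports only `:type`, `:items`, `:minLength` and `:maxLength`, and the error message format is invented for the example.

```ruby
# Hypothetical sketch: validate is illustrative, not the gem's SchemaValidator.
# Supports only :type, :items, :minLength and :maxLength.
TYPE_CHECKS = {
  "string" => ->(v) { v.is_a?(String) },
  "integer" => ->(v) { v.is_a?(Integer) },
  "array" => ->(v) { v.is_a?(Array) }
}.freeze

def validate(value, schema, path = "$")
  check = TYPE_CHECKS.fetch(schema[:type])
  return ["#{path}: expected #{schema[:type]}"] unless check.call(value)

  errors = []
  if value.is_a?(String)
    errors << "#{path}: shorter than minLength" if schema[:minLength] && value.length < schema[:minLength]
    errors << "#{path}: longer than maxLength" if schema[:maxLength] && value.length > schema[:maxLength]
  end
  if value.is_a?(Array) && schema[:items]
    # Recurse into every item so nesting depth does not matter.
    value.each_with_index do |item, i|
      errors.concat(validate(item, schema[:items], "#{path}[#{i}]"))
    end
  end
  errors
end

schema = { type: "array", items: { type: "array", items: { type: "string", minLength: 1 } } }

validate([["a"], ["b", ""]], schema) # => ["$[1][1]: shorter than minLength"]
validate([["a"], "oops"], schema)    # => ["$[1]: expected array"]
```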
- String constraints in SchemaValidator — `minLength`/`maxLength` enforced for root and nested strings.
- Array item validation — scalar items (string, integer) validated against items schema type and constraints.
- Non-JSON sample_response fails fast — `sample_response("hello")` with object schema raises ArgumentError at definition time instead of silently passing.
- `max_tokens` in KNOWN_CONTEXT_KEYS — no more spurious "Unknown context keys" warning.
- Duplicate models deduplicated — `compare_models(models: ["m", "m"])` runs model once.
- SchemaValidator validates non-object roots — boolean, integer, number, array root schemas now enforce type, min/max, enum, minItems/maxItems. Previously only object schemas were validated.
- Removed passing cases = regression — `regressed?` returns true when baseline had passing cases that are now missing. Prevents gate bypass by deleting eval cases.
- JSON string sample_response fixed — `sample_response('{"name":"Alice"}')` correctly parsed for pre-validation instead of double-encoding.
- `context[:max_tokens]` forwarded — overrides step's `max_output` for adapter call AND budget precheck.
- Skipped cases visible in regression diff — baseline PASS → current SKIP now detected as regression by `without_regressions` and `fail_on_regression`.
- Skip only on missing adapter — eval runner no longer masks evaluator errors as SKIP. Only "No adapter configured" triggers skip.
- Array/Hash sample pre-validation — `sample_response([{...}])` correctly validated against schema instead of silently skipping.
- `assume_model_exists: false` forwarded — boolean `false` no longer dropped by truthiness check in adapter options.
- Duplicate case names caught at definition — `add_case`/`verify` with same name raises immediately, not at run time.
- Array response preserved — `Adapters::RubyLLM` no longer stringifies Array content. Steps with `output_type Array` work correctly.
- Falsy prompt input — `run(false)` and `build_messages(false)` pass `false` to dynamic prompt blocks instead of falling back to `instance_eval`.
- `retry_on` flatten — `retry_on([:a, :b])` no longer wraps in nested array.
- Builder reset — `Prompt::Builder` resets nodes on each build (no accumulation on reuse).
- Pipeline false output — `output: false` no longer shows "(no output)" in pretty_print.
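Several fixes in this changelog share one root cause: a truthiness check that treats a legitimate `false` as "not provided". The sketch below contrasts the buggy and fixed patterns for option forwarding. The method names and the `FORWARDED_KEYS` constant are hypothetical and exist only to demonstrate the bug class; they are not the gem's internals.

```ruby
# Hypothetical sketch of the bug class; these names are illustrative only.
FORWARDED_KEYS = %i[provider assume_model_exists].freeze

# Buggy: a truthiness check silently drops `false`.
def adapter_options_buggy(context)
  FORWARDED_KEYS.each_with_object({}) do |key, opts|
    opts[key] = context[key] if context[key]
  end
end

# Fixed: forward the key whenever the caller supplied it.
def adapter_options_fixed(context)
  FORWARDED_KEYS.each_with_object({}) do |key, opts|
    opts[key] = context[key] if context.key?(key)
  end
end

context = { provider: :openai, assume_model_exists: false }

adapter_options_buggy(context) # => { provider: :openai }
adapter_options_fixed(context) # => { provider: :openai, assume_model_exists: false }
```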
Fixes from persona_tool production deployment (4 services migrated).
- Proc/Lambda in `expected_traits` — `expected_traits: { score: ->(v) { v > 3 } }` now works.
- Zeitwerk eager-load — `load_evals!` eager-loads `app/contracts/` and `app/steps/` before loading eval files. Fixes uninitialized constant errors in Rake tasks.
- Falsy values — `expected: false`, `input: false`, `sample_response(nil)` all handled correctly.
- Context key forwarding — `provider:` and `assume_model_exists:` forwarded to adapter. `schema:` and `max_tokens:` are step-level only (no split-brain).
- Deep-freeze immutability — constructors never mutate caller's data.
Baseline regression detection — know when quality drops before users do.
- `report.save_baseline!` — serialize eval results to `.eval_baselines/` (JSON, git-tracked)
- `report.compare_with_baseline` — returns `BaselineDiff` with regressions, improvements, score_delta, new/removed cases
- `diff.regressed?` — true when any previously-passing case now fails
- `without_regressions` RSpec chain — `expect(Step).to pass_eval("x").without_regressions`
- RakeTask `fail_on_regression` — blocks CI when regressions detected
- RakeTask `save_baseline` — auto-save after successful run
- Migration guide — `docs/guide/migration.md` with 7 patterns for adopting the gem in existing Rails apps
- 1086 tests, 0 failures
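A rough sketch of the regression rule, assuming each case is reduced to a status symbol: a case counts as regressed when it passed in the baseline and does anything other than pass now. That covers failing, being skipped, and being removed, which matches the rules stated elsewhere in this changelog. `regressions` is a hypothetical helper, not the gem's `BaselineDiff`.

```ruby
# Hypothetical sketch: regressions is illustrative, not the gem's BaselineDiff.
# Both arguments map case name => :pass, :fail or :skip.
def regressions(baseline, current)
  baseline.select do |name, status|
    # A missing case yields nil, which also differs from :pass.
    status == :pass && current[name] != :pass
  end.keys
end

baseline = { "billing" => :pass, "refund" => :pass, "spam" => :fail, "legal" => :pass }
current  = { "billing" => :pass, "refund" => :skip, "spam" => :pass }

regressions(baseline, current) # => ["refund", "legal"]
```

In this example `refund` regressed by being skipped, `legal` regressed by being deleted, and `spam` is an improvement rather than a regression.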
Production hardening from senior Rails review panel.
- `around_call` propagates exceptions — no longer silently swallows DB errors, timeouts, etc. Users who want errors swallowed can rescue inside their block.
- Nil section content skipped — `section "X", nil` no longer renders `"null"` to the LLM. Section is omitted entirely.
- Range support in `expected:` — `expected: { score: 1..5 }` works in `add_case`. Previously only Regexp was supported.
- `Trace#dig` — `trace.dig(:usage, :input_tokens)` works on both Step and Pipeline traces.
Fixes from first real-world integration (persona_tool).
- `around_call` fires per-run — not per-attempt. With retry_policy, callback fires once with final result. Signature: `around_call { |step, input, result| ... }`
- `Result#trace` always `Trace` object — never bare Hash. `result.trace.model` works on success AND failure.
- `around_call` exception safe — warns and returns result instead of crashing.
- `model` DSL — `model "gpt-4o-mini"` per-step. Priority: context > step DSL > global config.
- Test adapter `raw_output` always String — Hash/Array normalized to `.to_json`.
- `Trace#dig` — `trace.dig(:usage, :input_tokens)` works.
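The difference between "per-run" and "per-attempt" is easiest to see in a toy retry loop. In the sketch below the hook is invoked once, after the loop, with the final result only. `run_with_retry` and the result hash shape are hypothetical and simplified (the real callback receives `step, input, result`, per the signature above); this only illustrates the ordering.

```ruby
# Hypothetical sketch: run_with_retry is illustrative, not the gem's runner.
# `attempts` is a list of callables tried in order until one returns :ok.
def run_with_retry(attempts, around_call:)
  result = nil
  attempts.each do |attempt|
    result = attempt.call
    break if result[:status] == :ok
  end
  # The hook sees only the final result, once per run.
  around_call.call(result)
  result
end

calls = []
attempts = [
  -> { { status: :validation_failed, model: "nano" } },
  -> { { status: :ok, model: "mini" } }
]

final = run_with_retry(attempts, around_call: ->(r) { calls << r[:model] })

calls          # => ["mini"]
final[:status] # => :ok
```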
Production DX improvements from first real-world integration (persona_tool).
- `temperature` DSL — `temperature 0.3` in step definition, overridable via `context: { temperature: 0.7 }`. RubyLLM handles per-model normalization natively.
- `around_call` hook — callback for logging, metrics, observability. Replaces need for custom middleware.
- `build_messages` public — inspect rendered prompt without running the step.
- `stub_step` RSpec helper — `stub_step(MyStep, response: { ... })` reduces test boilerplate. Auto-included via `require "ruby_llm/contract/rspec"`.
- `estimate_cost` / `estimate_eval_cost` — predict spend before API calls.
- Reload lifecycle — `load_evals!` clears definitions before re-loading. Railtie hooks `config.to_prepare` for development reload. `define_eval` warns on duplicate name (suppressed during reload).
- Pipeline eval cost — uses `Pipeline::Trace#total_cost` (all steps), not just last step.
- Adapter isolation — `compare_models` and `run_all_own_evals` deep-dup context per run.
- Offline mode — cases without adapter return `:skipped` instead of crashing. Skipped cases excluded from score.
- `expected_traits` reachable from `define_eval` DSL via `add_case`.
- `verify` raises when both positional and `expect:` keyword provided.
- `best_for` excludes zero-score models from recommendation.
- `print_summary` replaces `pretty_print` (avoids `Kernel#pretty_print` shadow).
- `CaseResult#to_h` round-trips correctly (`name:` key).
- All 5 guides updated for v0.2 API
- Symbol keys documented
- Retry model priority documented
- Test adapter format documented
- 1077 tests, 0 failures
- 3 architecture review rounds, 32 findings fixed
Contracts for LLM quality. Know which model to use, what it costs, and when accuracy drops.
- `report.results` returns `CaseResult` objects instead of hashes. Use `result.name`, `result.passed?`, `result.score` instead of `result[:case_name]`, `result[:passed]`. `CaseResult#to_h` for backward compat.
- `report.print_summary` replaces `report.pretty_print` (avoids shadowing `Kernel#pretty_print`).
- `add_case` in `define_eval` — `add_case "billing", input: "...", expected: { priority: "high" }` with partial matching. Supports `expected_traits:` for regex/range matching.
- `CaseResult` value objects — `result.name`, `result.passed?`, `result.output`, `result.expected`, `result.mismatches` (structured diff), `result.cost`, `result.duration_ms`.
- `report.failures` — returns only failed cases. `report.skipped` counts skipped (offline) cases.
- Model comparison — `Step.compare_models("eval", models: %w[nano mini full])` runs same eval across models. Returns table with score/cost/latency per model. `comparison.best_for(min_score: 0.95)` returns cheapest model meeting threshold.
- Cost tracking — `report.total_cost`, `report.avg_latency_ms`, per-case `result.cost`. Pipeline eval uses total pipeline cost, not just last step.
- Cost prediction — `Step.estimate_cost(input:, model:)` and `Step.estimate_eval_cost("eval", models: [...])` predict spend before API calls.
- CI gating — `pass_eval("regression").with_minimum_score(0.8).with_maximum_cost(0.01)`. RakeTask with suite-level `minimum_score` and `maximum_cost`.
- `RubyLLM::Contract.run_all_evals` — discovers all Steps/Pipelines with evals, runs them all. Includes inherited evals.
- `RubyLLM::Contract::RakeTask` — `rake ruby_llm_contract:eval` with `minimum_score`, `maximum_cost`, `fail_on_empty`, `eval_dirs`.
- Rails Railtie — auto-loads eval files via `config.after_initialize` + `config.to_prepare` (supports development reload).
- Offline mode — cases without adapter return `:skipped` instead of crashing. Skipped cases excluded from score/passed.
- Safe `define_eval` — warns on duplicate name; suppressed during reload.
- P1: Eval files not autoloaded by Rails — Railtie uses `load` (not Zeitwerk). Hooks into reloader for dev.
- P2: report.results returns raw Hashes — now returns `CaseResult` objects.
- P3: No way to run all evals at once — `Contract.run_all_evals` + Rake task.
- P4: String vs symbol key mismatch — warns when `validate` or `verify` proc returns nil.
- Pipeline eval cost — uses `Pipeline::Trace#total_cost` (all steps), not just last step.
- Reload lifecycle — `load_evals!` clears definitions before re-loading. Registry filters stale hosts.
- Adapter isolation — `compare_models` and `run_all_own_evals` deep-dup context per run.
| Model | Score | Cost | Avg Latency |
|---|---|---|---|
| gpt-4.1-nano | 0.67 | $0.000032 | 687ms |
| gpt-4.1-mini | 1.00 | $0.000102 | 1070ms |
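Using the two rows from the comparison table above, the "cheapest model meeting the threshold" selection can be sketched as a filter followed by a minimum. `ModelRow` and this standalone `best_for` are hypothetical stand-ins for illustration, not the gem's comparison object.

```ruby
# Hypothetical sketch: ModelRow and best_for are illustrative, not the gem's API.
ModelRow = Struct.new(:model, :score, :cost, keyword_init: true)

# Cheapest model whose score meets the threshold, or nil if none qualifies.
def best_for(rows, min_score:)
  rows.select { |r| r.score >= min_score }.min_by(&:cost)
end

rows = [
  ModelRow.new(model: "gpt-4.1-nano", score: 0.67, cost: 0.000032),
  ModelRow.new(model: "gpt-4.1-mini", score: 1.00, cost: 0.000102)
]

best_for(rows, min_score: 0.95)&.model # => "gpt-4.1-mini"
best_for(rows, min_score: 0.5)&.model  # => "gpt-4.1-nano"
```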
- 1077 tests, 0 failures
- 3 architecture review rounds, 32 findings fixed
- Verified with real OpenAI API (gpt-4.1-nano, gpt-4.1-mini)
Initial release.
- Step abstraction — `RubyLLM::Contract::Step::Base` with prompt DSL, typed input/output
- Output schema — declarative structure via ruby_llm-schema, sent to provider for enforcement
- Validate — business logic checks (1-arity and 2-arity with input cross-validation)
- Retry with model escalation — start cheap, auto-escalate on contract failure or network error
- Preflight limits — `max_input`, `max_cost`, `max_output` refuse before calling the LLM
- Pipeline — multi-step composition with fail-fast, timeout, token budget
- Eval — offline contract verification with `define_eval`, `run_eval`, zero-verify auto-case
- Adapters — RubyLLM (production), Test (deterministic specs)
- RSpec matchers — `satisfy_contract`, `pass_eval`
- Structured trace — model, latency, tokens, cost, attempt log per step
- 1005 tests, 0 failures
- 42 bugs found and fixed via 10 rounds of adversarial testing
- 0 RuboCop offenses
- Parser handles: markdown code fences, UTF-8 BOM, JSON extraction from prose
- SchemaValidator: full nested validation, additionalProperties, minItems/maxItems, minLength/maxLength
- Deep-frozen parsed_output prevents mutation via shared references
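The three parser cases listed above (markdown code fences, a UTF-8 BOM, and JSON embedded in prose) can be handled with a short tolerant extraction step before `JSON.parse`. The sketch below is a hypothetical approximation, not the gem's parser: `extract_json` and both constants are invented names, and the greedy bracket match is a simplification that can misfire when the surrounding prose itself contains brackets.

```ruby
require "json"

# Hypothetical sketch: extract_json is illustrative, not the gem's parser.
FENCE = "`" * 3
FENCED_BLOCK = /#{FENCE}(?:json)?\s*\n(.*?)#{FENCE}/m

def extract_json(raw)
  # Drop a leading UTF-8 byte order mark, then surrounding whitespace.
  text = raw.delete_prefix("\uFEFF").strip
  # Prefer the body of a fenced block when one is present.
  if (fenced = text[FENCED_BLOCK, 1])
    text = fenced.strip
  end
  # Otherwise fall back to the outermost {...} or [...] span.
  candidate = text[/[\{\[].*[\}\]]/m] || text
  JSON.parse(candidate, symbolize_names: true)
end

extract_json("\uFEFF{\"name\":\"Alice\"}")                     # => { name: "Alice" }
extract_json("The answer is {\"ok\": true} as requested.")    # => { ok: true }
```

Input with no JSON at all still raises `JSON::ParserError`, so callers can treat that as a parse failure rather than receiving a silent default.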