Find the cheapest viable fallback list

Read this when you have 2+ evals and want to know empirically — not by guessing — which models belong in retry_policy and in what order.

You defined SummarizeArticle in the README with retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]. That list was a guess. optimize_retry_policy tells you which models your evals actually need, so you stop paying for the strong model when nano was enough — or stop shipping nano when the hardest eval proves it isn't.

Requirements

  • SummarizeArticle already has retry_policy. If your step has none, add one first (getting started).
  • 2–3 evals per step. One eval optimizes for one scenario; with only smoke, you get a recommendation that passes smoke but may miss production edge cases. See eval-first.
  • Rake task ruby_llm_contract:optimize is registered when lib/ruby_llm/contract/rake_task.rb is loaded. In Rails apps on gem version 0.10.4+, the railtie loads it lazily during rake invocation, so no manual setup is needed. On older versions or in non-Rails projects, add require "ruby_llm/contract/rake_task" to your Rakefile (or any file under lib/tasks/) and set EVAL_DIRS=... on the command line.

Two orthogonal dimensions to a retry chain. A chain element is { model:, reasoning_effort: } — model identity AND thinking budget. optimize_retry_policy explores both. You can pin the default thinking config at class level too — see the thinking DSL note at the end.

For this guide, assume SummarizeArticle has three evals:

SummarizeArticle.define_eval("smoke")          { ... }  # short news article
SummarizeArticle.define_eval("dense_article")  { ... }  # long form, 5 takeaways required
SummarizeArticle.define_eval("critical_tone")  { ... }  # negative review, tone must match

Offline check first

Run once offline to verify the wiring:

rake ruby_llm_contract:optimize \
  STEP=SummarizeArticle \
  CANDIDATES=gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1

The @low (or @medium, @high) suffix on a candidate is CLI shorthand for { model: "gpt-4.1-mini", reasoning_effort: :low } — saves you constructing hash strings on the command line. Plain model names without @ use the Step's default thinking config (or none).
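
As a rough illustration of the shorthand, the sketch below expands a CANDIDATES string into chain elements. `parse_candidate`, `parse_candidates`, and `EFFORTS` are hypothetical names written for this guide, not the gem's own parser, and the real task may validate input differently.

```ruby
# Hypothetical helper (not part of the gem): expands "model@effort" into the
# hash form of a chain element, { model:, reasoning_effort: }.
EFFORTS = %w[low medium high].freeze

def parse_candidate(spec)
  model, effort = spec.split("@", 2)
  # No "@" suffix: leave reasoning_effort unset so the Step default applies.
  return { model: model } if effort.nil?
  raise ArgumentError, "unknown effort: #{effort}" unless EFFORTS.include?(effort)

  { model: model, reasoning_effort: effort.to_sym }
end

# Splits the comma-separated CANDIDATES value and expands each entry.
def parse_candidates(list)
  list.split(",").map { |spec| parse_candidate(spec.strip) }
end

candidates = parse_candidates("gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1")
puts candidates.inspect
```

Under these assumptions, the second entry becomes a hash carrying both the model and the effort, while the other three carry only the model.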

Offline uses each eval's sample_response — zero API calls. Every candidate gets the same score because they all receive the canned response. That's fine for a smoke test (verifying evals load, candidates parse, output renders) but it doesn't compare model quality. For real optimization, go live.

Optimize against real models

LIVE=1 RUNS=3 rake ruby_llm_contract:optimize \
  STEP=SummarizeArticle \
  CANDIDATES=gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1

LIVE=1 makes real API calls. RUNS=3 averages each (candidate, eval) pair over three runs — necessary because OpenAI forces temperature=1.0 on gpt-5 / o-series and the same pair can score 0.00 on one run and 1.00 on the next.
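
To see why averaging matters, here is a minimal sketch of what RUNS=3 amounts to for one (candidate, eval) pair. The per-run scores are invented for illustration and `mean_score` is a hypothetical helper, not the gem's aggregation code.

```ruby
# Hypothetical helper: mean of the per-run scores for one (candidate, eval) pair.
def mean_score(scores)
  raise ArgumentError, "need at least one run" if scores.empty?

  scores.sum.to_f / scores.size
end

# Invented per-run scores: the same pair fails once and passes twice.
runs = { ["gpt-5-nano", "critical_tone"] => [0.0, 1.0, 1.0] }

# Collapse each pair's runs into one averaged score, rounded for display.
averaged = runs.transform_values { |scores| mean_score(scores).round(2) }
puts averaged.inspect
```

A single run of this pair would have reported either 0.00 or 1.00; the three-run mean lands in between and is a steadier basis for comparison.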

Output (illustrative):

SummarizeArticle — fallback list optimization

  eval             4.1-nano  4.1-mini@low  4.1-mini   4.1
  ---------------------------------------------------------
  smoke                1.00          1.00      1.00  1.00
  dense_article        0.67 ←        1.00      1.00  1.00
  critical_tone        0.50 ←        0.67 ←    1.00  1.00

  Hardest eval: critical_tone

  Suggested fallback list:
    gpt-4.1-nano         — covers 1 eval(s)
    gpt-4.1-mini         — passes all 3 evals

  DSL:
    retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]

Reading the table:

  • ← marks scores below threshold in the hardest eval. Not a selection hint, just "this candidate fails the row that matters most".
  • Hardest eval = the one that forces the strong fallback. Here, critical_tone demands gpt-4.1-mini.
  • Suggested fallback list = the shortest chain where each step covers more evals, built greedy-cheapest-first. Stops when all evals pass. Order matters: gpt-4.1-nano is tried first; on validation failure, the gem falls back to gpt-4.1-mini.
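
One way to picture the "hardest eval" idea is the sketch below, which uses the scores from the illustrative table. It assumes candidates are listed cheapest first and a pass threshold of 0.8; both the threshold and the helper names `cheapest_passing` and `hardest_eval` are assumptions made here, and the gem's actual selection logic may differ.

```ruby
# Candidates in assumed cheapest-first order, with scores copied from the table.
CANDIDATES = ["gpt-4.1-nano", "gpt-4.1-mini@low", "gpt-4.1-mini", "gpt-4.1"].freeze
SCORES = {
  "smoke"         => [1.00, 1.00, 1.00, 1.00],
  "dense_article" => [0.67, 1.00, 1.00, 1.00],
  "critical_tone" => [0.50, 0.67, 1.00, 1.00]
}.freeze

# Index of the cheapest candidate that meets the threshold, or nil if none does.
def cheapest_passing(row, threshold)
  row.index { |score| score >= threshold }
end

# The eval whose cheapest passing candidate sits furthest down the list.
# An eval that nobody passes sorts last (row.size) and so counts as hardest.
def hardest_eval(scores, threshold)
  scores.max_by { |_name, row| cheapest_passing(row, threshold) || row.size }.first
end

hardest = hardest_eval(SCORES, 0.8)
needed  = CANDIDATES[cheapest_passing(SCORES[hardest], 0.8)]
puts "#{hardest} needs #{needed}"
```

With these inputs the sketch picks critical_tone and gpt-4.1-mini, matching the "Hardest eval" line in the illustrative output.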

Copy the DSL, paste into your step, verify with rake ruby_llm_contract:eval. You just dropped gpt-4.1 from the chain — most requests finish on nano, mini handles what nano misses, and the strong model was never needed.

Measure effective cost before shipping

optimize shows first-attempt cost. In production, a candidate whose validator rejects 20% of outputs actually costs first_try_cost + fallback_cost × 0.20 per successful output. The first-attempt number hides this.
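
The formula above can be written out as a small function. This is a sketch only: `effective_cost` and its keyword names are hypothetical, the prices are invented, and it models a single fallback whose cost is paid in full whenever the validator rejects the first attempt.

```ruby
# Hypothetical helper: expected cost per successful output for a two-tier chain.
def effective_cost(first_try_cost:, fallback_cost:, rejection_rate:)
  unless (0.0..1.0).cover?(rejection_rate)
    raise ArgumentError, "rejection_rate must be between 0 and 1"
  end

  # Every request pays for the first attempt; rejected ones also pay for the fallback.
  first_try_cost + fallback_cost * rejection_rate
end

# Invented prices: a cheap first attempt rejected 20% of the time.
cost = effective_cost(first_try_cost: 0.0010, fallback_cost: 0.0030, rejection_rate: 0.20)
puts format("$%.4f", cost)
```

The gap between the first-attempt price and this number widens as the rejection rate climbs.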

production_mode: is an optional compare_models kwarg. With { fallback: "model" } it runs each candidate with a runtime [candidate, fallback] chain and reports effective cost (first-try + fallback cost weighted by rejection rate). Without it, you get only first-attempt metrics (the default).

SummarizeArticle.compare_models(
  "dense_article",
  candidates: [{ model: "gpt-5-nano" }, { model: "gpt-5-mini", reasoning_effort: "low" }],
  production_mode: { fallback: "gpt-5-mini" }
).print_summary

Output (live mode, illustrative):

dense_article — model comparison

  Chain                                    first-attempt  fallback %  effective cost  latency   score
  ---------------------------------------------------------------------------------------------------
  gpt-5-nano → gpt-5-mini                  $0.0010        33%         $0.0018         164ms     1.00
  gpt-5-mini (effort: low) → gpt-5-mini    $0.0015         5%         $0.0016         210ms     1.00
  gpt-5-mini                               $0.0030         —          $0.0030         220ms     1.00
  • first-attempt — cost of the first run alone.
  • fallback % — fraction of cases where the validator rejected and the fallback ran.
  • effective cost — total per successful output including retries.
  • — candidate equals fallback, no chain to observe.

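Run this before finalizing: a candidate that is 3× cheaper on the first attempt but falls back 60% of the time costs about 1 + 0.60 × 3 = 2.8 units per output against 3 for the fallback model alone, a saving of only about 1.07× in production.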

Scope. production_mode: supports single-fallback (2-tier) chains only. For multi-tier chains, inspect trace.attempts instead. It works at step level: calling it on Pipeline::Base raises ArgumentError.

When results look wrong

  • "No viable chain" from a single live run. Re-run with RUNS=3. If scores jump, the first run was noise. Never trust single-run results with gpt-5 / o-series in the pool — temperature=1.0 is server-enforced.
  • Every candidate fails the same eval, including the strongest. The eval is rejecting correct answers. Run the step directly (context: { retry_policy_override: nil, model: "gpt-4.1" }), inspect the output, compare with the verify block. Loosen the eval if the output is correct but not one of the accepted values.
  • Testing one specific hypothesis. (e.g. "does mini@medium help on critical_tone?") Use SummarizeArticle.compare_models("critical_tone", candidates: [{ model: "gpt-5-mini", reasoning_effort: "medium" }], runs: 3) directly — three calls instead of rerunning the whole optimize pass.

Programmatic API names

Metrics exposed on Report / AggregatedReport keep their original names: single_shot_cost, single_shot_latency_ms, escalation_rate. The optimize Result struct also exposes hardest_eval as an alias for constraining_eval.

thinking DSL note

Set the default reasoning effort once on the Step class — mirrors RubyLLM::Agent.thinking exactly:

class SummarizeArticle < RubyLLM::Contract::Step::Base
  model "gpt-5-nano"
  thinking effort: :low          # canonical
  # or
  reasoning_effort :low          # alias for thinking(effort: :low)
end

Forwarded to Chat#with_thinking(**) through the adapter — works provider-agnostically (OpenAI reasoning_effort, Anthropic extended-thinking budget). A per-call override via context: { reasoning_effort: :high } still wins over the class default.