Tool-using research agent that turns a single ticker into a citation-grounded, chart-illustrated quarterly brief in under two minutes.
Sister project of multi-horizon-financial-llm — the Gemma LoRA adapter trained there will plug in as a swappable synthesis backend in Phase 2 (A/B vs Claude Opus, paired-t methodology consistent with the sister repo's eval).
Status — v0.1 (MVP):
`earnings_recap` works end-to-end on tickers indexed in the sister repo (69 S&P 500). Factuality baseline pinned at 1.00 (17/17 claims verified) on the NVDA reference run — see `eval/runs/`. Anthropic is wired into the `synthesizer_b` slot for the Phase 2 A/B harness; v0.1 ships with Gemini 2.5 because Vertex's new-project quota algorithm rejected six Anthropic quota requests in a row (decision D-009).
```shell
$ mhfa earnings-recap NVDA --output ./outputs
brief : outputs/NVDA_20260426_brief.md
chart : outputs/NVDA_20260426_chart.png
raw   : outputs/NVDA_20260426_raw.json
tools : 5 calls, 0 errors
```

Total wall time on the reference NVDA run: ~2 s tool layer · ~30 s synthesis (Gemini 2.5 Pro) · ~30 s factuality eval (Flash). Per-tool timing from `_metadata.calls` in `outputs/NVDA_20260426_raw.json`:
| Tool | Args | Duration | OK |
|---|---|---|---|
| `sec.fetch_latest_10q` | NVDA | 0 ms (local cache hit) | ✓ |
| `sec.fetch_recent_8k` | NVDA, max=5 | 0 ms (local cache hit) | ✓ |
| `market.get_quote_history` | NVDA, 3mo | 702 ms | ✓ |
| `market.get_company_info` | NVDA | 329 ms | ✓ |
| `search.web_search` | "NVDA latest earnings analyst reaction" | 923 ms | ✓ |
*Brief render (GitHub preview · raw markdown)*
*Price chart (3-month, auto-generated)*
The brief is structured (exec summary → financial highlights table → price action + chart → recent catalysts → key risks → sources). Every numeric claim is required to cite a source already in the raw tool output (the synthesizer's hard rule), and the factuality eval verifies that rule held.
| Ticker | Score | Verified | Source |
|---|---|---|---|
| NVDA | 1.00 | 17 / 17 | eval/runs/NVDA_20260426_factuality.json |
| AAPL | 0.958 | 23 / 24 | eval/runs/AAPL_20260426_factuality.json |
| MSFT | 0.944 | 17 / 18 | eval/runs/MSFT_20260426_factuality.json |
| META | 1.00 | 11 / 11 | eval/runs/META_20260426_factuality.json |
| JPM | 0.955 | 21 / 22 | eval/runs/JPM_20260426_factuality.json |
LLM-as-judge (Gemini 2.5 Flash) extracts every factual claim from the brief
and verifies each against raw_data. See HOW_IT_WORKS.md →
"Empirical results" for
a deeper read of what the eval catches and what it doesn't.
Sample factuality run output — NVDA (perfect) and MSFT (one flag)
```json
// eval/runs/NVDA_20260426_factuality.json
{
  "score": 1.0,
  "total_claims": 17,
  "verified_claims": 17,
  "flagged": []
}
```

```json
// eval/runs/MSFT_20260426_factuality.json
{
  "score": 0.944,
  "total_claims": 18,
  "verified_claims": 17,
  "flagged": [
    {
      "claim": "EPS (diluted) +59.8%",
      "verdict": "contradicted",
      "reason": "The source states that Diluted EPS is $5.16 (vs $3.23 YoY), which implies a YoY increase of 59.75%, not 59.8%."
    }
  ]
}
```

The MSFT flag is a rounding-precision strict-mode hit: the synthesizer
rounded $5.16 / $3.23 to +59.8%; the judge computed 59.75% and
called it contradicted. Strict-mode signal if you care about exact
reproducibility, arguably noise otherwise — kept as-is for v0.1 because
explicit miscalibration is more useful than silent agreement. The full
flagged-claim taxonomy across all 5 runs is in HOW_IT_WORKS § Empirical
results.
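The flag reproduces with plain arithmetic (values taken from the MSFT run above):

```python
# Diluted EPS cited by the MSFT brief's source: $5.16 now vs $3.23 a year ago.
current, prior = 5.16, 3.23
yoy_pct = (current / prior - 1) * 100  # 59.7523...%

# The synthesizer rounded to one decimal place; the judge kept two.
assert round(yoy_pct, 1) == 59.8    # synthesizer's "+59.8%" is a valid rounding
assert round(yoy_pct, 2) == 59.75   # judge's 59.75% — strict comparison flags it
```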
```
User: "earnings-recap NVDA"
        │
        ▼
 ┌──────────────┐
 │   Planner    │  Gemini 2.5 Flash → ordered ToolCall list
 └──────┬───────┘
        ▼
 ┌──────────────┐
 │   Executor   │  sequential, fail-soft per tool
 └──────┬───────┘
        ▼
 ┌──────────┬──────────┼──────────┬──────────┐
 ▼          ▼          ▼          ▼          ▼
┌───────┐ ┌───────┐ ┌────────┐ ┌────────┐ ┌────────┐
│SEC 10Q│ │SEC 8-K│ │yfinance│ │company │ │ Tavily │
└───────┘ └───────┘ └────────┘ │  info  │ │ search │
                               └────────┘ └────────┘
        │
        ▼  raw evidence (JSON)
 ┌──────────────┐
 │ Synthesizer  │  Gemini 2.5 Pro → markdown brief
 └──────┬───────┘  (synthesizer_b: Opus 4.7, Phase 2)
        ▼
 ┌──────────────┐
 │ Factuality   │  Gemini 2.5 Flash → verified/total
 │    judge     │  (thinking_budget: 0, JSON mode)
 └──────────────┘
```
Layered model routing (configs/models.yaml) — Gemini 2.5 Flash for the
planner / mid-loop summaries / judge (cost-and-latency tier), Gemini 2.5 Pro
for the user-facing synthesizer (quality tier). One file controls every model
choice; cost/quality ablations are config changes, not code changes. Phase 2
turns this into a three-way A/B: Gemini 2.5 Pro vs Claude Opus 4.7 vs
Multi-Horizon Gemma adapter through the synthesizer_b / synthesizer_c
slots.
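The role-keyed routing can be sketched in a few lines — a minimal illustration, not the repo's actual loader; the real config lives in `configs/models.yaml` and its exact schema is an assumption here:

```python
# Role → model routing driven by one config dict (stand-in for models.yaml).
# Role names mirror the README; field names are illustrative.
ROLES = {
    "planner":       {"provider": "gemini", "model": "gemini-2.5-flash"},
    "judge":         {"provider": "gemini", "model": "gemini-2.5-flash"},
    "synthesizer":   {"provider": "gemini", "model": "gemini-2.5-pro"},
    # Phase 2 A/B slots swap in here without touching call sites:
    "synthesizer_b": {"provider": "anthropic", "model": "claude-opus"},
}

def resolve(role: str) -> tuple[str, str]:
    """Return (provider, model) for a role; call sites never hard-code models."""
    cfg = ROLES[role]
    return cfg["provider"], cfg["model"]

assert resolve("judge") == ("gemini", "gemini-2.5-flash")
```

Because every call site goes through `resolve`, a cost/quality ablation really is a one-line config change.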
Tool layer is plain Python in v0.1. MCP wrapping is a Phase 2 task — the function signatures are deliberately MCP-shaped (single dict in, single dict out) so the wrapping is mechanical.
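A minimal sketch of that "MCP-shaped" contract — one dict in, one dict out. The function and field names here are illustrative, not the repo's actual signatures:

```python
# Every tool takes a single JSON-serializable dict and returns one, so
# wrapping it as an MCP tool later is mechanical: the dicts are already
# request/response payloads.
def get_quote_history(args: dict) -> dict:
    ticker = args["ticker"]
    period = args.get("period", "3mo")
    # ... the real tool would fetch from yfinance here; stubbed out ...
    return {"ok": True, "ticker": ticker, "period": period, "closes": []}

result = get_quote_history({"ticker": "NVDA"})
assert result["ok"] and result["period"] == "3mo"
```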
```shell
git clone https://github.com/srx7703/multi-horizon-financial-agent
cd multi-horizon-financial-agent
pip install -e ".[dev]"   # or: uv sync
cp .env.example .env
$EDITOR .env              # fill TAVILY_API_KEY, GCP_PROJECT_ID, MHFA_LOCAL_SEC_DIR
mhfa earnings-recap NVDA
```

| Var | Why |
|---|---|
| `GCP_PROJECT_ID` + `VERTEX_REGION` | Required for Gemini on Vertex. `us-central1` is the default and has Gemini 2.5 GA. `us-east5` is the Anthropic region (used only when a role's `provider: anthropic` — see D-009) |
| `ANTHROPIC_BACKEND` | Only matters when a role's `provider: anthropic` (v0.1: just `synthesizer_b`, a Phase 2 slot). `vertex` (default) routes through GCP; `direct` uses `ANTHROPIC_API_KEY` |
| `TAVILY_API_KEY` | Free tier: 1k queries/mo. Skip with `MHFA_SEARCH_PROVIDER=mock` for tests |
| `SEC_USER_AGENT` | SEC blocks requests without a contact-info UA — use a real name + email |
| `MHFA_LOCAL_SEC_DIR` | Path to a dir holding `summaries/`, `summaries_10q/`, `summaries_8k/` (the sister repo's data dump) |
- Plan (`agent/planner.py`) — for `earnings_recap` the plan is fixed: latest 10-Q + recent 8-Ks + 3-month price + company info + web hits.
- Execute (`agent/executor.py`) — calls each tool, captures per-tool timing, never crashes on a single tool failure.
- Chart — yfinance close prices → matplotlib PNG.
- Synthesize (`agent/synthesizer.py`) — Gemini 2.5 Pro gets raw JSON + a hard prompt that requires every numeric claim to trace to a source and forbids estimation (the system rule writes "not disclosed" instead of guessing). Output is markdown with inline citations.
- (Optional) Eval (`eval/factuality.py`) — Gemini 2.5 Flash-as-judge runs two passes: extract every factual claim from the brief, then verify each against `raw_data`. Both passes use Gemini's JSON mode + `thinking_budget: 0` so the judge stays cheap and parseable. Score = verified / total.
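The scoring step reduces to a few lines once the judge returns parseable JSON. A sketch under stated assumptions: the verdict labels (`"supported"` / `"contradicted"`) and field names are illustrative — only score = verified / total comes from the README:

```python
import json

# Hypothetical judge output in JSON mode (two claims, one flagged).
judge_output = json.loads("""
{"claims": [
  {"claim": "Revenue +22% YoY",      "verdict": "supported"},
  {"claim": "EPS (diluted) +59.8%",  "verdict": "contradicted"}
]}
""")

verified = sum(1 for c in judge_output["claims"] if c["verdict"] == "supported")
total = len(judge_output["claims"])
score = round(verified / total, 3)  # 1 verified of 2 → 0.5
assert score == 0.5
```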
| Metric | Method | v0.1 baseline |
|---|---|---|
| Factuality | Two-pass Gemini Flash (extract claims → verify each against raw_data) | 1.00 (17/17) on eval/runs/NVDA_20260426_factuality.json |
| Comprehensiveness | Rubric (0–3) over 5 axes | Phase 2 |
| BERTScore F1 vs golden | RoBERTa-large, paired t-test | Phase 2 — same metric as sister repo for narrative continuity |
Hand-curated golden briefs live in eval/golden/. They are written from raw
sources by hand, never LLM-generated (decision D-006 — avoids
evaluator/generator collapse).
```
src/mhfa/
├── tools/       SEC, market data, web search — pluggable
├── agent/       planner + executor + synthesizer
├── models/      client.py — provider-agnostic complete_text adapter
│                (dispatches Gemini ↔ Anthropic per role)
├── workflows/   earnings_recap (more in Phase 2)
├── eval/        factuality (v0.1) + ab_harness (Phase 2)
└── cli.py       entry point: `mhfa earnings-recap <TICKER>`

configs/models.yaml   role → {provider, model, max_tokens, temperature, …}
eval/golden/          hand-curated reference briefs (D-006)
eval/runs/            pinned eval results (factuality, BERTScore in Phase 2)
tests/                hermetic — fake completion adapter + fake SEC dir
```
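The hermetic-test pattern named above can be sketched as a fake completion adapter that stands in for the real Gemini/Anthropic clients so tests never touch the network — class and method names here are illustrative, not the repo's:

```python
class FakeCompletionAdapter:
    """Returns canned text per role and records calls for assertions."""

    def __init__(self, canned: dict[str, str]):
        self.canned = canned          # role → canned response text
        self.calls: list[str] = []    # roles invoked, in order

    def complete_text(self, role: str, prompt: str) -> str:
        self.calls.append(role)
        return self.canned[role]

fake = FakeCompletionAdapter({"synthesizer": "# NVDA Earnings Brief\n..."})
assert fake.complete_text("synthesizer", "ignored").startswith("# NVDA")
assert fake.calls == ["synthesizer"]
```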
- `docs/HOW_IT_WORKS.md` — long-form walkthrough of the design tensions and how each one resolved (planner shape, Vertex vs direct, layered routing, hand-curated golden briefs, what I'd do differently). Read this if you want to understand why the moving parts are shaped the way they are.
- `DECISIONS.md` — D-001 through D-009, terse one-paragraph rationale per architectural choice. D-009 covers the Vertex Anthropic quota Catch-22 and the pivot to Gemini 2.5.
- `SPRINT_PLAN.md` — the actual hour-by-hour MVP plan.
- `eval/golden/` — hand-curated reference briefs used as the Phase 2 A/B baseline.
See ROADMAP.md for full phasing. Headline:
- v0.1 (this release) — `earnings_recap` end-to-end, factuality eval, 3+ golden briefs, CI green.
- v0.5 — Multi-Horizon Gemma adapter integration, true A/B with paired-t, `ma_drilldown` + `sector_compare` workflows.
- v1.0 — Streamlit UI, Docker self-host, watchlist cron, cost-aware routing, observability dashboard, public release.
MIT — see LICENSE.
multi-horizon-financial-llm
is the fine-tuning + RAG side: 69 S&P 500 tickers × 381 SEC filings, two
PEFT LoRA adapters (Gemma 2 27B and Gemma 4 31B) trained on TPU v6e-8.
HF Hub: Srx7703/gemma-{2-27b,4-31b}-financial-adapter. The two repos
cross-reference; the agent here will A/B those adapters against Opus in
Phase 2.

