
Zero-hallucination OLMo deployment in Brazilian tax law — architectural case study #904

@JonyWolff


Hi OLMo team,

I'm sharing a production deployment of OLMo in a highly regulated domain that might be of interest to the community — particularly around hallucination prevention and auditability.

Context

Brazilian tax law is one of the most complex regulatory environments in the world. In January 2025, Brazil enacted LC 214/2025 — a comprehensive tax reform that restructured the entire indirect tax system. Professionals in this domain need AI systems that are 100% reliable: a wrong answer doesn't just look bad — it creates legal and financial liability.

Most LLM-based legal/tax tools rely on prompting or RLHF to reduce hallucinations. The best published results still show significant hallucination rates (industry benchmarks range from 17% to 58%+ depending on the system). We took a fundamentally different approach.

Architecture: "Take the pen away from the LLM"

Our core design principle: the LLM never generates factual content. Instead, it serves two narrow roles:

  1. Intent classification — understanding what the user is asking
  2. Natural language formatting — presenting retrieved facts in conversational Portuguese

All factual content comes from a deterministic retrieval pipeline that combines:

  • Hybrid search (dense + sparse retrieval)
  • A repository of human-approved solutions (case-based reasoning, documented since 1992)
  • Coverage gating: if the system can't cover ≥70% of the query from verified sources, it refuses to answer rather than hallucinate

The LLM is deliberately constrained to a "classifier + formatter" role. It never writes the answer — it only decides which pre-approved answer to surface and how to phrase it.
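The "classifier + formatter" split can be sketched as follows. The repository contents, intent labels, and function names are hypothetical; the keyword stub stands in for the constrained OLMo classification call:

```python
from dataclasses import dataclass

@dataclass
class ApprovedAnswer:
    intent: str   # intent label this answer covers
    text: str     # human-approved answer text, surfaced verbatim
    source: str   # citation to the verified source

# Hypothetical repository of pre-approved answers.
REPOSITORY = [
    ApprovedAnswer("ibs_scope", "Answer text approved by a human reviewer.", "LC 214/2025"),
]

def classify_intent(query: str) -> str:
    # In production this is the LLM's job, constrained to a fixed label set;
    # a trivial keyword stub stands in for it here.
    return "ibs_scope" if "ibs" in query.lower() else "unknown"

def answer(query: str) -> str:
    matches = [a for a in REPOSITORY if a.intent == classify_intent(query)]
    if not matches:
        return "No verified answer available."  # refusal path
    a = matches[0]
    # The LLM would rephrase a.text conversationally; the facts stay verbatim.
    return f"{a.text} (source: {a.source})"
```

Because the LLM's output space is a label, not free text, its worst-case failure is a misrouted intent (recoverable, auditable), never an invented fact.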

Results

| Metric | Value |
| --- | --- |
| Hallucination rate | 0% (across all test cases) |
| Entity resolution accuracy | 100% (31/31 cases) |
| Latency | 60–200 ms |
| Audit trail | Every response includes source, hash, confidence score, and timestamp |
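The audit record could look something like the sketch below. Field names and hash choice are my assumptions, derived only from the four fields listed (source, hash, confidence score, timestamp):

```python
import hashlib
from datetime import datetime, timezone

def audit_record(response: str, source: str, confidence: float) -> dict:
    """Build a per-response audit entry: source, content hash,
    confidence score, and a timezone-aware timestamp."""
    return {
        "source": source,
        "sha256": hashlib.sha256(response.encode("utf-8")).hexdigest(),
        "confidence": confidence,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Hashing the response text means any later tampering with a logged answer is detectable, which matters when responses can carry legal liability.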

Published artifacts

Why I'm sharing this

I believe this approach — using OLMo as a classifier rather than a generator in high-stakes domains — is an underexplored pattern that could be valuable for the OLMo research community. It's also directly relevant to the work on OLMoTrace, since our architecture achieves full traceability by design (every response maps to a verified source, not to training data).

I'd welcome feedback from the team, particularly:

  • Has the OLMo team seen similar "LLM-as-classifier" architectures in other regulated domains?
  • Would this be a useful case study for the OLMo ecosystem documentation?
  • Are there OLMo-specific optimizations for classification-only workloads (vs. generation)?

Happy to share more technical details or do a walkthrough.

Jony Wolff
Kesshet AI Systems — Tel Aviv · São Paulo
jony@kesshet.com.br
[LinkedIn](https://www.linkedin.com/in/jony-wolff-25510839b)


Suggested labels:

discussion community use-case
