Hi OLMo team,
I'm sharing a production deployment of OLMo in a highly regulated domain that might be of interest to the community — particularly around hallucination prevention and auditability.
Context
Brazilian tax law is one of the most complex regulatory environments in the world. In January 2025, Brazil enacted LC 214/2025 — a comprehensive tax reform that restructured the entire indirect tax system. Professionals in this domain need AI systems that are 100% reliable: a wrong answer doesn't just look bad — it creates legal and financial liability.
Most LLM-based legal/tax tools rely on prompting or RLHF to reduce hallucinations. The best published results still show significant hallucination rates (industry benchmarks range from 17% to 58%+ depending on the system). We took a fundamentally different approach.
Architecture: "Take the pen away from the LLM"
Our core design principle: the LLM never generates factual content. Instead, it serves two narrow roles:
- Intent classification — understanding what the user is asking
- Natural language formatting — presenting retrieved facts in conversational Portuguese
All factual content comes from a deterministic retrieval pipeline that combines:
- Hybrid search (dense + sparse retrieval)
- A repository of human-approved solutions (case-based reasoning, documented since 1992)
- Coverage gating: if the system can't cover ≥70% of the query from verified sources, it refuses to answer rather than hallucinate
The LLM is deliberately constrained to a "classifier + formatter" role. It never writes the answer — it only decides which pre-approved answer to surface and how to phrase it.
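To make the "classifier + formatter" constraint concrete, here is a minimal sketch of the control flow. The intent labels, the lookup table, and both stubs are hypothetical stand-ins; in the real system the LLM fills the two stubbed roles, but the factual payload always comes from the repository.

```python
# Illustrative sketch: the LLM never writes the answer, it only
# selects a pre-approved one and phrases it. All names are hypothetical.

APPROVED_ANSWERS = {
    "cbs_rate": "Per approved solution #1234, the CBS rate is ...",
    "ibs_scope": "Per approved solution #5678, IBS applies to ...",
}

def classify_intent(query: str) -> str:
    """Role 1 (LLM): map the query to a known intent label.
    Stubbed with keyword matching for illustration."""
    return "cbs_rate" if "rate" in query.lower() else "ibs_scope"

def format_answer(fact: str) -> str:
    """Role 2 (LLM): rephrase the retrieved fact conversationally.
    Stubbed as identity; the factual content is never generated."""
    return fact

def answer(query: str) -> str:
    intent = classify_intent(query)   # LLM decides *which* answer
    fact = APPROVED_ANSWERS[intent]   # facts come only from the repository
    return format_answer(fact)        # LLM decides *how* to phrase it
```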
Results
| Metric | Value |
| --- | --- |
| Hallucination rate | 0% (across all test cases) |
| Entity resolution accuracy | 100% (31/31 cases) |
| Latency | 60–200 ms |
| Audit trail | Every response includes: source, hash, confidence score, timestamp |
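For illustration, the audit metadata attached to each response could be shaped like this. The field names, the SHA-256 choice, and the helper are assumptions for the sketch, not our exact production schema.

```python
# Hypothetical audit-record builder; fields mirror the table above.
import hashlib
from datetime import datetime, timezone

def audit_record(source_id: str, source_text: str, confidence: float) -> dict:
    """Attach provenance metadata to a response: which verified source
    it came from, a content hash of that source, the retrieval
    confidence, and a UTC timestamp."""
    return {
        "source": source_id,
        "hash": hashlib.sha256(source_text.encode("utf-8")).hexdigest(),
        "confidence": confidence,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```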
Published artifacts
Why I'm sharing this
I believe this approach — using OLMo as a classifier rather than a generator in high-stakes domains — is an underexplored pattern that could be valuable for the OLMo research community. It's also directly relevant to the work on OLMoTrace, since our architecture achieves full traceability by design (every response maps to a verified source, not to training data).
I'd welcome feedback from the team, particularly:
- Has the OLMo team seen similar "LLM-as-classifier" architectures in other regulated domains?
- Would this be a useful case study for the OLMo ecosystem documentation?
- Are there OLMo-specific optimizations for classification-only workloads (vs. generation)?
Happy to share more technical details or do a walkthrough.
Jony Wolff
Kesshet AI Systems — Tel Aviv · São Paulo
jony@kesshet.com.br
[LinkedIn](https://www.linkedin.com/in/jony-wolff-25510839b)
Suggested labels: discussion, community, use-case