
Commit 0dcf1d5

Feature/cn llm eval (#22)
* Add batch processing LLM, cn prompts
* Restructure predictors
* lettuce multi
* first version of the eurobert blogpost
* Change blog post and README
* Added Openai API key
* A bit more stuff to README
* Remove Redundancy
* Bump version
* Bump version in README
* Remove emojis
* Mini changes in README and EUROBERT
* Final changes in blog
* Typo
* Predictions
* Different image
* Changed pytest
* Fixed tests
1 parent 834f6ab commit 0dcf1d5

20 files changed

Lines changed: 1060 additions & 542 deletions

README.md

Lines changed: 60 additions & 23 deletions
@@ -6,7 +6,7 @@
66
<br><em>Because even AI needs a reality check! 🥬</em>
77
</p>
88

9-
LettuceDetect is a lightweight and efficient tool for detecting hallucinations in Retrieval-Augmented Generation (RAG) systems. It identifies unsupported parts of an answer by comparing it to the provided context. The tool is trained and evaluated on the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) dataset and leverages [ModernBERT](https://github.com/AnswerDotAI/ModernBERT) for long-context processing, making it ideal for tasks requiring extensive context windows.
9+
LettuceDetect is a lightweight and efficient tool for detecting hallucinations in Retrieval-Augmented Generation (RAG) systems. It identifies unsupported parts of an answer by comparing it to the provided context. The tool is trained and evaluated on the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) dataset and leverages [ModernBERT](https://github.com/AnswerDotAI/ModernBERT) for English and [EuroBERT](https://huggingface.co/blog/EuroBERT/release) for multilingual support, making it ideal for tasks requiring extensive context windows.
1010

1111
Our models are inspired by the [Luna](https://aclanthology.org/2025.coling-industry.34/) paper, which presents an encoder-based model and uses a similar token-level approach.
1212

@@ -21,17 +21,24 @@ Our models are inspired from the [Luna](https://aclanthology.org/2025.coling-ind
2121
- LettuceDetect addresses two critical limitations of existing hallucination detection models:
2222
- Context window constraints of traditional encoder-based methods
2323
- Computational inefficiency of LLM-based approaches
24-
- Our models currently **outperforms** all other encoder-based and prompt-based models on the RAGTruth dataset and are significantly faster and smaller
24+
- Our models currently **outperform** all other encoder-based and prompt-based models on the RAGTruth dataset and are significantly faster and smaller
2525
- Achieves higher score than some fine-tuned LLMs e.g. LLAMA-2-13B presented in [RAGTruth](https://aclanthology.org/2024.acl-long.585/), coming up just short of the LLM fine-tuned in the [RAG-HAT paper](https://aclanthology.org/2024.emnlp-industry.113.pdf)
26-
- We release the code, the model and the tool under the **MIT license**
26+
27+
## 🚀 Latest Updates
28+
29+
- **May 18, 2025** - Released version **0.1.7**: Multilingual support (thanks to EuroBERT) for 7 languages: English, German, French, Spanish, Italian, Polish, and Chinese!
30+
- Up to **17 F1 points improvement** over baseline LLM judges like GPT-4.1-mini across different languages
31+
- **EuroBERT models**: We've trained base/210M (faster) and large/610M (more accurate) variants
32+
- You can now also use **LLM baselines** for hallucination detection (see below)
2733

2834
## Get going
2935

3036
### Features
3137

3238
- **Token-level precision**: detect exact hallucinated spans
3339
- 🚀 **Optimized for inference**: smaller model size and faster inference
34-
- 🧠 **4K context window** via ModernBERT
40+
- 🧠 **Long context window** support (4K for ModernBERT, 8K for EuroBERT)
41+
- 🌍 **Multilingual support**: 7 languages covered
3542
- ⚖️ **MIT-licensed** models & code
3643
- 🤖 **HF Integration**: one-line model loading
3744
- 📦 **Easy-to-use Python API**: install from pip and integrate into your RAG system with a few lines of code
@@ -45,25 +52,42 @@ pip install -e .
4552

4653
From pip:
4754
```bash
48-
pip install lettucedetect
55+
pip install lettucedetect -U
4956
```
5057

5158
### Quick Start
5259

5360
Check out our models published to Huggingface:
54-
- lettucedetect-base: https://huggingface.co/KRLabsOrg/lettucedect-base-modernbert-en-v1
55-
- lettucedetect-large: https://huggingface.co/KRLabsOrg/lettucedect-large-modernbert-en-v1
61+
62+
**English Models**:
63+
- Base: [KRLabsOrg/lettucedetect-base-modernbert-en-v1](https://huggingface.co/KRLabsOrg/lettucedetect-base-modernbert-en-v1)
64+
- Large: [KRLabsOrg/lettucedetect-large-modernbert-en-v1](https://huggingface.co/KRLabsOrg/lettucedetect-large-modernbert-en-v1)
65+
66+
**Multilingual Models**:
67+
We've trained 210m and 610m variants of EuroBERT, see our HuggingFace collection: [HF models](https://huggingface.co/collections/KRLabsOrg/multilingual-hallucination-detection-682a2549c18ecd32689231ce)
68+
69+
70+
*See the full list of models and smaller variants in our [HuggingFace page](https://huggingface.co/KRLabsOrg).*
5671

5772
You can get started right away with just a few lines of code.
5873

5974
```python
6075
from lettucedetect.models.inference import HallucinationDetector
6176

62-
# For a transformer-based approach:
77+
# For English:
6378
detector = HallucinationDetector(
64-
method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
79+
method="transformer",
80+
model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1",
6581
)
6682

83+
# For other languages (e.g., German):
84+
# detector = HallucinationDetector(
85+
# method="transformer",
86+
# model_path="KRLabsOrg/lettucedect-210m-eurobert-de-v1",
87+
# lang="de",
88+
# trust_remote_code=True
89+
# )
90+
6791
contexts = ["France is a country in Europe. The capital of France is Paris. The population of France is 67 million.",]
6892
question = "What is the capital of France? What is the population of France?"
6993
answer = "The capital of France is Paris. The population of France is 69 million."
@@ -75,26 +99,39 @@ print("Predictions:", predictions)
7599
# Predictions: [{'start': 31, 'end': 71, 'confidence': 0.9944414496421814, 'text': ' The population of France is 69 million.'}]
76100
```
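
The span format in the example output above (character offsets `start`/`end` into the answer, plus a `confidence` score) can be post-processed with plain Python. Below is a minimal sketch, assuming the offsets index into the answer string as they do in the output above. The helper name `highlight_spans`, the `min_confidence` threshold, and the `[[ ]]` markers are illustrative and not part of lettucedetect.

```python
def highlight_spans(answer, predictions, min_confidence=0.5,
                    open_tag="[[", close_tag="]]"):
    """Wrap each predicted hallucinated span of `answer` in marker tags."""
    # Keep only confident spans and process them right-to-left,
    # so inserting markers never shifts offsets that are still pending.
    spans = sorted(
        (p for p in predictions if p["confidence"] >= min_confidence),
        key=lambda p: p["start"],
        reverse=True,
    )
    for span in spans:
        start, end = span["start"], span["end"]
        answer = (
            answer[:start] + open_tag + answer[start:end] + close_tag + answer[end:]
        )
    return answer


answer = "The capital of France is Paris. The population of France is 69 million."
# Span copied from the example output above.
predictions = [
    {
        "start": 31,
        "end": 71,
        "confidence": 0.9944414496421814,
        "text": " The population of France is 69 million.",
    }
]

print(highlight_spans(answer, predictions))
# The capital of France is Paris.[[ The population of France is 69 million.]]
```

Raising `min_confidence` trades recall for precision: spans below the threshold are left unmarked.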
77101

78-
## Performance
102+
Check out our [HF collection](https://huggingface.co/collections/KRLabsOrg/multilingual-hallucination-detection-682a2549c18ecd32689231ce) for more examples.
79103

80-
**Example level results**
104+
We also implemented LLM-based baselines. To use them, set your OpenAI API key:
81105

82-
We evaluate our model on the test set of the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) dataset. Our large model, **lettucedetect-large-v1**, achieves an overall F1 score of 79.22%, outperforming prompt-based methods like GPT-4 (63.4%) and encoder-based models like [Luna](https://aclanthology.org/2025.coling-industry.34.pdf) (65.4%). It also surpasses fine-tuned LLAMA-2-13B (78.7%) (presented in [RAGTruth](https://aclanthology.org/2024.acl-long.585/)) and is competitive with the SOTA fine-tuned LLAMA-3-8B (83.9%) (presented in the [RAG-HAT paper](https://aclanthology.org/2024.emnlp-industry.113.pdf)). Overall, **lettucedetect-large-v1** and **lettucedect-base-v1** are very performant models, while being very effective in inference settings.
106+
```bash
107+
export OPENAI_API_KEY=your_api_key
108+
```
83109

84-
The results on the example-level can be seen in the table below.
110+
Then in code:
85111

86-
<p align="center">
87-
<img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/example_level_lettucedetect.png?raw=true" alt="Example-level Results" width="800"/>
88-
</p>
112+
```python
113+
from lettucedetect.models.inference import HallucinationDetector
89114

90-
**Span-level results**
115+
# For German:
116+
detector = HallucinationDetector(method="llm", lang="de")
91117

92-
At the span level, our model achieves the best scores across all data types, significantly outperforming previous models. The results can be seen in the table below. Note that here we don't compare to models, like [RAG-HAT](https://aclanthology.org/2024.emnlp-industry.113.pdf), since they have no span-level evaluation presented.
118+
# Then predict the same way
119+
predictions = detector.predict(context=contexts, question=question, answer=answer, output_format="spans")
120+
```
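
Because the `llm` method relies on the `OPENAI_API_KEY` environment variable exported above, it can help to check for the key before constructing the detector. Below is a minimal sketch. `require_openai_key` is a hypothetical helper, not part of lettucedetect, and how the library itself reports a missing key is not covered here.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before using method='llm'."
        )
    return key


# Usage (commented out so the sketch runs without lettucedetect installed):
# require_openai_key()
# detector = HallucinationDetector(method="llm", lang="de")
```

Failing early with an explicit message is usually easier to debug than an authentication error raised in the middle of a prediction call.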
93121

94-
<p align="center">
95-
<img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/span_level_lettucedetect.png?raw=true" alt="Span-level Results" width="800"/>
96-
</p>
122+
## Performance
123+
124+
We've evaluated our models against both encoder-based and LLM-based approaches. The key findings include:
97125

126+
- In English, our models **outperform** all other encoder-based and prompt-based models on the RAGTruth dataset and are significantly faster and smaller
127+
- Our multilingual models are better than baseline LLM judges like GPT-4.1-mini
128+
- Our models are also significantly faster and smaller than the LLM-based judges
129+
130+
For detailed performance metrics and evaluations of our models:
131+
- [English model documentation](docs/README.md)
132+
- [Multilingual model documentation](docs/EUROBERT.md)
133+
- [Paper](https://arxiv.org/abs/2502.17125)
134+
- [Model cards](https://huggingface.co/KRLabsOrg)
98135

99136
## How does it work?
100137

@@ -229,11 +266,11 @@ positional arguments:
229266
options:
230267
-h, --help show this help message and exit
231268
--model MODEL Path or huggingface URL to the model. The default value is
232-
"KRLabsOrg/lettucedect-base-modernbert-en-v1".
269+
"KRLabsOrg/lettucedetect-base-modernbert-en-v1".
233270
--method {transformer}
234271
Hallucination detection method. The default value is
235272
"transformer".
236-
````
273+
```
237274

238275
Example using the python client library:
239276

assets/lettuce_detective_multi.png

2.34 MB