Napolab is your go-to collection of Portuguese datasets for evaluating Large Language Models.
Medium Article: "The Hidden Truth About LLM Performance: Why Your Benchmark Results Might Be Misleading"
Browse the Napolab Leaderboard and stay up to date with the latest advancements in Portuguese language models.
A format of Napolab specifically designed for researchers experimenting with Large Language Models (LLMs) is now available. This format includes two main fields:
- Prompt: The input prompt to be fed into the LLM.
- Answer: The expected classification output label from the LLM, which is always a number between 0 and 5.
The dataset in this format can be accessed at https://huggingface.co/datasets/ruanchaves/napolab. If you’ve used Napolab for LLM evaluations, please share your findings with us!
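As a minimal illustration of how this format can be scored, the sketch below assumes hypothetical example rows and completions; the field names `prompt` and `answer` come from the description above, but the `parse_label` helper and the sample data are stand-ins for real LLM output:

```python
import re

# Hypothetical rows in Napolab's LLM format: each pairs a prompt
# with an expected integer label between 0 and 5 (see above).
rows = [
    {"prompt": "Classifique o par de sentenças...", "answer": 1},
    {"prompt": "Classifique o comentário...", "answer": 0},
]

def parse_label(completion):
    """Extract the first digit 0-5 from a raw LLM completion, or None."""
    match = re.search(r"[0-5]", completion)
    return int(match.group()) if match else None

def accuracy(rows, completions):
    """Share of completions whose parsed label matches `answer`."""
    hits = sum(
        parse_label(c) == row["answer"] for row, c in zip(rows, completions)
    )
    return hits / len(rows)

# Stand-in completions; in practice these come from your LLM.
completions = ["Resposta: 1", "A classe é 0."]
print(accuracy(rows, completions))  # → 1.0
```

Parsing free-form completions back into labels is the error-prone step in practice, so whichever extraction rule you use, report it alongside your results.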
Nicholas Kluge et al. have fine-tuned TeenyTinyLlama models on the FaQUaD-NLI and HateBR datasets from Napolab. For more information, please refer to the article "TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese".
We've made several models fine-tuned on this benchmark available on the Hugging Face Hub:
| Datasets | mDeBERTa v3 | BERT Large | BERT Base |
|---|---|---|---|
| ASSIN 2 - STS | Link | Link | Link |
| ASSIN 2 - RTE | Link | Link | Link |
| ASSIN - STS | Link | Link | Link |
| ASSIN - RTE | Link | Link | Link |
| HateBR | Link | Link | Link |
| FaQUaD-NLI | Link | Link | Link |
| PorSimplesSent | Link | Link | Link |
For model fine-tuning details and benchmark results, visit EVALUATION.md.
Napolab adopts the following guidelines for the inclusion of datasets:
- 🌿 Natural: As much as possible, datasets consist of natural Portuguese text or professionally translated text.
- ✅ Reliable: Metrics correlate reliably with human judgments (accuracy, F1 score, Pearson correlation, etc.).
- 🌐 Public: Every dataset is available through a public link.
- 👩‍🔧 Human: Expert human annotations only. No automatic or unreliable annotations.
- 🎓 General: No domain-specific knowledge or advanced preparation is needed to solve dataset tasks.
Napolab currently includes the following datasets:
- assin
- assin2
- rerelem
- hatebr
- reli-sa
- faquad-nli
- porsimplessent
💡 Contribute: We're open to expanding Napolab! Suggest additions in the issues. For more information, read our CONTRIBUTING.md.
🌍 For broader accessibility, all datasets have translations in Catalan, English, Galician and Spanish using the facebook/nllb-200-1.3B model via Easy-Translate.
To reproduce the Napolab benchmark available on the Hugging Face Hub locally, follow these steps:
- Clone the repository and install the library:

```bash
git clone https://github.com/ruanchaves/napolab.git
cd napolab
pip install -e .
```

- Generate the benchmark file:
```python
from napolab import export_napolab_benchmark, convert_to_completions_format

input_df = export_napolab_benchmark()
output_df = convert_to_completions_format(input_df)
output_df.reset_index().to_csv("test.csv", index=False)
```

If you would like to cite our work or models, please reference the Master's thesis "Lessons Learned from the Evaluation of Portuguese Language Models":
```bibtex
@mastersthesis{chaves2023lessons,
  title={Lessons learned from the evaluation of Portuguese language models},
  author={Chaves Rodrigues, Ruan},
  year={2023},
  school={University of Malta},
  url={https://www.um.edu.mt/library/oar/handle/123456789/120557}
}
```
The HateBR dataset, including all its components, is provided strictly for academic and research purposes. The use of the HateBR dataset for any commercial or non-academic purpose is expressly prohibited without the prior written consent of SINCH.
