We offer several notebooks designed to demonstrate how to use the API and to provide baseline evaluation results. You can use these as inspiration to develop and propose new and original evaluation experiments.
The available notebooks are:
- run_evals_with_crossvalidation: This notebook lets you run a series of experiments (an Experiment Set) on:
  - several LLM models, using a specified set of metrics and a repetition parameter.
  - multiple RAG-augmented LLM models, using a given set of metrics, with a grid search over the “limit” parameter (the limit on the number of blocks in the RAG parameters).
  - multiple RAG-augmented LLM models, using specialized RAG metrics that compute their scores from the “search context” (the retrieved chunks).
- run_dataset_experiences_demo: This notebook runs individual experiments on the components of the dataset itself.
- run_your_own_llm_as_a_judge_metric: This notebook provides an example of how to use a custom LLM-as-a-judge metric (an ad hoc judge). It is based on DECCP, a censorship evaluation on China-related questions inspired by the following article: https://huggingface.co/blog/leonardlin/chinese-llm-censorship-analysis
- run_evals_compliance: This notebook provides an example of how to run a compliance evaluation for your AI system (for example, social bias or toxicity).
- OCR evaluation: Two notebooks that show how to use a Parquet dataset with images (marker dataset) and run an OCR evaluation on it to test some VLMs (Visual Language Models).
- run_an_example_of_toolings_usage: An example of tooling usage.
- run_evals_models_raw: Simple evaluation of the raw Albert models.
- run_evals_models_with_rag: RAG evaluation of the Albert models.
- rd_evalap_is_your_own_ia_system: An example of sampling dataset usage, on the LegalBenchRAG dataset.
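
To make the Experiment Set idea from run_evals_with_crossvalidation more concrete, here is a minimal sketch of how a grid over models, “limit” values, and repetitions expands into individual experiments. It is plain Python and does not call the API. The field names (`model`, `metrics`, `rag_limit`), the model names, and the metric names are all hypothetical placeholders; the actual payload is defined in the notebook and may differ.

```python
from itertools import product


def build_experiment_grid(models, metrics, limits, repeat):
    """Expand models x limits x repetitions into one config dict per experiment.

    Every key used here is a hypothetical illustration, not the real API schema.
    """
    experiments = []
    for model, limit, run in product(models, limits, range(repeat)):
        experiments.append(
            {
                "name": f"{model}__limit-{limit}__run-{run}",
                "model": model,
                "metrics": list(metrics),
                "rag_limit": limit,  # assumed name for the "limit" parameter
            }
        )
    return experiments


grid = build_experiment_grid(
    models=["model-a", "model-b"],    # placeholder model names
    metrics=["metric_1", "metric_2"],  # placeholder metric names
    limits=[1, 5, 10],
    repeat=2,
)

print(len(grid))        # 2 models x 3 limits x 2 repetitions = 12
print(grid[0]["name"])  # model-a__limit-1__run-0
```

Repeating each configuration gives several scores per (model, limit) pair, which is what makes it possible to compare averages across the grid rather than single runs.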