RAG_evaluation.ipynb
Readme.md
create_legalbenchrag_dataset.ipynb
create_marker_dataset.ipynb
rd_change_rag_prompt.ipynb
run_an_example_of_toolings_usage.ipynb
run_dataset_experiences_demo.ipynb
run_evals_compliance.ipynb
run_evals_for_your_own_IA_system.ipynb
run_evals_models_raw.ipynb
run_evals_models_with_rag.ipynb
run_evals_ocr_marker.ipynb
run_evals_with_crossvalidation.ipynb
run_your_own_llm_as_a_judge_metric.ipynb

Demo Notebooks

We offer several notebooks designed to demonstrate how to use the API and to provide baseline evaluation results. You can use these as inspiration to develop and propose new and original evaluation experiments.

The available notebooks are:

run_evals_with_crossvalidation: This notebook allows you to perform a series of experiments (Experiment Set) on:
- several LLM models using a specified set of metrics and a repetition parameter.
- multiple RAG-augmented LLM models with a given set of metrics, with a grid search over the “limit” parameter (the limit on the number of blocks set in the RAG parameters).
- multiple RAG-augmented generation LLM models with specialized RAG metrics. These RAG metrics use the “search context” (aka chunks) to compute scores.
run_dataset_experiences_demo: This notebook runs individual experiments on the components of the dataset itself.
run_your_own_llm_as_a_judge_metric: This notebook provides an example of how to use a custom LLM-as-a-judge metric (ad hoc judge). It is based on DECCP, a censorship evaluation on China-related questions, inspired by the following article: https://huggingface.co/blog/leonardlin/chinese-llm-censorship-analysis
run_evals_compliance: This notebook provides an example of how to run a compliance evaluation for your AI system (social bias, toxicity, etc.).
OCR evaluation: Two notebooks that show how to use a Parquet dataset with images (marker dataset) and run an OCR evaluation on it to test some VLMs (Vision Language Models).
- create_marker_dataset.ipynb
- run_evals_ocr_marker.ipynb
run_an_example_of_toolings_usage: An example of tooling usage.
run_evals_models_raw: Simple evaluation of the raw Albert models.
run_evals_models_with_rag: RAG evaluation of the Albert models.
rd_evalap_is_your_own_ia_system: An example of dataset sampling usage, on the LegalBenchRAG dataset.
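
To give an idea of what the grid search in run_evals_with_crossvalidation amounts to, here is a minimal sketch in plain Python. It does not call the API: the function `build_experiment_grid`, the dictionary keys (`name`, `model`, `rag_limit`, `metrics`) and the model and metric names are all hypothetical, chosen only for illustration. It expands every combination of model and “limit” value, repeated a given number of times, into one experiment description.

```python
from itertools import product


def build_experiment_grid(models, limits, metrics, repeat=1):
    """Expand models x limits x repetitions into a flat list of experiments.

    All field names are illustrative assumptions, not the real API schema.
    """
    experiments = []
    for model, limit in product(models, limits):
        for run in range(repeat):
            experiments.append(
                {
                    "name": f"{model}__limit-{limit}__run-{run}",
                    "model": model,
                    "rag_limit": limit,  # hypothetical key for the "limit" parameter
                    "metrics": list(metrics),
                }
            )
    return experiments


grid = build_experiment_grid(
    models=["model-a", "model-b"],  # placeholder model names
    limits=[1, 5, 10],
    metrics=["example_metric"],  # placeholder metric name
    repeat=2,
)
print(len(grid))  # 2 models * 3 limits * 2 repetitions = 12
```

Each entry of `grid` could then be turned into one experiment of an Experiment Set; refer to the notebook itself for the actual payload format.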
