
Dataset Generator (DAGE)

DAGE is a flexible command-line tool for generating relevance datasets for search evaluation. It can retrieve documents from a search engine, generate synthetic queries, and score the relevance of document-query pairs using LLMs.

Setup configuration files

DAGE configuration file

Create a config.yaml file in the dataset-generator directory. This file controls the entire generation process. The parameters needed are:

  • query_template (Optional): Template file given by the user to evaluate the search system using relevance judgements generated by the dataset-generator module (e.g., "resources/template_solr.json")
  • search_engine_type: Type of search engine to use
    • accepted values:
      • "solr"
      • "elasticsearch"
      • "opensearch"
  • collection_name: Name of the search engine index/collection (e.g., "testcore", the one used in Docker containers)
  • search_engine_url: URL of the search engine (e.g., "http://localhost:8983/solr/")
  • documents_filter: Filter query to restrict the set of documents used to generate queries. If a field has more than one value, the retrieved documents must contain at least one of those values in that field (OR-like). If more than one field is given, documents must match all fields (AND-like)
    • an example could be:
      • genre:
        • "horror"
        • "fantasy"
      • type:
        • "book"
  • doc_number: Maximum number of documents to retrieve from the search engine to generate queries (e.g. 100)
  • doc_fields: List of document fields used for query generation and relevance scoring. These should match fields available in the search engine schema; at least one field is required
    • an example could be:
      • "title"
      • "description"
  • queries (Optional): File containing predefined queries to use instead of or alongside generated ones. If provided, it must be a plain-text (.txt) file (e.g., "queries.txt")
  • generate_queries_from_documents (Optional): Whether to generate queries from documents. Defaults to true; set to false to disable query generation from documents
  • num_queries_needed: Total number of queries to generate, including predefined queries, if any (e.g., 20)
  • relevance_scale: Relevance scale used for scoring document relevance
    • accepted values: "binary" or "graded", where
      • binary: 0 (not relevant), 1 (relevant)
      • graded: 0 (not relevant), 1 (maybe ok), 2 (that’s my result)
  • llm_configuration_file: Path to the LLM configuration file (e.g., "dataset-generator/llm_config.yaml")
  • output_format: Output format for the generated dataset
    • accepted values:
      • "quepid"
      • "rre"
      • "mteb"
  • output_destination: Path where the output dataset will be saved (e.g., "resources")
  • save_llm_explanation: Whether to save LLM rating score explanation to file. Defaults to false
  • llm_explanation_destination (Needed only if save_llm_explanation: true): File path where <query, doc_id, rating, explanation> records will be saved (e.g., "resources/rating_explanation.json")
  • datastore_autosave_every_n_updates (Optional): Number of successful updates (adds or ratings) after which the in-memory datastore is saved. If not given, the datastore is saved at the end of the process.
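
Putting these parameters together, a minimal config.yaml might look like the following sketch. All values are illustrative, and the exact YAML shape of documents_filter is an assumption based on the list structure shown above; adjust names and values to your own schema and deployment:

```yaml
# Illustrative config.yaml sketch; values are examples, not defaults
search_engine_type: "solr"
collection_name: "testcore"
search_engine_url: "http://localhost:8983/solr/"
documents_filter:          # assumed shape: field -> list of accepted values
  genre:
    - "horror"
    - "fantasy"
  type:
    - "book"
doc_number: 100
doc_fields:
  - "title"
  - "description"
queries: "queries.txt"
generate_queries_from_documents: true
num_queries_needed: 20
relevance_scale: "binary"
llm_configuration_file: "dataset-generator/llm_config.yaml"
output_format: "quepid"
output_destination: "resources"
save_llm_explanation: false
```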

Additional parameters for RRE output

If we set output_format: "rre", the following additional parameters must be added to the configuration file.

  • id_field: id field used by the search engine (e.g., "id")
  • rre_query_placeholder: Query placeholder used in RRE for logging purposes (e.g., "$query"; this should normally be kept as "$query")
  • rre_query_template (Optional): The template file that RRE will use (if invoked) to run the evaluation. If not provided, it defaults to query_template (e.g., "resources/only_vector.json")
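
For example, the RRE-specific additions to config.yaml could look like this sketch (values illustrative):

```yaml
output_format: "rre"
id_field: "id"
rre_query_placeholder: "$query"
rre_query_template: "resources/only_vector.json"
```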

With this format, the output dataset will be saved to a file named ratings.json in the output_destination directory.

If we set output_format: "quepid", the output dataset will be saved to a file named quepid.csv in the output_destination directory.

If we set output_format: "mteb", the output dataset will be saved as three files in the output_destination directory:

  • corpus.jsonl: contains <id,title,text> corpus records extracted from search engine;
  • queries.jsonl: contains <id,text> query records LLM-generated and/or user-defined;
  • candidates.jsonl: contains <query_id,doc_id,rating> candidate records.
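
Once generated, the three mteb files can be read back with standard JSON Lines parsing. The sketch below assumes each line is a JSON object whose keys match the fields listed above (id/title/text, id/text, and query_id/doc_id/rating); the exact key names are an assumption, since the documentation lists the fields but not the JSON keys:

```python
import json

def load_mteb_dataset(corpus_path, queries_path, candidates_path):
    """Load the three MTEB-style JSONL files produced by DAGE.

    Returns the corpus as a dict keyed by document id, the queries as a
    dict keyed by query id, and the candidates as a list of records.
    Key names (id, title, text, query_id, doc_id, rating) are assumed.
    """
    def read_jsonl(path):
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    corpus = {doc["id"]: doc for doc in read_jsonl(corpus_path)}
    queries = {q["id"]: q["text"] for q in read_jsonl(queries_path)}
    candidates = read_jsonl(candidates_path)
    return corpus, queries, candidates
```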

LLM configuration file

Fill in the LLM configuration file with your information and create a .env file in the dataset-generator folder containing your API key.

From the main directory (rre-tools/), create the .env file to store the API key:

echo "OPENAI_API_KEY=<your-key-here>" > dataset-generator/.env

The parameters needed are:

  • name: The name of the provider
    • accepted providers:
      • openai
      • gemini
  • model: Chat model name of the chosen provider
  • max_tokens: An integer indicating the maximum number of tokens for the generation process
  • api_key_env: The name of the environment variable you used in the .env file (e.g., OPENAI_API_KEY or GOOGLE_API_KEY)
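
A complete llm_config.yaml following these parameters might look like the sketch below; the model name is illustrative, not a default:

```yaml
name: "openai"            # accepted providers: openai, gemini
model: "gpt-4o-mini"      # example model name; pick one offered by your provider
max_tokens: 512
api_key_env: "OPENAI_API_KEY"   # must match the variable set in .env
```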