This module provides a flexible command-line tool to generate relevance datasets for search evaluation. It can retrieve documents from a search engine, generate synthetic queries, and score the relevance of document-query pairs using LLMs.
Create a config.yaml file in the dataset-generator directory. This file controls the entire generation process. The parameters needed are:
- query_template (Optional): Template file given by the user to evaluate the search system using relevance judgements generated by the dataset-generator module (e.g., "resources/template_solr.json")
- search_engine_type: Type of search engine to use
- accepted values:
- "solr"
- "elasticsearch"
- "opensearch"
- collection_name: Name of the search engine index/collection (e.g., "testcore", the one used in Docker containers)
- search_engine_url: URL of the search engine (e.g., "http://localhost:8983/solr/")
- documents_filter: Filter query to restrict the set of documents used to generate queries. If a field lists more than one value, documents matching at least one of those values are retrieved (OR semantics within a field). If more than one field is given, documents must satisfy every field (AND semantics across fields)
- an example could be:
- genre:
- "horror"
- "fantasy"
- type:
- "book"
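As a sketch, the filter above corresponds to the following YAML fragment in config.yaml (the field names genre and type are illustrative and must exist in your search engine schema):

```yaml
# Retrieves documents whose type is "book" (AND across fields)
# and whose genre is "horror" OR "fantasy" (OR within a field).
documents_filter:
  genre:
    - "horror"
    - "fantasy"
  type:
    - "book"
```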
- doc_number: Maximum number of documents to retrieve from the search engine to generate queries (e.g. 100)
- doc_fields: List of document fields used for query generation and relevance scoring. These should match fields available in the search engine schema, and at least one field must be provided
- an example could be:
- "title"
- "description"
- queries (Optional): File containing predefined queries to use instead of, or alongside, generated ones. If provided, it must be a plain-text (.txt) file (e.g., "queries.txt")
- generate_queries_from_documents (Optional): Whether to generate queries from documents. Defaults to true; set to false to disable query generation from documents
- num_queries_needed: Total number of queries to generate, including predefined queries, if any (e.g., 20)
- relevance_scale: Relevance scale used for scoring document relevance
- accepted values: "binary" or "graded", where
- binary: 0 (not relevant), 1 (relevant)
- graded: 0 (not relevant), 1 (maybe ok), 2 (that’s my result)
- llm_configuration_file: Path to the LLM configuration file (e.g., "dataset-generator/llm_config.yaml")
- output_format: Output format for the generated dataset
- accepted values:
- "quepid"
- "rre"
- "mteb"
- output_destination: Path where the output dataset will be saved (e.g., "resources")
- save_llm_explanation: Whether to save the LLM rating score explanation to a file. Defaults to false
- llm_explanation_destination (Needed only if save_llm_explanation: true): File path where <query, doc_id, rating, explanation> records will be written (e.g., "resources/rating_explanation.json")
- datastore_autosave_every_n_updates (Optional): Number of successful updates (adds or ratings) after which the in-memory datastore is saved. If not given, the datastore is saved only at the end of the process.
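Putting the parameters together, a minimal config.yaml might look like the following (all values are illustrative examples; adjust them to your own search engine and schema):

```yaml
# Illustrative config.yaml -- every value here is an example
search_engine_type: "solr"
collection_name: "testcore"
search_engine_url: "http://localhost:8983/solr/"
documents_filter:
  genre:
    - "horror"
doc_number: 100
doc_fields:
  - "title"
  - "description"
queries: "queries.txt"
generate_queries_from_documents: true
num_queries_needed: 20
relevance_scale: "binary"
llm_configuration_file: "dataset-generator/llm_config.yaml"
output_format: "quepid"
output_destination: "resources"
save_llm_explanation: false
```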
If we set output_format: "rre", the following additional parameters must be added to the configuration file.
- id_field: id field used by the search engine (e.g., "id")
- rre_query_placeholder: Query placeholder used in RRE for logging purposes (e.g., "$query"; this should generally be left as "$query")
- rre_query_template (Optional): The file that RRE (if invoked) will use to run the evaluation. If not set, it defaults to the value of query_template (e.g., "resources/only_vector.json")
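For reference, this is the extra fragment that an RRE-targeted config.yaml would contain (values are examples taken from the parameter descriptions above):

```yaml
# Additional parameters required when output_format is "rre"
output_format: "rre"
id_field: "id"
rre_query_placeholder: "$query"
rre_query_template: "resources/only_vector.json"
```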
In addition, the output dataset will be saved into a file named ratings.json in the output_destination directory.
If we set output_format: "quepid", the output dataset will be saved into a file named quepid.csv in the output_destination directory.
If we set output_format: "mteb", output dataset will be saved into 3 different files in the output_destination
directory:
- corpus.jsonl: contains <id, title, text> corpus records extracted from the search engine
- queries.jsonl: contains <id, text> query records, LLM-generated and/or user-defined
- candidates.jsonl: contains <query_id, doc_id, rating> candidate records
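As a hypothetical sketch of what one line of each file could look like (the exact JSON field names are assumptions inferred from the record descriptions above, not confirmed by the tool):

```
# corpus.jsonl
{"id": "doc_1", "title": "Dracula", "text": "A classic horror novel."}
# queries.jsonl
{"id": "q_1", "text": "classic vampire horror books"}
# candidates.jsonl
{"query_id": "q_1", "doc_id": "doc_1", "rating": 1}
```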
Fill in the LLM configuration file with your information and create a .env file in the dataset-generator folder with your own API key.
From the main directory (rre-tools/), create the .env file to store the API key:
echo "OPENAI_API_KEY=<your-key-here>" > dataset-generator/.env
The parameters needed in the LLM configuration file are:
- name: The name of the provider
- accepted providers:
- openai
- gemini
- model: Chat model name of the chosen provider
- max_tokens: An integer indicating the maximum number of tokens for the generation process
- api_key_env: The name of the environment variable holding the API key, the same one you used in the .env file (e.g., OPENAI_API_KEY or GOOGLE_API_KEY)
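A minimal llm_config.yaml could look like this (the model name and token limit are illustrative examples, not recommended defaults):

```yaml
# Illustrative llm_config.yaml -- adjust model and limits to your provider
name: "openai"
model: "gpt-4o-mini"
max_tokens: 1024
api_key_env: "OPENAI_API_KEY"
```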