
Dataset Generator (DAGE)

DAGE is a flexible command-line tool for generating relevance datasets for search evaluation. It can retrieve documents from a search engine, generate synthetic queries, and score the relevance of document-query pairs using LLMs.

Setup configuration files

DAGE configuration file

Create a config.yaml file in the dataset-generator directory. This file controls the entire generation process. The parameters needed are:

  • query_template (Optional): Template file given by the user to evaluate the search system using relevance judgements generated by the dataset-generator module (e.g., "resources/template_solr.json")
  • search_engine_type: Type of search engine to use
    • accepted values:
      • "solr"
      • "elasticsearch"
      • "opensearch"
  • collection_name: Name of the search engine index/collection (e.g., "testcore", the one used in Docker containers)
  • search_engine_url: URL of the search engine (e.g., "http://localhost:8983/solr/")
  • documents_filter: Filter query to restrict the set of documents used to generate queries. If a field has more than one value, the retrieved documents must contain at least one of those values in that field (OR-like). If more than one field is given, documents must match all fields (AND-like)
    • an example could be:
      • genre:
        • "horror"
        • "fantasy"
      • type:
        • "book"
  • doc_number: Maximum number of documents to retrieve from the search engine to generate queries (e.g. 100)
  • doc_fields: List of document fields used for query generation and relevance scoring. These should match fields available in the search engine schema; at least one field is required
    • an example could be:
      • "title"
      • "description"
  • queries (Optional): File containing predefined queries to use instead of or alongside generated ones. If provided, it must be a plain-text (.txt) file (e.g., "queries.txt")
  • generate_queries_from_documents (Optional): Whether to generate queries from documents. Defaults to true; set to false to disable query generation from documents
  • num_queries_needed: Total number of queries to generate, including predefined queries, if any (e.g., 20)
  • relevance_scale: Relevance scale used for scoring document relevance
    • accepted values: "binary" or "graded", where
      • binary: 0 (not relevant), 1 (relevant)
      • graded: 0 (not relevant), 1 (maybe ok), 2 (that’s my result)
  • llm_configuration_file: Path to the LLM configuration file (e.g., "dataset-generator/llm_config.yaml")
  • output_format: Output format for the generated dataset
    • accepted values:
      • "quepid"
      • "rre"
      • "mteb"
  • output_destination: Path where the output dataset will be saved (e.g., "resources")
  • save_llm_explanation: Whether to save LLM rating score explanation to file. Defaults to false
  • llm_explanation_destination (Needed only if save_llm_explanation: true): File path where <query, doc_id, rating, explanation> records will be saved (e.g., "resources/rating_explanation.json")
  • datastore_autosave_every_n_updates (Optional): Number of successful updates (adds or ratings) after which the in-memory datastore is saved. If not given, the datastore is saved at the end of the process.
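
Putting these parameters together, a minimal config.yaml might look like the following sketch. All values are illustrative, and the exact YAML shape of documents_filter is an assumption based on the list structure shown above; adjust names and values to your own schema and deployment:

```yaml
# Illustrative config.yaml sketch; values are examples, not defaults
search_engine_type: "solr"
collection_name: "testcore"
search_engine_url: "http://localhost:8983/solr/"
documents_filter:          # assumed shape: field -> list of accepted values
  genre:
    - "horror"
    - "fantasy"
  type:
    - "book"
doc_number: 100
doc_fields:
  - "title"
  - "description"
queries: "queries.txt"
generate_queries_from_documents: true
num_queries_needed: 20
relevance_scale: "binary"
llm_configuration_file: "dataset-generator/llm_config.yaml"
output_format: "quepid"
output_destination: "resources"
save_llm_explanation: false
```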

Additional parameters for RRE output

If we set output_format: "rre", the following additional parameters must be added to the configuration file.

  • id_field: id field used by the search engine (e.g., "id")
  • rre_query_placeholder: Query placeholder used in RRE for logging purposes (e.g., "$query"; this should normally be kept as "$query")
  • rre_query_template (Optional): The template file that RRE will use (if invoked) to run the evaluation. If not provided, it defaults to query_template (e.g., "resources/only_vector.json")
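
For example, the RRE-specific additions to config.yaml could look like this sketch (values illustrative):

```yaml
output_format: "rre"
id_field: "id"
rre_query_placeholder: "$query"
rre_query_template: "resources/only_vector.json"
```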

With this format, the output dataset will be saved to a file named ratings.json in the output_destination directory.

If we set output_format: "quepid", the output dataset will be saved to a file named quepid.csv in the output_destination directory.

If we set output_format: "mteb", the output dataset will be saved as three files in the output_destination directory:

  • corpus.jsonl: contains <id,title,text> corpus records extracted from search engine;
  • queries.jsonl: contains <id,text> query records LLM-generated and/or user-defined;
  • candidates.jsonl: contains <query_id,doc_id,rating> candidate records.
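
Once generated, the three mteb files can be read back with standard JSON Lines parsing. The sketch below assumes each line is a JSON object whose keys match the fields listed above (id/title/text, id/text, and query_id/doc_id/rating); the exact key names are an assumption, since the documentation lists the fields but not the JSON keys:

```python
import json

def load_mteb_dataset(corpus_path, queries_path, candidates_path):
    """Load the three MTEB-style JSONL files produced by DAGE.

    Returns the corpus as a dict keyed by document id, the queries as a
    dict keyed by query id, and the candidates as a list of records.
    Key names (id, title, text, query_id, doc_id, rating) are assumed.
    """
    def read_jsonl(path):
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    corpus = {doc["id"]: doc for doc in read_jsonl(corpus_path)}
    queries = {q["id"]: q["text"] for q in read_jsonl(queries_path)}
    candidates = read_jsonl(candidates_path)
    return corpus, queries, candidates
```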

LLM configuration file

Fill in the LLM configuration file with your information and create a .env file in the dataset-generator folder containing your API key.

From the main directory (rre-tools/), create the .env file to store the API key:

echo "OPENAI_API_KEY=<your-key-here>" > dataset-generator/.env

The parameters needed are:

  • name: The name of the provider
    • accepted providers:
      • openai
      • gemini
  • model: Chat model name of the chosen provider
  • max_tokens: An integer indicating the maximum number of tokens for the generation process
  • api_key_env: The name of the environment variable you used in the .env file (e.g., OPENAI_API_KEY or GOOGLE_API_KEY)
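
A complete llm_config.yaml following these parameters might look like the sketch below; the model name is illustrative, not a default:

```yaml
name: "openai"            # accepted providers: openai, gemini
model: "gpt-4o-mini"      # example model name; pick one offered by your provider
max_tokens: 512
api_key_env: "OPENAI_API_KEY"   # must match the variable set in .env
```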