Skip to content

Latest commit

 

History

History
94 lines (76 loc) · 2.5 KB

File metadata and controls

94 lines (76 loc) · 2.5 KB

SPARQL QA Evaluation

This Streamlit application allows you to compare different SPARQL QA models across various knowledge graphs and benchmarks.

Usage

  1. Navigate to the apps/evaluation directory
  2. Install the required dependencies: pip install -r requirements.txt
  3. Run the Streamlit app: streamlit run app.py ../../data/benchmark

Directory structure

The app expects data to be organized in the following format:

[benchmark_dir]/
  [knowledge-graph]/            # Name of the knowledge graph
    [benchmark]/                # Name of the benchmark
      test.jsonl                # Benchmark input and ground truth
      outputs/
        [model].jsonl           # Model output
        [model].config.json     # Model config
        [model].evaluation.json # Evaluation against ground truth
        ...

In this Repo for example:

data/benchmark/
  wikidata/
    qald10/
      test.jsonl
      outputs/
        qwen-72b.search_extended.jsonl
        qwen-72b.search_extended.config.json
        qwen-72b.search_extended.evaluation.json

Expected file formats

Ground truth (test.jsonl)

Each line contains a JSON object like this, where info contains arbitrary additional information:

{
  "id": "sample_0",
  "question": "Select everything",
  "sparql": "SELECT * WHERE { ?s ?p ?o }",
  "paraphrases": ["Select all triples", "Get all triples"],
  "info": {
    // arbitrary additional information here
  }
}

Model output (outputs/*.jsonl)

Each line contains a JSON object like this, with optional additional fields:

{
  "id": "sample_0",
  "sparql": "SELECT * WHERE { ?s ?p ?o }",
  // additional fields here
}

Evaluation (outputs/*.evaluation.json)

Model output evaluated against ground truth, as produced by the grasp evaluate f1 command, in the format:

{
  "sample_0": {
    // Ground truth SPARQL query
    "target": {
      "size": 12345, // Size of ground truth SPARQL results
      "err": null, // Error message if error occurred during ground truth SPARQL execution
    },
    // Model output, can be null if no output was produced
    "prediction": {
      "sparql": "SELECT * WHERE { ?s ?p ?o }",
      "err": null, // Error message if error occurred during predicted SPARQL execution
      "size": 12345, // Size of predicted SPARQL results
      "score": 0.8, // F1-score between predicted and ground truth SPARQL results
      "elapsed": 10.3 // Time taken to predict the SPARQL query by the model
  },
  // evaluations of more samples here
}