This Streamlit application allows you to compare different SPARQL QA models across various knowledge graphs and benchmarks.
- Navigate to the
apps/evaluationdirectory - Install the required dependencies:
pip install -r requirements.txt - Run the Streamlit app:
streamlit run app.py ../../data/benchmark
The app expects data to be organized in the following format:
[benchmark_dir]/
[knowledge-graph]/ # Name of the knowledge graph
[benchmark]/ # Name of the benchmark
test.jsonl # Benchmark input and ground truth
outputs/
[model].jsonl # Model output
[model].config.json # Model config
[model].evaluation.json # Evaluation against ground truth
...
In this Repo for example:
data/benchmark/
wikidata/
qald10/
test.jsonl
outputs/
qwen-72b.search_extended.jsonl
qwen-72b.search_extended.config.json
qwen-72b.search_extended.evaluation.json
Each line contains a JSON object like this, where info contains arbitrary additional
information:
Each line contains a JSON object like this, with optional additional fields:
{
"id": "sample_0",
"sparql": "SELECT * WHERE { ?s ?p ?o }",
// additional fields here
}Model output evaluated against ground truth, as
produced by the grasp evaluate f1 command, in the format:
{
"sample_0": {
// Ground truth SPARQL query
"target": {
"size": 12345, // Size of ground truth SPARQL results
"err": null, // Error message if error occurred during ground truth SPARQL execution
},
// Model output, can be null if no output was produced
"prediction": {
"sparql": "SELECT * WHERE { ?s ?p ?o }",
"err": null, // Error message if error occurred during predicted SPARQL execution
"size": 12345, // Size of predicted SPARQL results
"score": 0.8, // F1-score between predicted and ground truth SPARQL results
"elapsed": 10.3 // Time taken to predict the SPARQL query by the model
},
// evaluations of more samples here
}
{ "id": "sample_0", "question": "Select everything", "sparql": "SELECT * WHERE { ?s ?p ?o }", "paraphrases": ["Select all triples", "Get all triples"], "info": { // arbitrary additional information here } }