Documentation of a text-to-knowledge-graph pipeline using an LLM (Llama-3.1-8B-Instruct, deployed through Hugging Face) for ontology population, and its related evaluation strategy.
The evaluation of this experiment is presented in Boscariol M, Meschini S, Tagliabue LC (2025), "Knowledge engineering with LLMs for asset information management in the built environment", Proceedings of the Institution of Civil Engineers - Smart Infrastructure and Construction. https://doi.org/10.1680/jsmic.24.00035
This project explores how Large Language Models (LLMs) can automate the creation of Knowledge Graphs (KGs) from unstructured documents in the built environment domain. It introduces a pipeline that extracts entities and relationships using guided few-shot prompting, outputs Terse RDF Triple Language (Turtle) data, and aligns results with existing domain ontologies. The approach compares inference at different text granularities (full text, paragraph, and sentence) to evaluate semantic alignment and syntactic consistency. The experiment was conducted on a use case from the University of Torino, involving requirements for the renovation of a museum. Early results show the method’s potential to support asset information management, improve data structuring, and advance knowledge engineering in construction and facility management contexts.
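The guided few-shot prompting step pairs an example document with its Turtle annotation before appending the text to convert. A minimal sketch of how such a prompt could be assembled; the instruction wording, the `build_prompt` helper, and the sample triples are illustrative assumptions, not the exact prompt used in the study:

```python
# Sketch of assembling a guided few-shot prompt for Turtle extraction.
# The instruction text and helper name are assumptions; the actual
# prompt used in the experiment may differ.

def build_prompt(example_text: str, example_ttl: str, new_text: str) -> str:
    """Pair an example document with its Turtle annotation, then append
    the new text to be converted."""
    return (
        "Extract entities and relationships from the text and return them "
        "as Turtle (TTL) triples aligned with the domain ontology.\n\n"
        f"Example text:\n{example_text}\n\n"
        f"Example output:\n{example_ttl}\n\n"
        f"Text to convert:\n{new_text}\n\nOutput:"
    )

prompt = build_prompt(
    "Room A-101 requires a new HVAC unit.",
    ":RoomA101 :requires :HVACUnit .",
    "The museum hall needs updated lighting.",
)
```

In the actual pipeline, the example pair would come from `input/example.txt` and the resulting prompt would be sent to the deployed model.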
.
├── python scripts llama/ # Main LLM-based text-to-KG scripts
│ ├── llama-text-to-KG-*.py # Three different granularity approaches
│ └── error analysis/ # Scripts for analyzing and evaluating results
│ └── tables/ # Output tables from analysis
├── input/ # Input data directory
└── results/ # Output RDF files directory
Located in python scripts llama/:
- llama-text-to-KG-full-text.py: Processes the entire text as a single input
- llama-text-to-KG-paragraphs.py: Processes the text paragraph by paragraph
- llama-text-to-KG-sentence.py: Processes the text sentence by sentence
These scripts take input from the input/ directory and generate RDF triples in the results/ directory.
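The three scripts differ only in how they segment the input text before inference. A minimal sketch of the three granularities; the concrete splitting rules (blank-line paragraphs, naive punctuation-based sentences) are assumptions and may differ from what the scripts actually do:

```python
import re

def segment(text: str, granularity: str) -> list[str]:
    """Split input text according to the chosen inference granularity."""
    if granularity == "full-text":
        return [text]  # one LLM call covering the whole document
    if granularity == "paragraphs":
        # assumption: paragraphs are separated by blank lines
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    if granularity == "sentence":
        # assumption: naive split on ., ! or ? followed by whitespace
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    raise ValueError(f"unknown granularity: {granularity}")

doc = "The museum needs new lighting. Access must be step-free.\n\nBudget is fixed."
parts = segment(doc, "sentence")  # three sentences for this sample text
```

Each segment is then sent to the model independently, which is what makes the granularities comparable in terms of semantic alignment and syntactic consistency.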
Located in python scripts llama/error analysis/. Run these scripts in the following order:
- 01_ttl-json-df.py: Converts TTL files to JSON format for analysis
- 02_gt-json-df.py: Processes ground truth data into comparable format
- 03_fp-gt-similarity.py: Analyzes similarity between false positives and ground truth
- 03_fp-vs-gt.py: Compares false positives against ground truth data
- 04_tables.py: Generates final analysis tables in the tables/ directory
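Once predicted and ground-truth triples are in a comparable format, the comparison reduces to set operations. A simplified sketch of that logic; real TTL parsing (e.g. with rdflib) is omitted and triples are represented here as plain tuples:

```python
def classify(predicted: set, ground_truth: set):
    """Split predicted triples into TP/FP and derive FN from the ground truth."""
    true_positives = predicted & ground_truth   # correct extractions
    false_positives = predicted - ground_truth  # extracted but not annotated
    false_negatives = ground_truth - predicted  # annotated but missed
    return true_positives, false_positives, false_negatives

# hypothetical triples for illustration
pred = {(":Hall", ":requires", ":Lighting"), (":Hall", ":locatedIn", ":Museum")}
gt   = {(":Hall", ":requires", ":Lighting"), (":Hall", ":hasArea", "120")}
tp, fp, fn = classify(pred, gt)
```

These three sets correspond to the true_positive_*, false_positive_* and false_negative_* files written to results/.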
- example.txt: Example input text file, to be fed to the system with a few-shot prompting strategy
- input.txt: Main input text file, used to run the experiment
- ground1.ttl: Ground truth RDF data for evaluation, referring to the content of input.txt
Contains the output files for each granularity approach (full text, paragraph, sentence):
- Complete RDF output (rdf_output_*.ttl)

And all related metrics, for each inference scenario:

- True positives (true_positive_*.ttl)
- False positives (false_positive_*.ttl)
- False negatives (false_negative_*.ttl)
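The false-positive similarity analysis (03_fp-gt-similarity.py) can be thought of as finding, for each false positive, the closest ground-truth triple. A sketch using the standard library's difflib as the string-similarity measure; this metric choice, the `closest_gt` helper, and the sample triples are assumptions, not necessarily what the script implements:

```python
from difflib import SequenceMatcher

def closest_gt(fp_triple: str, gt_triples: list) -> tuple:
    """Return the ground-truth triple most similar to a false positive,
    together with its similarity ratio in [0, 1]."""
    scored = [(gt, SequenceMatcher(None, fp_triple, gt).ratio()) for gt in gt_triples]
    return max(scored, key=lambda pair: pair[1])

# hypothetical triples: a near-miss extraction vs. an unrelated one
best, score = closest_gt(
    ":Hall :requires :LightingSystem .",
    [":Hall :requires :Lighting .", ":Museum :hasFloor :Ground ."],
)
```

A high similarity score flags false positives that are near-misses (e.g. a slightly different entity label) rather than outright hallucinations, which is useful when interpreting the error tables.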
The repository currently contains the input files and output results related to the case study of the experiment. The input/ directory includes the example and input texts used in the study, along with their ground truth RDF data. The results/ directory and error analysis/tables/ contain all the generated outputs, evaluation metrics, and analysis tables from this specific experiment.
To test this pipeline on your input documents:
- Place your input text in input/input.txt
- Prepare a ground truth for your input text, in TTL syntax, in input/ground1.ttl
- Adapt the example to your use case, providing the required ontology concepts, in input/example.txt
- Run one of the text-to-KG scripts:
python "python scripts llama/llama-text-to-KG-full-text.py"
# or
python "python scripts llama/llama-text-to-KG-paragraphs.py"
# or
python "python scripts llama/llama-text-to-KG-sentence.py"
- Run the error analysis scripts in sequence:
cd "python scripts llama/error analysis"
python 01_ttl-json-df.py
python 02_gt-json-df.py
python 03_fp-gt-similarity.py
python 03_fp-vs-gt.py
python 04_tables.py
The final analysis results will be available in the error analysis/tables/ directory.
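From the triple counts in the true_positive_*, false_positive_* and false_negative_* files, the standard evaluation scores behind those tables can be reproduced. A minimal sketch; the exact table layout and any additional metrics produced by 04_tables.py may differ:

```python
def prf(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F1 from triple-level TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# hypothetical counts for one inference scenario
metrics = prf(tp=30, fp=10, fn=20)
```

Computing these per granularity (full text, paragraph, sentence) makes the three inference scenarios directly comparable.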