hmuhajab/DLIG-KG_with_WorldKG

Preprocessing Stage

A) Preprocessing Folder

  1. Extract Instances from Ontology:

    • Extract the instances from the ontology of England and Wales and save them in n-triple files with the type of the class.
    • Input: England.owl and Wales.owl
    • Output: England_triples.nt and Wales_triples.nt

    Example of output: [image: Example Output]

  2. Second Data Source (DataSource_1_YAGO):

    • Use OS_matches.nt from DataSource_1_YAGO.
    • Example of the data: [image: Data Example]

  3. Extract YAGO with Geometry Data:

    • Extract the YAGO entities that have geometry data from the extended OS file (extend.OS). This is needed for the geometry distance filter. The output contains the URI of each entity and its polygon data.
    • Input: OS_extended.ttl
    • Output: OS_extended_geometry.ttl

B) Geometry_with_geomodeluri Folder

This preprocessing step extracts the actual URIs of the entities from the GeoModel together with their geometry points and saves them in one file. It involves the following steps:

  1. Step 1: Extract the OS_id and URI of all instances and save them into an n-triple file (for Wales and England).

    • Input: Wales.owl, England.owl
    • Output: instance_geomodel_Wales_OS.nt, instance_geomodel_England_OS.nt

    Example of output: [image: Example Output]

  2. Step 2: Match the URIs in the geometry folders with the URIs in the .nt files that have OS_id.

    • For example: instance_geomodel_England_OS.nt
    • The output file should contain the URI of geom and the geometry.

    Note: The geometry folders GeometryEngland and GeometryWales contain files that hold the OS URI and the geometry data. These files were extracted in a previous task; see GeomtryExtraction_readme.txt.

    • Input 1: Geometry data files such as GeometryEngland/local_england_ced_geometry.json
    • Input 2: The file extracted in step 1 for England and Wales such as instance_geomodel_England_OS.nt
    • Output: File containing geom URI with the geometry points such as GeometryEngland/GeoModel_england_ced_geometry.json
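The two steps above amount to a key-based join. The following is a minimal sketch, not the repository's code. It assumes (hypothetically) that each line of the step 1 .nt file has the shape `<GeoModel URI> <predicate> "OS id"` and that a geometry file maps each OS id to a list of points. The function names, the example URIs, and both file layouts are illustrative assumptions.

```python
import re

# Assumed triple shape: <GeoModel URI> <predicate> "OS id" .
NT_LINE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"([^"]*)"')


def load_os_ids(nt_lines):
    """Return a dict mapping OS id -> GeoModel URI (step 1 output)."""
    mapping = {}
    for line in nt_lines:
        match = NT_LINE.match(line.strip())
        if match:
            geomodel_uri, _predicate, os_id = match.groups()
            mapping[os_id] = geomodel_uri
    return mapping


def attach_geometry(os_id_to_uri, os_geometry):
    """Re-key the geometry by GeoModel URI, dropping ids with no match (step 2)."""
    return {
        os_id_to_uri[os_id]: geometry
        for os_id, geometry in os_geometry.items()
        if os_id in os_id_to_uri
    }


# Hypothetical sample data, for illustration only.
sample_nt = [
    '<http://example.org/geomodel/E1> <http://example.org/hasOSid> "os-001" .',
    '<http://example.org/geomodel/E2> <http://example.org/hasOSid> "os-002" .',
]
sample_geometry = {
    "os-001": [[530000.0, 180000.0], [530100.0, 180100.0]],
    "os-999": [[1.0, 2.0]],  # no GeoModel instance, so it is dropped
}

geomodel_geometry = attach_geometry(load_os_ids(sample_nt), sample_geometry)
```

In the real pipeline the lines would come from a file such as instance_geomodel_England_OS.nt and the result would be written out with `json.dump`.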

Main Task: Similarity

The main task is run by main_task.py in the Similarity folder and consists of the following steps:

Label Similarity

  1. Install Dependencies:

    • Install rdflib and jellyfish to perform label similarity on the RDF data.
  2. Jaro-Winkler Similarity:

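    • Compute the Jaro-Winkler similarity between labels. The Jaro score is based on the number of matching characters and the number of transpositions; the Winkler adjustment then raises the score for strings that share a common prefix, up to a fixed prefix length.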

    • Threshold: label_similarity_threshold = 0.55

    • A) England

      • Input: DataSource_1_YAGO.nt, England_triples.nt
      • Output: matches_England_jaro.csv
      • Number of matched entities for England: 7823615
    • B) Wales

      • Input: DataSource_1_YAGO.nt, Wales_triples.nt
      • Output: matches_Wales_jaro.csv
      • Number of matched entities for Wales: 73013307
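To make the scoring concrete, here is a minimal pure-Python sketch of Jaro-Winkler and of thresholded label matching. It is not the code in main_task.py, which relies on the jellyfish library. The lowercasing, the `>=` comparison, the dictionary inputs, and the `match_labels` helper are assumptions made for illustration; only the 0.55 threshold comes from this README.

```python
LABEL_SIMILARITY_THRESHOLD = 0.55  # value stated in this README


def jaro(s1, s2):
    """Jaro similarity: matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro score boosted by the length of the shared prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def match_labels(yago_labels, os_labels, threshold=LABEL_SIMILARITY_THRESHOLD):
    """Return (yago_uri, os_uri, score) for every pair at or above the threshold.

    Both arguments are assumed to be dicts of URI -> label (hypothetical layout).
    """
    matches = []
    for yago_uri, yago_label in yago_labels.items():
        for os_uri, os_label in os_labels.items():
            score = jaro_winkler(yago_label.lower(), os_label.lower())
            if score >= threshold:
                matches.append((yago_uri, os_uri, score))
    return matches
```

Because every YAGO label is compared with every OS label, the number of comparisons grows with the product of the two input sizes, and a threshold as permissive as 0.55 keeps many candidate pairs. This is why the geometry filter below is needed to narrow the results.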

Geometry Filter

  1. Load Geometry Data for England and Wales:

    • Load the geometry data into a dictionary.
  2. Coordinate Transformation:

    • Reproject the OS geometry data so that both sources use the same coordinate reference system.

    • OS geometry uses the British National Grid projected coordinate system (EPSG:27700), while YAGO uses WGS 84 geographic coordinates (EPSG:4326). Convert the OS geometry to WGS 84 before applying the geometry distance filter.

    • Input: All the geometry files for England and Wales

    • Output: england_polygons_wgs84.json, wales_polygons_wgs84.json

  3. Load YAGO Geometry Data:

    • Input: OS_extended_geometry.ttl (from step 3 in the preprocessing (A) above)
  4. Apply Geometry Distance Filter:

    • Compute the Euclidean distance between YAGO and England geometries, and between YAGO and Wales geometries, and keep the pairs within a threshold of 0.2.
    • Output: matches_england_geometry.csv, matches_wales_geometry.csv
  5. Filtered Results:

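    • Perform an inner merge of the label matches and the geometry matches on the 'Yago' and 'England' columns for England, and on the 'Yago' and 'Wales' columns for Wales, so that only pairs passing both filters remain.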
    • Result: merged_df_wales_matches.json, merged_df_england_matches.json
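Steps 4 and 5 can be sketched as follows in plain Python. This is an illustration under stated assumptions, not the repository's implementation: the README does not say how the distance between two polygons is measured, so this sketch takes the mean of each polygon's vertices as a representative point. The function names, the dict and tuple layouts, and the sample coordinates are also hypothetical. Only the 0.2 threshold comes from the text.

```python
import math

DISTANCE_THRESHOLD = 0.2  # value stated in this README


def representative_point(points):
    """Mean of the vertices (an assumed stand-in for the polygon's position)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def geometry_matches(yago_geoms, os_geoms, threshold=DISTANCE_THRESHOLD):
    """Return the set of (yago_uri, os_uri) pairs within the distance threshold.

    Both inputs are assumed to be dicts of URI -> list of (lon, lat) points,
    already expressed in the same coordinate system.
    """
    pairs = set()
    for yago_uri, yago_points in yago_geoms.items():
        yago_point = representative_point(yago_points)
        for os_uri, os_points in os_geoms.items():
            os_point = representative_point(os_points)
            if math.dist(yago_point, os_point) <= threshold:
                pairs.add((yago_uri, os_uri))
    return pairs


def inner_merge(label_matches, geom_pairs):
    """Keep only the label matches whose (yago_uri, os_uri) pair passed the geometry filter."""
    return [m for m in label_matches if (m[0], m[1]) in geom_pairs]


# Hypothetical sample data, for illustration only.
yago = {"yago:A": [(-3.18, 51.48)]}
england = {
    "os:1": [(-3.2, 51.5), (-3.1, 51.5), (-3.1, 51.4), (-3.2, 51.4)],
    "os:2": [(0.0, 52.0)],
}
label_matches = [("yago:A", "os:1", 0.9), ("yago:A", "os:2", 0.8)]

close_pairs = geometry_matches(yago, england)
filtered = inner_merge(label_matches, close_pairs)
```

The output names merged_df_wales_matches.json and merged_df_england_matches.json suggest that the repository performs this step on data frames; the set lookup here has the same effect as an inner merge on the two key columns.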

Some files needed to run the code are large and are available on request: muhajabh@cardiff.ac.uk, abdelmotyai@cardiff.ac.uk

Label similarity output, Geometry Filter Task_Geometry Data and output
