How to map a single dataset containing multiple sources to itself? #255

@KonradHoeffner

Is it possible to use LIMES with more than two sources which are all included in the same file?
The sources should be mapped to each other but of course I don't want to map a source to itself and I also don't want to have duplicate pairs (A,B) and (B,A).
To clarify with an example, let's say I have a class :Country with many instances and each country has a population of individuals.
All of this data is in the same file countries.ttl.
Now I want to find out which individuals live in more than one country.

:Germany a :Country;
 rdfs:label "Germany".

:Azerbaijan a :Country;
 rdfs:label "Azerbaijan".

:person123 a :Person;
 rdfs:label "Alex Müller";
 :country :Germany.

:person456 a :Person;
 rdfs:label "Alex Mueller";
 :country :Azerbaijan.

This can be done in the following manner, declaring source and target alike:

    <SOURCE>
        <ID>c1</ID>
        <ENDPOINT>countries.ttl</ENDPOINT>
        <VAR>?c1</VAR>
        <PAGESIZE>-1</PAGESIZE>
        <RESTRICTION>?c1 a :Person; :country ?x.</RESTRICTION>
        <PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
        <TYPE>TURTLE</TYPE>
    </SOURCE>

    <TARGET>
        <ID>c2</ID>
        <ENDPOINT>countries.ttl</ENDPOINT>
        <VAR>?c2</VAR>
        <PAGESIZE>-1</PAGESIZE>
        <RESTRICTION>?c2 a :Person; :country ?y.</RESTRICTION>
        <PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
        <TYPE>TURTLE</TYPE>
    </TARGET>

    <METRIC>trigrams(c1.label,c2.label)</METRIC>

However, this generates a false match of every person to itself and also matches each pair twice, once in each direction.
I would like to add a restriction like "STR(?x) < STR(?y)", but it seems that one cannot reference variables from the source in the restriction of the target.
A workaround is to throw away all matches with a score of exactly 1.0, but this wastes resources and also discards correct matches whose labels happen to be exactly equal.
Also, this configuration maps people in a country to others in the same country, which is not intended.

    <ACCEPTANCE>
        <THRESHOLD>1</THRESHOLD>
        <FILE>exact.ttl</FILE>
        <RELATION>owl:sameAs</RELATION>
    </ACCEPTANCE>
    
    <REVIEW>
        <THRESHOLD>0.8</THRESHOLD>
        <FILE>close.ttl</FILE>
        <RELATION>owl:sameAs</RELATION>
    </REVIEW>

Another way is to perform postprocessing to remove all duplicate and self matches, but that seems inefficient in both developer and execution time.
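
To make the postprocessing option concrete, here is a minimal sketch in Python. It assumes the LIMES output has already been parsed into (source, target, score) tuples and that a person-to-country lookup is available; the names `dedupe_links` and `country_of` are hypothetical and not part of LIMES.

```python
def dedupe_links(links, country_of):
    """Keep one link per unordered pair; drop self and same-country matches.

    links      -- iterable of (source, target, score) tuples (assumed format)
    country_of -- dict mapping each person IRI to its country (assumed lookup)
    """
    best = {}
    for source, target, score in links:
        if source == target:
            continue  # a person matched to itself
        if country_of[source] == country_of[target]:
            continue  # both people are in the same country
        # Order the pair so that (A, B) and (B, A) share one key.
        key = (source, target) if source < target else (target, source)
        if key not in best or score > best[key]:
            best[key] = score
    return sorted((s, t, score) for (s, t), score in best.items())
```

This still pays the full cost of computing the unwanted matches in LIMES first, which is the inefficiency mentioned above.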

Lastly, I could write a script that enumerates all n*(n-1)/2 unique non-self-matching pairs and generates as many LIMES configuration files, but that has its own problems.
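
A rough sketch of the enumeration part of such a script, in Python. It only produces the RESTRICTION strings for each unordered pair of countries, reusing the restriction syntax from the configuration above; the function name `restriction_pairs` is hypothetical, and filling the strings into a full configuration file and running LIMES once per pair is left out.

```python
from itertools import combinations


def restriction_pairs(countries):
    """Return (source_restriction, target_restriction) for each unordered pair.

    countries -- prefixed names of the country resources, e.g. ":Germany"
    """
    pairs = []
    # combinations() yields every unordered pair once and never pairs a
    # country with itself, which gives n*(n-1)/2 pairs for n countries.
    for a, b in combinations(sorted(countries), 2):
        source = f"?c1 a :Person; :country {a}."
        target = f"?c2 a :Person; :country {b}."
        pairs.append((source, target))
    return pairs
```

For the two countries in the example this yields a single configuration; the number of configurations and LIMES runs grows quadratically with the number of countries.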

Is there any way to solve this task efficiently using LIMES or do I need to use one of the mentioned imperfect options?
