This repository was archived by the owner on Sep 28, 2022. It is now read-only.

Much larger RDR file sizes versus reified format #26

Open
@schivmeister

The Problem

We have CSV datasets that we are converting to RDF/Turtle using TARQL. When we got to our larger data files, we were surprised by how large the output files were.

We expected large files for the Reified datasets. For Star we expected smaller files, so we were thrown off when the Star files turned out larger than the Reified ones.

No. of lines:

  32534402 ./foobar.reif.ttl
  15015877 ./foobar.ttls
   2502647 ./foobar.csv

That's 2.5M CSV vs. 15M RDFStar vs. 33M Reified. Reified is 13x CSV and 2x RDFStar, so we can consider this a 2x reduction from Reified to RDFStar. Fair enough and expected.

And now the sizes:

2.1G ./foobar.ttls
1.4G ./foobar.reif.ttl
287M ./foobar.csv

That's 0.3G CSV vs. 1.4G Reified vs. 2.1G RDFStar. Reified is about 5x CSV, but RDFStar is about 7x CSV and 1.5x Reified -- a net increase from Reified to RDFStar!
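The ratios can be rechecked with a few lines of Python. This is only a sketch over the figures listed above; it assumes the "G" and "M" suffixes are binary units (GiB, MiB), and the names `LINES`, `SIZES`, `ratio` and `bytes_per_line` are made up for illustration.

```python
# Figures copied from the listings above. Sizes are converted to bytes on the
# assumption that "G" and "M" are binary units (GiB, MiB).
LINES = {"csv": 2_502_647, "star": 15_015_877, "reified": 32_534_402}
SIZES = {"csv": 287 * 2**20, "star": 2.1 * 2**30, "reified": 1.4 * 2**30}


def ratio(table, a, b):
    """Return how many times larger entry a is than entry b."""
    return table[a] / table[b]


def bytes_per_line(fmt):
    """Average line length in bytes (newline included) for one format."""
    return SIZES[fmt] / LINES[fmt]


for a, b in [("reified", "csv"), ("star", "csv"), ("reified", "star")]:
    print(f"lines {a}/{b}: {ratio(LINES, a, b):.1f}x")

for a, b in [("reified", "csv"), ("star", "csv"), ("star", "reified")]:
    print(f"bytes {a}/{b}: {ratio(SIZES, a, b):.1f}x")

for fmt in ("reified", "star"):
    print(f"{fmt}: about {bytes_per_line(fmt):.0f} bytes per line")
```

On these figures the Star file has fewer lines than the Reified file but a much longer average line.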

The Data

CSV

Every row in this example is one relationship instance, and every column is a property on that relationship (i.e. an edge property).

foo_uid,foo_eid,start_date,end_date,title,a_type,b_type,compensation,source_entity_id,source_id,bar_uid,bar_eid,some_ts
76215f08-f66d-40fb-a549-e28ef2aebc9a,3721239,,,Monkey,,,,,,28065e37-6f7b-4d01-96f6-446d6e738fe0,8909756956994228352,

Note: From our line/row counts, we know this file represents 2502646 (2.5M) resources aka objects (the relationship instances together with their attributes/properties).

Reified

Actual (indents from TARQL removed):

ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 .

_:b0 rdf:type rdf:Statement ;
       rdf:subject ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ;
       rdf:predicate ns0:somePred ;
       rdf:object ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 ;
       ns0:foo_uid "76215f08-f66d-40fb-a549-e28ef2aebc9a" ;
       ns0:foo_eid "3721239" ;
       ns0:title "Monkey" ;
       ns0:bar_uid   "28065e37-6f7b-4d01-96f6-446d6e738fe0" ;
       ns0:bar_eid   "8909756956994228352" .

General:

r1 r2 r3
s1 a S
    s r1
    p r2
    o r3
    e1 v1
    e2 v2
    e3 v3
    e4 v4
    e5 v5

For one relationship instance there are 12 constant elements and 10 variable elements (2 * E, with E = 5 edge properties here), hence (12 + (2 * E)) * R elements for R relationships.

RDR

Actual (indents from RDF2RDFStar intact):

ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 .
<<ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0>> ns0:foo_uid "76215f08-f66d-40fb-a549-e28ef2aebc9a" ;
                                                                                                                   ns0:foo_eid "3721239" ;
                                                                                                                   ns0:title "Monkey" ;
                                                                                                                   ns0:bar_uid "28065e37-6f7b-4d01-96f6-446d6e738fe0" ;
                                                                                                                   ns0:bar_eid "8909756956994228352" .

General:

r1 r2 r3
<<r1 r2 r3>> e1 v1
             e2 v2
             e3 v3
             e4 v4
             e5 v5

For one relationship instance, if we count << and >> as separate elements, there are 8 constant elements and 10 variable elements (2 * E, with E = 5 edge properties here), hence (8 + (2 * E)) * R elements for R relationships.
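To make the two element-count formulas concrete, here is a small sketch that evaluates them for the example (E = 5 edge properties, R = 2502646 relationships). The function names are hypothetical, and the counts are elements (terms), not bytes.

```python
def reified_elements(e: int, r: int) -> int:
    """Element count for the reified form: (12 + 2 * E) * R."""
    return (12 + 2 * e) * r


def star_elements(e: int, r: int) -> int:
    """Element count for the Star form, counting << and >>: (8 + 2 * E) * R."""
    return (8 + 2 * e) * r


E = 5          # edge properties populated in the sample row
R = 2_502_646  # relationship instances (CSV rows minus the header)

reified = reified_elements(E, R)
star = star_elements(E, R)
print(f"reified: {reified} elements")
print(f"star:    {star} elements")
print(f"star/reified: {star / reified:.2f}")
```

By element count alone, Star needs 18 elements per relationship against 22 for Reified when E = 5, and the gap narrows as E grows because the 2 * E term is shared.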

Conclusion

Star is clearly supposed to be more compact than Reified in general: the constant number of triple/statement elements per relationship in our example analysis is 12 for Reified vs. 8 for Star. So it doesn't make sense to us how the Star file could ever be bigger.

However, I'm far from a mathematician, and there may be many other variables I have not considered. Bottom line: we're stumped!
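One variable the element count does not capture is the byte length of each serialized record, including any leading whitespace on continuation lines. The sketch below rebuilds the single sample row from this issue in both forms and measures it in bytes. It rests on assumptions: the Star continuation lines are padded to align under the first predicate (as the sample suggests), the reified continuation lines carry a 7-space indent (as in the sample), and all helper names are hypothetical. It measures one row only and says nothing definitive about the full dataset.

```python
FOO = "ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a"
BAR = "ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0"
PRED = "ns0:somePred"
TRIPLE = f"{FOO} {PRED} {BAR} .\n"

# The five populated edge properties of the sample row.
EDGE_PROPS = [
    'ns0:foo_uid "76215f08-f66d-40fb-a549-e28ef2aebc9a"',
    'ns0:foo_eid "3721239"',
    'ns0:title "Monkey"',
    'ns0:bar_uid "28065e37-6f7b-4d01-96f6-446d6e738fe0"',
    'ns0:bar_eid "8909756956994228352"',
]


def render(subject, pairs, indent):
    """Write one subject with ';'-separated predicate-object pairs."""
    lines = [f"{subject} {pairs[0]}"]
    lines += [" " * indent + pair for pair in pairs[1:]]
    return " ;\n".join(lines) + " .\n"


def reified_record(indent):
    pairs = [
        "rdf:type rdf:Statement",
        f"rdf:subject {FOO}",
        f"rdf:predicate {PRED}",
        f"rdf:object {BAR}",
    ] + EDGE_PROPS
    return TRIPLE + render("_:b0", pairs, indent)


def star_record(indent=None):
    quoted = f"<<{FOO} {PRED} {BAR}>>"
    if indent is None:
        # Assumption: continuation lines align under the first predicate.
        indent = len(quoted) + 1
    return TRIPLE + render(quoted, EDGE_PROPS, indent)


def size(text):
    """Size in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


print("reified, no indent:     ", size(reified_record(0)))
print("reified, 7-space indent:", size(reified_record(7)))
print("star, no indent:        ", size(star_record(0)))
print("star, aligned indent:   ", size(star_record()))
```

For this one row, the Star record is smaller than the reified record when continuation lines are not indented, and larger when they are aligned under the first predicate. If the real files are padded the same way, indentation may account for much of the size difference; re-serializing without alignment would be one way to check.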
