Much larger RDR file sizes versus reified format #26
The Problem
We have CSV datasets which we are converting to RDF/Turtle using TARQL. When we got to our larger data files, we were shocked by how large the output files were.
We expected this for the Reified datasets, but we expected Star to produce smaller files, so we were thrown off to find the Star files larger than the Reified ones.
No. of lines:
32534402 ./foobar.reif.ttl
15015877 ./foobar.ttls
2502647 ./foobar.csv
That's 2.5M CSV vs. 15M RDFStar vs. 33M Reified. Reified is 13x CSV and about 2.2x RDFStar, so we can treat this as a roughly 2x reduction from Reified to RDFStar. Fair enough, and expected.
And now the sizes:
2.1G ./foobar.ttls
1.4G ./foobar.reif.ttl
287M ./foobar.csv
That's 0.3G CSV vs. 1.4G Reified vs. 2.1G RDFStar. Reified is about 5x CSV and RDFStar about 7x CSV, so RDFStar is now 1.5x Reified: a net increase from Reified to RDFStar!
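To make the arithmetic above easy to re-check, here is a minimal Python sketch that recomputes the ratios and the average bytes per line from the two listings. The helper names `ratio` and `bytes_per_line` are made up for illustration, and reading "G"/"M" as binary units is an assumption; the sizes are rounded, so treat the results as approximate.

```python
# Figures copied from the two listings above.
line_counts = {"csv": 2_502_647, "star": 15_015_877, "reified": 32_534_402}

# Assumption: "G" and "M" are binary units (GiB / MiB), as `ls -lh` or `du -h` print them.
MIB = 1024 ** 2
sizes = {"csv": 287 * MIB, "star": 2.1 * 1024 * MIB, "reified": 1.4 * 1024 * MIB}


def ratio(table, a, b):
    """How many times larger table[a] is than table[b]."""
    return table[a] / table[b]


def bytes_per_line(name):
    """Average line length in bytes (newline included) for one file."""
    return sizes[name] / line_counts[name]


print(f"lines  reified/csv : {ratio(line_counts, 'reified', 'csv'):.2f}")
print(f"lines  reified/star: {ratio(line_counts, 'reified', 'star'):.2f}")
print(f"bytes  reified/csv : {ratio(sizes, 'reified', 'csv'):.2f}")
print(f"bytes  star/reified: {ratio(sizes, 'star', 'reified'):.2f}")
for name in ("csv", "star", "reified"):
    print(f"{name:8s} {bytes_per_line(name):7.1f} bytes/line")
```

On these figures the average Star line is roughly three times as long as the average Reified line, which is where the size difference has to come from given that Star has fewer lines.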
The Data
CSV
Every row in this example is one relationship instance, and every column is a property on that relationship (i.e. an edge property).
foo_uid,foo_eid,start_date,end_date,title,a_type,b_type,compensation,source_entity_id,source_id,bar_uid,bar_eid,some_ts
76215f08-f66d-40fb-a549-e28ef2aebc9a,3721239,,,Monkey,,,,,,28065e37-6f7b-4d01-96f6-446d6e738fe0,8909756956994228352,
Note: From our line count (2502647 lines, minus one header row), we know this file represents 2502646 (2.5M) resources, aka objects (the relationship instances together with their attributes/properties).
Reified
Actual (indents from TARQL removed):
ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 .
_:b0 rdf:type rdf:Statement ;
rdf:subject ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ;
rdf:predicate ns0:somePred ;
rdf:object ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 ;
ns0:foo_uid "76215f08-f66d-40fb-a549-e28ef2aebc9a" ;
ns0:foo_eid "3721239" ;
ns0:title "Monkey" ;
ns0:bar_uid "28065e37-6f7b-4d01-96f6-446d6e738fe0" ;
ns0:bar_eid "8909756956994228352" .
General:
r1 r2 r3
s1 a S
s r1
p r2
o r3
e1 v1
e2 v2
e3 v3
e4 v4
e5 v5
For a relationship instance here there are 12 constant elements and 10 variable elements (2 * E, with E = 5 edge properties), hence (12 + (2 * E)) * R elements for R relationships.
RDR
Actual (indents from RDF2RDFStar intact):
ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 .
<<ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0>> ns0:foo_uid "76215f08-f66d-40fb-a549-e28ef2aebc9a" ;
ns0:foo_eid "3721239" ;
ns0:title "Monkey" ;
ns0:bar_uid "28065e37-6f7b-4d01-96f6-446d6e738fe0" ;
ns0:bar_eid "8909756956994228352" .
General:
r1 r2 r3
<<r1 r2 r3>> e1 v1
e2 v2
e3 v3
e4 v4
e5 v5
For a relationship instance here, if we count << and >> as separate elements, there are 8 constant elements and 10 variable elements (2 * E, with E = 5 edge properties), hence (8 + (2 * E)) * R elements for R relationships.
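For reference, the two element-count formulas can be evaluated side by side. This is a small Python sketch, assuming E = 5 populated properties per row (as in the sample record) and R = 2502646 rows; the function names are made up for illustration.

```python
def reified_elements(e: int, r: int) -> int:
    """Element count for the Reified pattern: (12 + 2 * E) * R."""
    return (12 + 2 * e) * r


def star_elements(e: int, r: int) -> int:
    """Element count for the Star pattern: (8 + 2 * E) * R."""
    return (8 + 2 * e) * r


E = 5          # edge properties populated in the sample record
R = 2_502_646  # relationship instances (CSV rows without the header)

reified = reified_elements(E, R)
star = star_elements(E, R)
print(f"reified: {reified:,} elements")
print(f"star:    {star:,} elements")
print(f"star / reified = {star / reified:.3f}")
```

By this count Star needs 18 elements per relationship against 22 for Reified, so about 18% fewer. Note that the count says nothing about how many bytes each element takes.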
Conclusion
Star is clearly supposed to be more compact than Reified in general: the constant number of triple/statement elements in our example analysis is 12 for Reified vs. 8 for Star, so it doesn't make sense to us that Star could ever come out bigger.
However, I'm far from a mathematician and there are lots of other variables I may not have considered. Bottom line: we're stumped!
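One variable the element count leaves out is the number of bytes per element, including any indentation the serializer writes (the TARQL indents were removed above, while the RDF2RDFStar ones were left in). The Python sketch below is one way to measure that on the sample record. Everything named here (`byte_stats`, `star_record`, the 107-space indent) is hypothetical and for illustration only: 107 is the length of the quoted-triple subject plus one space, which is what a predicate-aligning pretty-printer would emit, and we have not confirmed that RDF2RDFStar indents this way.

```python
SUBJ = "ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a"
OBJ = "ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0"
PROPS = [
    'ns0:foo_uid "76215f08-f66d-40fb-a549-e28ef2aebc9a"',
    'ns0:foo_eid "3721239"',
    'ns0:title "Monkey"',
    'ns0:bar_uid "28065e37-6f7b-4d01-96f6-446d6e738fe0"',
    'ns0:bar_eid "8909756956994228352"',
]


def byte_stats(text: str) -> tuple[int, int]:
    """Return (total_bytes, leading_whitespace_bytes) for a chunk of Turtle text."""
    total = len(text.encode("utf-8"))
    leading = sum(len(line) - len(line.lstrip(" \t")) for line in text.splitlines())
    return total, leading


def star_record(indent: str) -> str:
    """Rebuild the sample Star record, prefixing continuation lines with `indent`."""
    quoted = f"<<{SUBJ} ns0:somePred {OBJ}>>"
    lines = [f"{SUBJ} ns0:somePred {OBJ} .", f"{quoted} {PROPS[0]} ;"]
    for i, prop in enumerate(PROPS[1:], start=1):
        terminator = "." if i == len(PROPS) - 1 else ";"
        lines.append(f"{indent}{prop} {terminator}")
    return "\n".join(lines) + "\n"


# Hypothetical indent width: align predicates under the first one.
ALIGNED = " " * (len(f"<<{SUBJ} ns0:somePred {OBJ}>>") + 1)

for label, indent in (("no indent", ""), ("aligned", ALIGNED)):
    total, leading = byte_stats(star_record(indent))
    print(f"{label:10s} total={total} bytes, leading whitespace={leading} bytes")
```

Running `byte_stats` over the first few thousand lines of the real foobar.ttls and foobar.reif.ttl would show how much of each file is leading whitespace rather than data.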