Much larger RDR file sizes versus reified format

# The Problem

We have CSV datasets which we are converting to RDF/Turtle using TARQL. When we got to our larger data files, we were shocked by how large the output files were.

We expected large files for the Reified datasets. We also expected Star to be smaller, so we were thrown off to find the Star files _larger_ than the Reified ones.

No. of lines:

```
  32534402 ./foobar.reif.ttl
  15015877 ./foobar.ttls
   2502647 ./foobar.csv
```

That's 2.5M CSV vs. 15M RDFStar vs. 33M Reified. Reified is 13x CSV and 2x RDFStar, so we can consider this a 2x _reduction_ from Reified to RDFStar. Fair enough and expected.

And now the sizes:

```
2.1G ./foobar.ttls
1.4G ./foobar.reif.ttl
287M ./foobar.csv
```

That's 0.3G CSV vs. 1.4G Reified vs. 2.1G RDFStar. Reified is about 5x CSV, but RDFStar is about 7x CSV and 1.5x Reified -- a net _increase_ from Reified to RDFStar!
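To double-check the ratios, here is a minimal sketch that recomputes them from the two listings above and also derives an average bytes-per-line figure. It assumes the `G` and `M` suffixes are binary units (1G = 1024M) and works from the rounded sizes shown, so the results are approximate. The helper names `ratio` and `bytes_per_line` are ours.

```python
# Figures copied from the listings above; sizes are the rounded values shown.
line_counts = {"csv": 2_502_647, "star": 15_015_877, "reified": 32_534_402}
# Assumption: "G" and "M" are binary units (1G = 1024M).
sizes_mib = {"csv": 287.0, "star": 2.1 * 1024, "reified": 1.4 * 1024}


def ratio(values, a, b):
    """Return how many times larger values[a] is than values[b]."""
    return values[a] / values[b]


def bytes_per_line(name):
    """Average bytes per line, from the rounded size and the exact line count."""
    return sizes_mib[name] * 1024 * 1024 / line_counts[name]


for name in ("reified", "star"):
    print(
        f"{name}: {ratio(line_counts, name, 'csv'):.1f}x CSV lines, "
        f"{ratio(sizes_mib, name, 'csv'):.1f}x CSV size, "
        f"{bytes_per_line(name):.0f} bytes/line"
    )
print(f"star vs reified size: {ratio(sizes_mib, 'star', 'reified'):.2f}x")
```

Under those assumptions the Star file averages roughly 150 bytes per line against roughly 46 for Reified, so the extra size sits in the length of each line, not in the number of lines.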

# The Data

## CSV

Every row in this example is a relationship instance and every column is a property on that relationship (i.e. an edge property).

```
foo_uid,foo_eid,start_date,end_date,title,a_type,b_type,compensation,source_entity_id,source_id,bar_uid,bar_eid,some_ts
76215f08-f66d-40fb-a549-e28ef2aebc9a,3721239,,,Monkey,,,,,,28065e37-6f7b-4d01-96f6-446d6e738fe0,8909756956994228352,
```

Note: From our line/row counts, we know this file represents 2502646 (2.5M) resources aka objects (the relationship instances together with their attributes/properties).
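As a quick check on the sample row, this sketch parses the header and row from the excerpt above with Python's standard `csv` module and lists which columns actually carry a value. Only those columns show up as properties in the sample output further down.

```python
import csv
import io

# Header and sample row copied from the CSV excerpt above.
sample = (
    "foo_uid,foo_eid,start_date,end_date,title,a_type,b_type,compensation,"
    "source_entity_id,source_id,bar_uid,bar_eid,some_ts\n"
    "76215f08-f66d-40fb-a549-e28ef2aebc9a,3721239,,,Monkey,,,,,,"
    "28065e37-6f7b-4d01-96f6-446d6e738fe0,8909756956994228352,\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
# Keep only the columns whose value is non-empty in the sample row.
populated = [name for name, value in rows[0].items() if value]
print(len(rows[0]), "columns,", len(populated), "populated:", populated)
```

For this row, 5 of the 13 columns are populated, which is where the 5 edge properties per relationship in the analysis below come from. Other rows may of course populate more or fewer columns.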

## Reified

Actual (indents from TARQL removed):

```
ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 .

_:b0 rdf:type rdf:Statement ;
       rdf:subject ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ;
       rdf:predicate ns0:somePred ;
       rdf:object ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 ;
       ns0:foo_uid "76215f08-f66d-40fb-a549-e28ef2aebc9a" ;
       ns0:foo_eid "3721239" ;
       ns0:title "Monkey" ;
       ns0:bar_uid   "28065e37-6f7b-4d01-96f6-446d6e738fe0" ;
       ns0:bar_eid   "8909756956994228352" .
```

General:

```
r1 r2 r3
s1 a S
    s r1
    p r2
    o r3
    e1 v1
    e2 v2
    e3 v3
    e4 v4
    e5 v5
```

For each relationship instance here there are 12 constant elements and 10 variable elements (2 per edge property, with `E` = 5 properties in this example), hence `(12 + (2 * E)) * R` elements for `R` relationships.

## RDR

Actual (indents from RDF2RDFStar intact):

```
ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0 .
<<ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a ns0:somePred ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0>> ns0:foo_uid "76215f08-f66d-40fb-a549-e28ef2aebc9a" ;
                                                                                                                   ns0:foo_eid "3721239" ;
                                                                                                                   ns0:title "Monkey" ;
                                                                                                                   ns0:bar_uid "28065e37-6f7b-4d01-96f6-446d6e738fe0" ;
                                                                                                                   ns0:bar_eid "8909756956994228352" .
```

General:

```
r1 r2 r3
<<r1 r2 r3>> e1 v1
             e2 v2
             e3 v3
             e4 v4
             e5 v5
```

For each relationship instance here, if we count `<<` and `>>` as separate elements, there are 8 constant elements and 10 variable elements (2 per edge property, with `E` = 5 properties in this example), hence `(8 + (2 * E)) * R` elements for `R` relationships.
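To put numbers on the two formulas, here is a small sketch that evaluates both for this dataset. It assumes every relationship has the same `E` = 5 properties as the sample row, which will not hold exactly for the real file. The function names are ours.

```python
def reified_elements(e, r):
    """Element count for the Reified form: (12 + 2E) * R."""
    return (12 + 2 * e) * r


def star_elements(e, r):
    """Element count for the Star form: (8 + 2E) * R."""
    return (8 + 2 * e) * r


R = 2_502_646  # relationship instances (CSV rows minus the header)
E = 5          # assumption: every row has the 5 properties of the sample row

reified = reified_elements(E, R)
star = star_elements(E, R)
print(f"reified: {reified:,} elements")
print(f"star:    {star:,} elements")
print(f"star / reified: {star / reified:.2f}")
```

Note that this counts elements only. It says nothing about how many bytes each element, or the whitespace between elements, takes up on disk.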

# Conclusion

Star is clearly _supposed to be_ more compact than Reified in general: the constant number of triple/statement elements per relationship in our example analysis is 12 for Reified vs. 8 for Star. So it doesn't make sense to us how the Star file can end up bigger in any situation.

However, I'm far from a mathematician and there are lots of other variables I may not have considered. Bottom line: we're stumped!
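One variable the element count leaves out is bytes per element, including whitespace. As a rough way to probe that, the sketch below rebuilds the sample record in both serializations and measures its size in bytes. It is a sketch under assumptions, not a diagnosis of the real files: it assumes the Star continuation lines are padded to the width of the quoted triple (as the sample output suggests), that the Reified lines use a 7-space indent, and that this one record is representative. All function names are ours.

```python
SUBJECT = "ns1:foo.76215f08-f66d-40fb-a549-e28ef2aebc9a"
PREDICATE = "ns0:somePred"
OBJECT = "ns1:bar.28065e37-6f7b-4d01-96f6-446d6e738fe0"
PROPS = [
    ("ns0:foo_uid", '"76215f08-f66d-40fb-a549-e28ef2aebc9a"'),
    ("ns0:foo_eid", '"3721239"'),
    ("ns0:title", '"Monkey"'),
    ("ns0:bar_uid", '"28065e37-6f7b-4d01-96f6-446d6e738fe0"'),
    ("ns0:bar_eid", '"8909756956994228352"'),
]
TRIPLE = f"{SUBJECT} {PREDICATE} {OBJECT}"


def join_statements(first_prefix, pad, pairs):
    """Lay out predicate-object pairs one per line: ';' between, '.' at the end."""
    lines = []
    for i, (pred, value) in enumerate(pairs):
        prefix = first_prefix if i == 0 else pad
        end = " ." if i == len(pairs) - 1 else " ;"
        lines.append(f"{prefix}{pred} {value}{end}")
    return lines


def star_record(indent=None):
    """Star form; indent=None pads continuation lines to the quoted-triple width."""
    quoted = f"<<{TRIPLE}>> "
    pad = " " * (len(quoted) if indent is None else indent)
    return "\n".join([f"{TRIPLE} ."] + join_statements(quoted, pad, PROPS)) + "\n"


def reified_record(indent=7):
    """Reified form with a fixed indent (assumed 7 spaces) on continuation lines."""
    pairs = [
        ("rdf:type", "rdf:Statement"),
        ("rdf:subject", SUBJECT),
        ("rdf:predicate", PREDICATE),
        ("rdf:object", OBJECT),
    ] + PROPS
    body = join_statements("_:b0 ", " " * indent, pairs)
    return "\n".join([f"{TRIPLE} .", ""] + body) + "\n"


def size(text):
    """Size of the serialized record in bytes (UTF-8)."""
    return len(text.encode("utf-8"))


print("reified, 7-space indent:", size(reified_record()))
print("star, aligned indent:   ", size(star_record()))
print("star, 4-space indent:   ", size(star_record(indent=4)))
```

On this one record, and with the padding assumed above, the aligned Star form comes out larger than the Reified form, while the same Star record with a short indent comes out smaller. That would be consistent with the file sizes we saw, but the real files would need to be measured directly before drawing any conclusion.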
