Replies: 8 comments
-
Update: R can read files from a zip archive via the unz() function.
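For comparison, here is a minimal Python sketch of the same idea: reading a tab-separated member straight out of a zip archive without extracting it to disk. It uses only the standard library (`zipfile`, `csv`, `io`). The helper name `read_tsv_from_zip` and the archive and member names in the usage note are made up for illustration.

```python
import csv
import io
import zipfile


def read_tsv_from_zip(zip_path, member):
    """Return the rows of a tab-separated archive member as lists of strings.

    The member is decompressed on the fly; nothing is extracted to disk.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # archive.open() yields bytes; wrap it so csv sees decoded text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.reader(text, delimiter="\t"))
```

Usage would look like `rows = read_tsv_from_zip("pairs.zip", "pairs.tsv")`. Note that this loads the whole member into memory, so for large pair tables one would iterate over the reader instead of calling `list()`.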
-
New example: pod5 from Nanopore, for storing raw sequencing data, with several dataframes per file: https://pod5-file-format.readthedocs.io/en/latest/SPECIFICATION.html https://github.com/nanoporetech/pod5-file-format. In pod5, they "glue" several Arrow tables (the same Arrow representation as in memory, the so-called Feather v2 format, which is not as well suited for long-term storage as Parquet) into a custom "container": https://pod5-file-format.readthedocs.io/en/latest/SPECIFICATION.html#combined-file-layout
-
Huh, looks very interesting! Do you have any idea why they didn't go with Apache Parquet + a container instead?
-
Some interesting observations from Twitter: https://twitter.com/Hasindu2008/status/1619914433438040065
-
Not sure, but isn't Parquet much more complex? Also, maybe the Arrow-based one is still faster? There is also this: https://youtu.be/nrXoZ3NTmnU?si=aUCniTgr2bm1br0X
-
Another slightly relevant thing: "Delta Lake"
-
Huh, never heard of Delta Lake! Some of its advantages could be useful to us, e.g. deleting columns.
-
Btw, one of the key issues has always been the lack of an out-of-core merge sort for Parquet files (currently, we use Unix's sort on our .tsv pairs). Two potential solutions have emerged meanwhile:
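To make "out-of-core merge sort" concrete, here is a minimal standard-library sketch of the technique on plain .tsv lines. It is not either of the two solutions referred to above, and it does not touch Parquet. It only illustrates the two-pass structure: sort chunks that fit in memory, spill them to disk, then stream a k-way merge. The function names, the column layout (chrom1, pos1, chrom2, pos2 as the first four fields) and the sort order are assumptions made for illustration.

```python
import heapq
import itertools
import os
import tempfile


def pair_key(line):
    # Assumed layout (hypothetical): chrom1, pos1, chrom2, pos2 as the
    # first four tab-separated fields. Chromosomes compare as strings.
    fields = line.rstrip("\n").split("\t")
    return (fields[0], fields[2], int(fields[1]), int(fields[3]))


def external_sort(in_path, out_path, chunk_lines=1_000_000, key=pair_key):
    """Sort a text file that may not fit in memory.

    Assumes every line, including the last one, ends with a newline.
    """
    chunk_paths = []
    try:
        # Pass 1: read bounded chunks, sort each in memory, spill to disk.
        with open(in_path) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                chunk.sort(key=key)
                fd, path = tempfile.mkstemp(suffix=".chunk.tsv")
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Pass 2: k-way merge of the sorted chunks. heapq.merge is lazy,
        # so only one pending line per chunk is held in memory.
        files = [open(p) for p in chunk_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*files, key=key))
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

Memory use is bounded by `chunk_lines` in the first pass and by the number of chunks in the second. A binary container would presumably need the same structure with row groups or record batches in place of text chunks, but that is not shown here.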
-
Issue: storing Hi-C contacts in a gzipped .tsv causes major slowdowns for some computations. We need to pick a binary container and write software for common operations.
.tsv/.csv:
Cons:
Pros:
The alternative is to store pair tables in existing binary container files. The two options are:
HDF5:
Pros:
Cons:
Parquet:
Pros:
Cons:
Personally, I'm not happy with either of the solutions. Thoughts?...