Utilising Block Blob Staging for Parquet #595
Unanswered
NickEFallows asked this question in Q&A
Replies: 1 comment
Hi @NickEFallows, I'm not very familiar with Azure Blob Storage, but if I've understood this correctly, you want a way to write data to separate blocks, and then also write the Parquet footer to a final block, before doing a commit to convert the list of blocks into a blob. ParquetSharp doesn't expose a way to create separate output streams for writing different parts of the file, but files are written sequentially, so you could probably do something like this by implementing your own subclass of `System.IO.Stream` that stages a block whenever enough data has been written and commits the block list when the file is closed.
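A rough, untested sketch of that idea: a write-only `System.IO.Stream` that buffers whatever ParquetSharp writes, stages a block each time the buffer fills, and commits the block list when it is disposed, so the footer written by `ParquetFileWriter.Close()` simply lands in the final staged block. The class name, block size, buffer strategy and wiring are all made up; `StageBlock` and `CommitBlockList` are the standard `BlockBlobClient` calls, and this assumes ParquetSharp only writes forward through the stream (as described above) and only needs the current position from it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Azure.Storage.Blobs.Specialized;
using ParquetSharp;
using ParquetSharp.IO;

// Hypothetical wiring; connection string, blob names and columns are placeholders.
var blockBlob = new BlockBlobClient(
    Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"),
    "exports", "data.parquet");
var columns = new Column[] { new Column<long>("Id"), new Column<string>("Payload") };

using (var staging = new BlockStagingStream(blockBlob))
using (var output = new ManagedOutputStream(staging))
using (var writer = new ParquetFileWriter(output, columns))
{
    // ... AppendRowGroup() / WriteBatch() for each batch read from SQL ...
    writer.Close();  // flushes the Parquet footer into the staging buffer
}
// Disposing the staging stream stages the final block (footer included)
// and commits the block list, turning the staged blocks into the blob.

// A write-only stream that buffers sequential Parquet output and stages
// an Azure block whenever the buffer fills.
public sealed class BlockStagingStream : Stream
{
    private readonly BlockBlobClient _blob;
    private readonly int _blockSize;
    private readonly List<string> _blockIds = new();
    private readonly MemoryStream _buffer = new();
    private long _position;
    private bool _committed;

    public BlockStagingStream(BlockBlobClient blob, int blockSize = 32 * 1024 * 1024)
    {
        _blob = blob;
        _blockSize = blockSize;
    }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => _position;

    // ParquetSharp only needs to know how far it has written, not to seek.
    public override long Position
    {
        get => _position;
        set => throw new NotSupportedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        _buffer.Write(buffer, offset, count);
        _position += count;
        if (_buffer.Length >= _blockSize)
            StageBufferedBlock();
    }

    // No-op: blocks are staged only when the buffer fills or on dispose,
    // so the footer always ends up in the final block.
    public override void Flush() { }

    protected override void Dispose(bool disposing)
    {
        if (disposing && !_committed)
        {
            if (_buffer.Length > 0)
                StageBufferedBlock();          // includes the Parquet footer
            _blob.CommitBlockList(_blockIds);  // assemble blocks into the blob
            _buffer.Dispose();
            _committed = true;
        }
        base.Dispose(disposing);
    }

    private void StageBufferedBlock()
    {
        // Block IDs must be base64 strings of equal length.
        var blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{_blockIds.Count:d6}"));
        _buffer.Position = 0;
        _blob.StageBlock(blockId, _buffer);
        _blockIds.Add(blockId);
        _buffer.SetLength(0);
    }

    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

The disposal order is the important detail: the `ParquetFileWriter` has to be closed before the staging stream is disposed, so that the footer bytes are already in the buffer when the final block is staged and the block list is committed.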
Original post from NickEFallows:
I am currently doing a proof of concept where I have a very large amount of data (65 million plus rows) stored in SQL, and I am exporting it into Azure Blob Storage. The initial approach used CSV, which is very fast but produces a large file that isn't optimal for transporting to other systems. I would like to use Parquet instead, as the file size is far smaller.
Using this library I have managed to export the data by opening a stream, creating a writer, and batching the data into row groups. However, this takes around 5 hours to complete, compared to the 50 minutes it takes with CSV :(
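For reference, a minimal sketch of that kind of export loop with ParquetSharp might look like this. The column schema, batch size, blob names and the `LoadBatchesFromSql` helper are placeholders, not from the original post, and it assumes the stream returned by `OpenWrite` can report its position to `ManagedOutputStream`.

```csharp
using System;
using System.Collections.Generic;
using Azure.Storage.Blobs.Specialized;
using ParquetSharp;
using ParquetSharp.IO;

// Illustrative schema only; the real export would mirror the SQL table.
var columns = new Column[]
{
    new Column<long>("Id"),
    new Column<string>("Payload"),
};

var blockBlob = new BlockBlobClient(
    Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"),
    "exports", "data.parquet");

// Everything goes through a single write stream to the blob; each batch of rows
// becomes one Parquet row group.
using (var blobStream = blockBlob.OpenWrite(overwrite: true))
using (var output = new ManagedOutputStream(blobStream))
using (var fileWriter = new ParquetFileWriter(output, columns))
{
    foreach (var (ids, payloads) in LoadBatchesFromSql(batchSize: 100_000))
    {
        using var rowGroup = fileWriter.AppendRowGroup();
        using (var idWriter = rowGroup.NextColumn().LogicalWriter<long>())
            idWriter.WriteBatch(ids);
        using (var payloadWriter = rowGroup.NextColumn().LogicalWriter<string>())
            payloadWriter.WriteBatch(payloads);
    }

    fileWriter.Close();  // writes the Parquet footer and finishes the file
}

// Placeholder for the SQL paging logic; yields one batch of column arrays at a time.
static IEnumerable<(long[] Ids, string[] Payloads)> LoadBatchesFromSql(int batchSize)
{
    yield break;
}
```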
I have tried to implement the same process that was used for the CSV export, where the batches of data are staged as blocks of a block blob and committed at the end of the process. However, even though the file is exported and contains data, it cannot be read because it is missing the Parquet footer.
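For comparison, the block-staging pattern used for the CSV export is roughly the following; the container, blob name and `LoadCsvBatches` helper are placeholders, while `StageBlock` and `CommitBlockList` are the standard `BlockBlobClient` calls.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Azure.Storage.Blobs.Specialized;

var blockBlob = new BlockBlobClient(
    Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"),
    "exports", "data.csv");
var blockIds = new List<string>();

// Each batch is staged as its own block; nothing is visible in the blob until
// the block list is committed at the end.
foreach (var batchBytes in LoadCsvBatches())
{
    var blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{blockIds.Count:d6}"));
    using var content = new MemoryStream(batchBytes);
    blockBlob.StageBlock(blockId, content);
    blockIds.Add(blockId);
}

blockBlob.CommitBlockList(blockIds);  // assembles the staged blocks into the final blob

// Staging raw Parquet row-group bytes the same way and committing them is not enough
// on its own: a readable Parquet file also needs the footer (schema and row-group
// metadata) written after the last row group, which is why the committed blob
// cannot be opened as Parquet.

// Placeholder for the SQL-to-CSV batching logic.
static IEnumerable<byte[]> LoadCsvBatches()
{
    yield break;
}
```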
My current implementation writes everything through a single stream, much as described above. However, I would like to do something like this instead: stage each batch of Parquet data as a block, write the footer at the end, and then commit the block list to produce the final blob. Does anyone know if this is possible?