Utilising Block Blob Staging for Parquet #595
Unanswered
NickEFallows asked this question in Q&A
Replies: 1 comment
Hi @NickEFallows, I'm not very familiar with Azure Blob Storage, but if I've understood this correctly, you want a way to write data to separate blocks, and then also write the Parquet footer to a final block, before doing a commit to convert the list of blocks into a blob. ParquetSharp doesn't expose a way to create separate output streams for writing different parts of the file, but files are written sequentially, so you could probably do something like this by implementing your own subclass of `System.IO.Stream` that stages a block whenever enough data has been written and commits the block list when the file is closed.
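A rough, untested sketch of that idea: a write-only `System.IO.Stream` that buffers whatever ParquetSharp writes, stages a block each time the buffer fills, and commits the block list when it is disposed, so the footer written by `ParquetFileWriter.Close()` simply lands in the final staged block. The class name, block size, buffer strategy and wiring are all made up; `StageBlock` and `CommitBlockList` are the standard `BlockBlobClient` calls, and this assumes ParquetSharp only writes forward through the stream (as described above) and only needs the current position from it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Azure.Storage.Blobs.Specialized;
using ParquetSharp;
using ParquetSharp.IO;

// Hypothetical wiring; connection string, blob names and columns are placeholders.
var blockBlob = new BlockBlobClient(
    Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"),
    "exports", "data.parquet");
var columns = new Column[] { new Column<long>("Id"), new Column<string>("Payload") };

using (var staging = new BlockStagingStream(blockBlob))
using (var output = new ManagedOutputStream(staging))
using (var writer = new ParquetFileWriter(output, columns))
{
    // ... AppendRowGroup() / WriteBatch() for each batch read from SQL ...
    writer.Close();  // flushes the Parquet footer into the staging buffer
}
// Disposing the staging stream stages the final block (footer included)
// and commits the block list, turning the staged blocks into the blob.

// A write-only stream that buffers sequential Parquet output and stages
// an Azure block whenever the buffer fills.
public sealed class BlockStagingStream : Stream
{
    private readonly BlockBlobClient _blob;
    private readonly int _blockSize;
    private readonly List<string> _blockIds = new();
    private readonly MemoryStream _buffer = new();
    private long _position;
    private bool _committed;

    public BlockStagingStream(BlockBlobClient blob, int blockSize = 32 * 1024 * 1024)
    {
        _blob = blob;
        _blockSize = blockSize;
    }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => _position;

    // ParquetSharp only needs to know how far it has written, not to seek.
    public override long Position
    {
        get => _position;
        set => throw new NotSupportedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        _buffer.Write(buffer, offset, count);
        _position += count;
        if (_buffer.Length >= _blockSize)
            StageBufferedBlock();
    }

    // No-op: blocks are staged only when the buffer fills or on dispose,
    // so the footer always ends up in the final block.
    public override void Flush() { }

    protected override void Dispose(bool disposing)
    {
        if (disposing && !_committed)
        {
            if (_buffer.Length > 0)
                StageBufferedBlock();          // includes the Parquet footer
            _blob.CommitBlockList(_blockIds);  // assemble blocks into the blob
            _buffer.Dispose();
            _committed = true;
        }
        base.Dispose(disposing);
    }

    private void StageBufferedBlock()
    {
        // Block IDs must be base64 strings of equal length.
        var blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{_blockIds.Count:d6}"));
        _buffer.Position = 0;
        _blob.StageBlock(blockId, _buffer);
        _blockIds.Add(blockId);
        _buffer.SetLength(0);
    }

    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

The disposal order is the important detail: the `ParquetFileWriter` has to be closed before the staging stream is disposed, so that the footer bytes are already in the buffer when the final block is staged and the block list is committed.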
Original post from NickEFallows:
I am currently doing a proof of concept where I have a very large amount of data (65 million plus rows) stored in SQL, and I am exporting it into Azure Blob Storage. The initial approach used CSV, which is very fast but produces a large file that isn't optimal for transporting to other systems. I would like to use Parquet instead, as the file size is far smaller.
Using this library I have managed to export the data by opening a stream, creating a writer, and batching the data into row groups. However, this takes around 5 hours to complete, compared to the 50 minutes it takes with CSV :(
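For reference, a minimal sketch of that kind of export loop with ParquetSharp might look like this. The column schema, batch size, blob names and the `LoadBatchesFromSql` helper are placeholders, not from the original post, and it assumes the stream returned by `OpenWrite` can report its position to `ManagedOutputStream`.

```csharp
using System;
using System.Collections.Generic;
using Azure.Storage.Blobs.Specialized;
using ParquetSharp;
using ParquetSharp.IO;

// Illustrative schema only; the real export would mirror the SQL table.
var columns = new Column[]
{
    new Column<long>("Id"),
    new Column<string>("Payload"),
};

var blockBlob = new BlockBlobClient(
    Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"),
    "exports", "data.parquet");

// Everything goes through a single write stream to the blob; each batch of rows
// becomes one Parquet row group.
using (var blobStream = blockBlob.OpenWrite(overwrite: true))
using (var output = new ManagedOutputStream(blobStream))
using (var fileWriter = new ParquetFileWriter(output, columns))
{
    foreach (var (ids, payloads) in LoadBatchesFromSql(batchSize: 100_000))
    {
        using var rowGroup = fileWriter.AppendRowGroup();
        using (var idWriter = rowGroup.NextColumn().LogicalWriter<long>())
            idWriter.WriteBatch(ids);
        using (var payloadWriter = rowGroup.NextColumn().LogicalWriter<string>())
            payloadWriter.WriteBatch(payloads);
    }

    fileWriter.Close();  // writes the Parquet footer and finishes the file
}

// Placeholder for the SQL paging logic; yields one batch of column arrays at a time.
static IEnumerable<(long[] Ids, string[] Payloads)> LoadBatchesFromSql(int batchSize)
{
    yield break;
}
```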
I have tried to implement the same process that was used for the CSV export, where the batches of data are staged as blocks of a block blob and committed at the end of the process. However, even though the file is exported and contains data, it cannot be read because it is missing the Parquet footer.
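For comparison, the block-staging pattern used for the CSV export is roughly the following; the container, blob name and `LoadCsvBatches` helper are placeholders, while `StageBlock` and `CommitBlockList` are the standard `BlockBlobClient` calls.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Azure.Storage.Blobs.Specialized;

var blockBlob = new BlockBlobClient(
    Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"),
    "exports", "data.csv");
var blockIds = new List<string>();

// Each batch is staged as its own block; nothing is visible in the blob until
// the block list is committed at the end.
foreach (var batchBytes in LoadCsvBatches())
{
    var blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{blockIds.Count:d6}"));
    using var content = new MemoryStream(batchBytes);
    blockBlob.StageBlock(blockId, content);
    blockIds.Add(blockId);
}

blockBlob.CommitBlockList(blockIds);  // assembles the staged blocks into the final blob

// Staging raw Parquet row-group bytes the same way and committing them is not enough
// on its own: a readable Parquet file also needs the footer (schema and row-group
// metadata) written after the last row group, which is why the committed blob
// cannot be opened as Parquet.

// Placeholder for the SQL-to-CSV batching logic.
static IEnumerable<byte[]> LoadCsvBatches()
{
    yield break;
}
```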
My current implementation writes everything through a single stream, much as described above. However, I would like to do something like this instead: stage each batch of Parquet data as a block, write the footer at the end, and then commit the block list to produce the final blob. Does anyone know if this is possible?