Description
We have Azure Container App Jobs that read JSON files containing security log events from one storage account and then insert the events into Delta tables in another storage account.
These are append blobs containing Advanced Hunting events that are written to every X minutes. We do not have any control over how they are written.
Between invocations we keep track of how much of the file we have processed (the offset) and stream the new bytes into a PyArrow buffer:
```python
import pyarrow as pa

# client is an azure.storage.blob.BlobClient pointing at the append blob

# some offset and length calculation

buffer = pa.allocate_buffer(length)
output = pa.output_stream(buffer)

download = client.download_blob(
    offset=offset,
    length=length,
    progress_hook=progress_callback,
)
bytes_read = download.readinto(output)
```
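For context, the surrounding bookkeeping is roughly of this shape. `load_checkpoint` / `save_checkpoint` are hypothetical stand-ins for however the job persists the processed offset; the real offset/length calculation is more involved:

```python
# Hypothetical shape of the per-invocation bookkeeping, not our exact code.
props = client.get_blob_properties()
offset = load_checkpoint(client.blob_name)   # bytes processed in previous runs
length = props.size - offset                 # bytes appended since the last run
if length > 0:
    # ... download [offset, offset + length) as shown above ...
    save_checkpoint(client.blob_name, offset + length)
```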
If we want to download 2GB and leave `max_single_get_size` at its default value of 32MB, the Python SDK will download the range in multiple 4MB chunks (the default `max_chunk_get_size`). Unfortunately, if the blob is being appended to while we are downloading one of those chunks, the SDK notices that the ETag has changed and throws a `ResourceModifiedError`.
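For clarity, these are the knobs we are talking about; both are set on the client at construction time. The values and names below are illustrative only, not what we run in production:

```python
from azure.storage.blob import BlobClient

# Sketch only: container/blob names, connection_string and the 256 MiB / 64 MiB
# values are placeholders.
client = BlobClient.from_connection_string(
    conn_str=connection_string,             # assumed to come from our config
    container_name="security-logs",         # hypothetical
    blob_name="advanced-hunting.json",      # hypothetical
    max_single_get_size=256 * 1024 * 1024,  # default is 32 MiB
    max_chunk_get_size=64 * 1024 * 1024,    # default is 4 MiB
)
```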
This thread explains in detail what is going on: #30233 (comment).
While this makes sense for a regular blob, why is this behaviour necessary for an append blob? Is there any match condition that would allow us to ignore the ETag change?
The only workarounds I can think of are:

- Increasing `max_single_get_size`. However, even when running in an ACA, large downloads are unstable.
- Doing the chunking ourselves (see the sketch after this list).
Are we doing something wrong here?