[DO NOT MERGE] fix(csharp/databricks): Reduce LZ4 decompression memory by 96% #3654
+276 −5
This PR reduces accumulated memory allocations (total allocations over time) but may not significantly reduce peak concurrent memory usage; that tradeoff requires discussion before merging.
## Summary
Reduces LZ4 internal buffer memory allocation from ~900MB to ~40MB (96% reduction) for large Databricks query results by implementing a custom ArrayPool that supports buffer sizes larger than .NET's default 1MB limit.
Important: This optimization primarily reduces:
But does NOT significantly reduce:
- With `parallelDownloads=1`, peak is still ~8-16MB (1-2 buffers in use)

## Problem
- Decompression buffers must match the frame's `maxBlockSize`, but .NET's `ArrayPool<byte>.Shared` has a hardcoded 1MB limit

### Profiler Evidence
## Solution

Created a custom ArrayPool by overriding K4os.Compression.LZ4's buffer allocation methods:

- `CustomLZ4FrameReader` subclasses `StreamLZ4FrameReader` with a custom ArrayPool (4MB max, 10 buffers)
- `CustomLZ4DecoderStream` is used instead of the default `LZ4Stream.Decode()`

### Key Implementation
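As a hedged sketch of the override approach, assuming (per the K4os design notes in this PR) that `StreamLZ4FrameReader` exposes `virtual` `AllocBuffer()`/`ReleaseBuffer()` hooks; the constructor and method signatures shown here are assumptions for illustration, not the PR's actual code:

```csharp
using System.Buffers;
using System.IO;
using K4os.Compression.LZ4.Streams.Frames; // assumed namespace for StreamLZ4FrameReader

// Illustrative only (signatures assumed): route K4os's internal buffer
// allocation through a dedicated pool instead of ArrayPool<byte>.Shared,
// whose 1MB cap forces fresh allocations for every 4MB LZ4 block.
internal sealed class CustomLZ4FrameReader : StreamLZ4FrameReader
{
    // 4MB max array (matches the LZ4 frame maxBlockSize), 10 arrays per bucket.
    private static readonly ArrayPool<byte> Pool =
        ArrayPool<byte>.Create(maxArrayLength: 4 * 1024 * 1024, maxArraysPerBucket: 10);

    public CustomLZ4FrameReader(Stream stream, bool leaveOpen)
        : base(stream, leaveOpen) { }

    // Assumed virtual hooks, per the K4os library design described below.
    protected override byte[] AllocBuffer(int size) => Pool.Rent(size);
    protected override void ReleaseBuffer(byte[] buffer) => Pool.Return(buffer);
}
```

Because the pool is `static`, all decoder streams in the process share the same 10-buffer budget, which is what bounds accumulated allocations across many result batches.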
## Results

### Memory Usage

### Performance

## Why This Works
**K4os Library Design:**

- `LZ4FrameReader` has `virtual` methods `AllocBuffer()` and `ReleaseBuffer()`
- The default path is `BufferPool.Alloc()` → `ArrayPool<byte>.Shared` (1MB limit)

**Buffer Lifecycle:**
- With `parallelDownloads=1` (default), only 1-2 buffers are active at once

## Concurrency Considerations
**Recommendation:** If using `parallel_downloads > 4`, consider increasing `maxArraysPerBucket` in a future enhancement.

## Files Changed
### New Files
- `src/Drivers/Databricks/CustomLZ4FrameReader.cs` (~80 lines)
- `src/Drivers/Databricks/CustomLZ4DecoderStream.cs` (~118 lines)

### Modified Files
- `src/Drivers/Databricks/Lz4Utilities.cs`: use `CustomLZ4DecoderStream`, add telemetry

## Testing
### Validation

## Telemetry
Added an `lz4.decompress_async` activity event:
`{ "compressed_size_bytes": 32768, "actual_size_bytes": 4194304, "buffer_allocated_bytes": 4194304, "compression_ratio": 128.0 }`

## Technical Decisions
### Why Override Instead of Fork?

### Why `ArrayPool.Create()`?
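The custom pool itself needs nothing beyond the BCL: `ArrayPool<byte>.Create(maxArrayLength, maxArraysPerBucket)` builds a dedicated pool whose limits, unlike `ArrayPool<byte>.Shared`, are caller-chosen. A minimal standalone sketch (the 4MB/10 numbers mirror this PR's choices; the class name is illustrative):

```csharp
using System;
using System.Buffers;

class PoolSketch
{
    static void Main()
    {
        // Dedicated pool: arrays up to 4 MB are pooled (beyond Shared's cap),
        // and at most 10 arrays are retained per size bucket.
        ArrayPool<byte> pool = ArrayPool<byte>.Create(
            maxArrayLength: 4 * 1024 * 1024,
            maxArraysPerBucket: 10);

        byte[] buffer = pool.Rent(4 * 1024 * 1024); // may hand back a larger array
        try
        {
            Console.WriteLine(buffer.Length >= 4 * 1024 * 1024); // True
        }
        finally
        {
            // Returned arrays are recycled by subsequent Rent calls, which is
            // what turns repeated per-block allocations into buffer reuse.
            pool.Return(buffer);
        }
    }
}
```

Note that `Rent` may return an array larger than requested, so callers must track the logical length separately from `buffer.Length`.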
### Why 4MB `maxArrayLength`?
- Matches the LZ4 frame's `maxBlockSize`, so the pool's buffer size matches exactly

### Why 10 `maxArraysPerBucket`?
- `parallelDownloads=1` uses 1-2 buffers

## Future Enhancements
- Adapt pool sizing to the `parallel_downloads` config
- Adapt buffer sizing to the frame's `maxBlockSize`

## References