Description
I noticed that the GzipDecompressor class in tornado.util decompresses only the first gzip member of a concatenated gzip stream and discards all subsequent members. This causes silent data loss when handling multi-member gzip files.
A gzip file can contain multiple members concatenated together (RFC 1952 defines a gzip file as a series of members), and decompressors should process all of them. Multi-member streams commonly appear in HTTP responses from servers that stream gzip data and in other streaming applications.
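To illustrate what a multi-member stream looks like, here is a minimal sketch that uses only the standard library. The payloads and variable names are made up for the example. It shows that each member carries its own gzip header and that the standard gzip module reads all members, which is the behavior this report expects from GzipDecompressor.

```python
import gzip

# Hypothetical payloads, chosen only for illustration.
data1 = b"first member "
data2 = b"second member"

member1 = gzip.compress(data1)
member2 = gzip.compress(data2)
stream = member1 + member2  # a two-member gzip stream

# Each member is framed independently and starts with the gzip magic bytes.
assert member1[:2] == b"\x1f\x8b"
assert member2[:2] == b"\x1f\x8b"

# The standard library's gzip module decompresses every member in the stream.
combined = gzip.decompress(stream)
print(combined)
```

This says nothing about Tornado itself; it only shows the reference behavior for concatenated members.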
Steps to Reproduce
The following Python script reproduces the issue:
import gzip
from tornado.util import GzipDecompressor
data1 = b"This is some example data that will be compressed using gzip."
data2 = b"Here is some more example data to demonstrate gzip compression."
member1 = gzip.compress(data1)
member2 = gzip.compress(data2)
concatenated = member1 + member2
decompressor = GzipDecompressor()
decompressed_data = decompressor.decompress(concatenated)
expected_data = data1 + data2
print("Concatenated data:", concatenated)
print("Decompressed data:", decompressed_data)
print("Expected data:", expected_data)
assert decompressed_data == expected_data, "Decompressed data does not match expected data"
Output:
Concatenated data: b'\x1f\x8b\x08\x00\xcd\xe5:i\x02\xff\r\xc2\x81\r\x800\x08\x04\xc0U~\x02\xa7q\x01\xb4\x9f\x96\x04,\x11\x8c\xc6\xe9\xf5r\xeb\xd0\xc4?\xa7\x13|\xc4\xc3\x88&%\xa8!\x85[\xcd\xb0\x11\xfb\xf48\x99\xc9\x86+\xf5\xe8\xe8\xaf\xc6\xf2\x01c\xac%\xed=\x00\x00\x00\x1f\x8b\x08\x00\xcd\xe5:i\x02\xff\r\xc7\xc1\r\x800\x0c\x03\xc0U<\x01s\xb0FD\xad*\x12\xa9\xa3&\x0f\xc4\xf4p\xbf;\xb9\t/\x94\x82\x08\xfd\xe1c\x9171\xac\r-\x0c\x86V\xf5\xb6&\xe6\xeb\x89K\x91\x9bU\xaeu|\x90`\xd6\xcd?\x00\x00\x00'
Decompressed data: b'This is some example data that will be compressed using gzip.'
Expected data: b'This is some example data that will be compressed using gzip.Here is some more example data to demonstrate gzip compression.'
Traceback (most recent call last):
  File "/home/dev/my_projects/tornado/./tornado/test/test_util.py_68-105_issue_1.py", line 22, in <module>
    assert decompressed_data == expected_data, "Decompressed data does not match expected data"
AssertionError: Decompressed data does not match expected data
Expected Behavior
GzipDecompressor should decompress all concatenated gzip members and return the complete decompressed data.
Actual Behavior
Only the first gzip member is decompressed. The remaining members are dropped without any error or warning, causing silent data loss.
Root Cause
The implementation delegates to zlib.decompressobj but doesn't handle unused_data. When a gzip member ends, zlib places any following bytes in .unused_data, but GzipDecompressor never checks or processes this attribute, so subsequent members are lost.
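The zlib behavior described above can be observed directly, without Tornado. The sketch below is standalone and uses hypothetical payloads; it assumes the same wbits value (16 + zlib.MAX_WBITS) that the proposed fix in this report passes to zlib.decompressobj.

```python
import gzip
import zlib

# Hypothetical payloads, chosen only for illustration.
member1 = gzip.compress(b"first")
member2 = gzip.compress(b"second")

# wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
first_output = d.decompress(member1 + member2)

# zlib stops at the end of the first member. Everything after it is left
# untouched in unused_data, and eof is set to True.
leftover = d.unused_data
reached_eof = d.eof

# Feeding the leftover bytes to a fresh decompressor recovers the next member.
d2 = zlib.decompressobj(16 + zlib.MAX_WBITS)
second_output = d2.decompress(leftover)

print(first_output, reached_eof, second_output)
```

A wrapper that returns only first_output and never inspects unused_data loses the second member, which matches the symptom reported here.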
Proposed Fix
When decompressobj.unused_data is non-empty after decompression, it is possible to create a new decompressor and continue processing the unused bytes:
def decompress(self, value: bytes, max_length: int = 0) -> bytes:
    # Assumes flush() sets a self._flushed flag (part of this proposal).
    if self._flushed:
        raise RuntimeError("Cannot call decompress() after flush()")
    data = value
    out = bytearray()
    remaining = max_length  # 0 means "no output limit", as in zlib
    while True:
        if remaining:
            chunk = self.decompressobj.decompress(data, remaining)
        else:
            chunk = self.decompressobj.decompress(data)
        out.extend(chunk)
        if remaining:
            # Stop once the caller's output limit has been reached.
            remaining = max(0, max_length - len(out))
            if remaining == 0:
                break
        # Handle concatenated gzip members: bytes that follow the end of
        # the current member are left in unused_data.
        unused = getattr(self.decompressobj, "unused_data", b"")
        if unused:
            data = unused
            self.decompressobj = zlib.decompressobj(16 + zlib.MAX_WBITS)
            continue
        break
    return bytes(out)
Test Case
The following unit test shows the issue:
import gzip
import unittest

from tornado.util import GzipDecompressor


class TestGzipDecompressor(unittest.TestCase):
    def test_concatenated_gzip_members(self):
        """Testing that concatenated gzip members are fully decompressed."""
        data1 = b"First gzip member content."
        data2 = b"Second gzip member content."
        member1 = gzip.compress(data1)
        member2 = gzip.compress(data2)
        concatenated = member1 + member2
        decompressor = GzipDecompressor()
        result = decompressor.decompress(concatenated)
        expected = data1 + data2
        self.assertEqual(result, expected,
                         "Concatenated gzip members should be fully decompressed")
Environment
- Tornado version: tested on current master branch
- Python version: 3.12.0