Get unused data from end of DecompressionStream? #39

Open
@Tschrock

I'm not very familiar with streams and compression, but hopefully this is understandable.

For deflate, the spec states "It is an error if there is additional input data after the ADLER32 checksum."
For gzip, the spec says "It is an error if there is additional input data after the end of the "member"."

As expected, Chrome's current implementation throws a TypeError ("Junk found after end of compressed data.") when extra data is written to a DecompressionStream.

This error can be caught and ignored, but there doesn't seem to be a way of retrieving the already-written-but-not-used "junk" data. There seems to be an assumption here that developers already know the length of the compressed data, and can provide exactly that data and nothing more. On the contrary, this "junk" data can be very important in cases where the compressed data is embedded in another stream and you don't know the length of the compressed data.

A good example of this is Git's PackFile format, which only tells you the size of the uncompressed data, not the compressed size. In such a case you must rely on the decompressor to tell you when it's done decompressing data, and then handle the remaining data.

My attempt at putting together an example:

// A stream with two compressed items
// deflate("Hello World") + deflate("FooBarBaz")
const data = new Uint8Array([
    0x78, 0x9c, 0xf3, 0x48, 0xcd, 0xc9, 0xc9, 0x57, 0x08, 0xcf, 0x2f, 0xca, 0x49, 0x01, 0x00, 0x18, 0x0b, 0x04, 0x1d,
    0x78, 0x9c, 0x73, 0xcb, 0xcf, 0x77, 0x4a, 0x2c, 0x72, 0x4a, 0xac, 0x02, 0x00, 0x10, 0x3b, 0x03, 0x57,
]);

// Decompress the first item
const item1Stream = new DecompressionStream('deflate');
// The write rejects with "TypeError: Junk found after end of compressed data."
item1Stream.writable.getWriter().write(data).catch(() => {});
// read() resolves to { value, done }; value is a Uint8Array holding "Hello World"
console.log(await item1Stream.readable.getReader().read());

// How do I get the remaining data (the "junk") in order to decompress the second item?
// I've already written it to the previous stream, and there's nothing to tell me how much was used or what's left over.
const item2Stream = new DecompressionStream('deflate');
// getRemainingDataSomehow() is a placeholder for the API I'm missing
item2Stream.writable.getWriter().write(getRemainingDataSomehow());
console.log(await item2Stream.readable.getReader().read()); // value would hold "FooBarBaz"

Now, as a workaround, I could write the data to my first stream one byte at a time, saving the most recently written byte and carrying it over when the writer throws that specific exception. But writing one byte at a time feels very inefficient and adds a lot of complexity, and checking for that specific error message seems fragile (it might change, and other implementations might use a different message).

Zlib itself provides a way to know what bytes weren't used (though I don't know any details about how.)
Python's zlib api provides an unused_data property that contains the unused bytes.
Node's zlib api provides a bytesWritten property that can be used to calculate the unused data.
It would be great to have something similar available in the DecompressionStream api.
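
For comparison, here is a minimal sketch of the Node approach mentioned above. It runs in Node.js rather than against the browser DecompressionStream API, and it assumes that zlib.inflateSync called with the `info: true` option returns `{ buffer, engine }`, and that `engine.bytesWritten` is the number of compressed bytes consumed. The helper name `inflateNext` is made up for illustration, and the input is generated with zlib.deflateSync rather than copied from the byte array above.

```javascript
// Sketch only: Node.js zlib, not the browser DecompressionStream API.
const zlib = require("node:zlib");

// Two zlib-wrapped ("deflate" format) items back to back, as in the example above.
const data = Buffer.concat([
  zlib.deflateSync(Buffer.from("Hello World")),
  zlib.deflateSync(Buffer.from("FooBarBaz")),
]);

// Hypothetical helper: inflate one item starting at `offset` and report
// the offset at which the next item begins.
function inflateNext(input, offset) {
  // With `info: true`, inflateSync returns { buffer, engine } instead of a Buffer.
  const { buffer, engine } = zlib.inflateSync(input.subarray(offset), { info: true });
  // engine.bytesWritten counts the compressed bytes the engine actually consumed,
  // so everything after it is the "unused" data.
  return { value: buffer, nextOffset: offset + engine.bytesWritten };
}

const item1 = inflateNext(data, 0);
const item2 = inflateNext(data, item1.nextOffset);
console.log(item1.value.toString(), item2.value.toString());
```

Something equivalent on DecompressionStream, whether a consumed-byte count or the leftover bytes themselves, would make the byte-at-a-time workaround unnecessary.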
