Check and decompress gzip file stream before processing as Avro file #571
Conversation
    protected void inputOpened(final InputStream input) throws IOException {
        dataFileStream = new DataFileStream<>(input, datumReader);
        InputStream decompressedInput = input;
        if (!input.markSupported()) {
Are mark/reset generally not supported by the input stream, or what are the conditions that could lead to this?
I see. Since BufferedInputStream always supports mark/reset, there is no need to check markSupported.
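The point above can be illustrated with a minimal sketch using only the JDK: BufferedInputStream overrides markSupported() to return true regardless of the underlying stream, which is why wrapping removes the need for the check. The class name below is illustrative, not from the PR.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class MarkSupportSketch {
    public static void main(String[] args) {
        // Some streams (e.g. raw network streams) do not support mark/reset,
        // but wrapping any stream in a BufferedInputStream guarantees it.
        InputStream raw = new ByteArrayInputStream(new byte[] {1, 2, 3});
        InputStream buffered = new BufferedInputStream(raw);
        System.out.println(buffered.markSupported()); // prints "true"
    }
}
```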
        decompressedInput.reset();

        // GZIP magic bytes: 0x1f 0x8b
        if (bytesRead == 2 && magic[0] == (byte) 0x1f && magic[1] == (byte) 0x8b) {
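For context, the sniffing logic in the diff can be sketched end to end with the JDK's built-in GZIPInputStream: peek at the first two bytes via mark/reset, and wrap the stream only if they match the GZIP magic. The class and method names here (GzipSniff, maybeDecompress) are illustrative, not from the PR.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSniff {
    // Peek at the first two bytes; if they match the GZIP magic
    // (0x1f 0x8b), wrap the stream in a GZIPInputStream.
    static InputStream maybeDecompress(InputStream input) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(input);
        buffered.mark(2);
        byte[] magic = new byte[2];
        int bytesRead = buffered.read(magic);
        buffered.reset();
        if (bytesRead == 2 && magic[0] == (byte) 0x1f && magic[1] == (byte) 0x8b) {
            return new GZIPInputStream(buffered);
        }
        return buffered;
    }

    public static void main(String[] args) throws IOException {
        // Build a gzipped payload in memory and verify it round-trips.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("hello".getBytes(StandardCharsets.UTF_8));
        }
        InputStream in = maybeDecompress(new ByteArrayInputStream(bos.toByteArray()));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // prints "hello"
    }
}
```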
The connector is supposed to support other file compression types, e.g. snappy and zstd. I'm wondering whether we should handle all possible magic bytes here, or whether conditionally instantiating the right type of input stream could be done much more easily, i.e. based on the file.compression.type config. WDYT?
Using file.compression.type is more explicit; I think this is a better approach than detecting the magic bytes of every compression type.
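The config-driven alternative suggested above could look roughly like the sketch below. The enum and its values are hypothetical (the actual values of the connector's file.compression.type config are not shown in this thread); only the gzip branch is implemented here, since snappy and zstd wrappers live in external libraries.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class CompressionTypeDecorator {
    // Hypothetical enum mirroring possible file.compression.type values.
    enum CompressionType { NONE, GZIP, SNAPPY, ZSTD }

    // Choose the decompressing wrapper from the configured type instead of
    // sniffing magic bytes.
    static InputStream decorate(CompressionType type, InputStream input) throws IOException {
        switch (type) {
            case GZIP:
                return new GZIPInputStream(input);
            case SNAPPY:
            case ZSTD:
                // Would wrap with the SnappyInputStream / ZstdInputStream
                // classes from their respective libraries; omitted to keep
                // this sketch dependency-free.
                throw new UnsupportedOperationException("not shown here");
            default:
                return input;
        }
    }
}
```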
        decompressedInput.mark(2);
        byte[] magic = new byte[2];
        int bytesRead = decompressedInput.read(magic);
        decompressedInput.reset();
The variable name decompressedInput sounds wrong at this point: it might be an input that is not compressed at all, or one that will only later be identified as compressed. Since the purpose of this variable is to instantiate a BufferedInputStream that supports mark/reset, why not call it just bufferedInput instead?
Good point! Renamed now.
jclarysse left a comment
Thanks @tanbt for your contribution!
We'd probably need to refine a few things; please see my comments.
The S3 Source connector might fail to source compressed (gzip) Avro files from S3 to Kafka.
This change decompresses the input stream before processing it as an Avro data stream.
After the changes, I could no longer reproduce the issue, and the data is available in the destination Kafka.