Add raw_decode method to JSON and MsgPack decoders #821
base: main
Conversation
Adds a new `raw_decode` method on the JSON and MsgPack `Decoder` classes that allows decoding objects with trailing data after them. This mirrors the interface of the `raw_decode` method in the Python standard library's `json.JSONDecoder` class. This can be used, for example, to parse several concatenated messages, enabling similar functionality to that available through `msgpack.Unpacker` in the `msgpack` package.
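A minimal sketch of the intended usage, assuming `raw_decode` returns the decoded object together with the number of bytes it consumed (the shape the snippets later in this thread rely on); the sample data and loop are illustrative only:

```python
import msgspec

# Three JSON messages concatenated back to back.
data = b'{"id": 1}{"id": 2}{"id": 3}'

decoder = msgspec.json.Decoder()
pos = 0
while pos < len(data):
    # Trailing data after the first message is not an error with raw_decode;
    # it reports how many bytes it consumed so the caller can advance.
    obj, consumed = decoder.raw_decode(data[pos:])
    print(obj)
    pos += consumed
```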
Thanks for this one. Will this be useful for reading compressed files record-by-record? This is what I mostly use the old msgpack lib for:

```python
import msgpack
import zstandard as zstd

with open('archive.zst', 'rb') as fh:
    with zstd.ZstdDecompressor().stream_reader(fh, read_size=8388608) as sr:
        for obj in msgpack.Unpacker(sr):
            ...
```
If you know the maximum decompressed size of a MsgPack message ahead of time, then yes, it can be used for this, although it would require more code to deal with the details. For example:

```python
def iter_messages_from_file(f, decoder, max_message_size):
    buffer = f.read(max_message_size)
    while buffer:
        decoded, bytes_used = decoder.raw_decode(buffer)
        yield decoded
        # Drop the consumed prefix and top the buffer back up to max_message_size.
        buffer = buffer[bytes_used:] + f.read(bytes_used)
```

I should note that the implementation in the snippet above is rather inefficient, as it will be doing a lot of memory copies (both the slice and the concatenation copy the remaining buffer on every message).

If, on the other hand, the full decompressed data can be available in memory at once, it is quite efficient to iterate over all concatenated messages with minimal copying:

```python
def iter_messages_in_buffer(data, decoder):
    start = 0
    with memoryview(data) as overall_view:
        while start < len(overall_view):
            # Slicing a memoryview yields a zero-copy view of the remaining bytes.
            with overall_view[start:] as slice_view:
                decoded, bytes_used = decoder.raw_decode(slice_view)
                yield decoded
                start += bytes_used
```

I have pondered efficiently handling the streaming case with unbounded message sizes myself, but it is far from straightforward to add. There are two cases:
If the message is truncated, you will need to read more data into the buffer and try again. There are two issues with this:
To resolve this correctly, you would need to be able to halt and resume the MsgPack decoder. Having read the implementation of msgspec, I think this would hugely complicate the core, and is unlikely to be in scope for the project, though it is ultimately not my decision to make.

I did come up with one workaround that allows you to use msgspec while still handling unbounded message sizes from streaming sources. The key observation is that doing a very cursory parse of a MsgPack message, just to determine its length, in a way that can be suspended and resumed, is very easy, due to some specifics of the MsgPack format.

Suppose I had a "length detector" object to which you could feed data chunks, and it would be able to tell you once one of those chunks completed a message (and where in the chunk the message had been completed). Then you could do something like this:
```python
def iter_streamed_messages(f):
    length_detector = LengthDetector()
    buffer = bytearray()
    while True:
        chunk = f.read(1)  # Obviously inefficient, but cuts down on bookkeeping for the sake of example.
        if not chunk:
            break
        buffer.extend(chunk)
        # completes_message returns None if the accumulated message remains incomplete.
        # If the provided chunk completes the message, it returns the index into that
        # chunk at which the accumulated message ends.
        if length_detector.completes_message(chunk) is not None:
            # The length detector has determined that buffer contains a complete message.
            yield msgspec.msgpack.decode(buffer, type=Example)
            buffer.clear()
```

This is able to decode streaming messages of unbounded size in O(n) time. You may notice it also runs with the plain `msgspec.msgpack.decode`, without needing `raw_decode` at all. I have prototyped such a length detector in my personal time. I would also be happy to contribute the implementation to msgspec if the maintainer so chooses, though it does feel somewhat disconnected from the rest of msgspec's functionality, which is why it's not my first choice to do so.
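For concreteness, here is a deliberately partial sketch of such a length detector. The class name is a placeholder, this is an illustration rather than the prototype mentioned above, and only a handful of MsgPack type codes are handled (positive/negative fixint, nil, bool, fixstr, fixarray, fixmap). The point is that the only state needed to suspend and resume the scan is a pair of counters:

```python
# Partial sketch only: illustrates the idea, not a complete MsgPack scanner.
class PartialLengthDetector:
    def __init__(self):
        self.values_needed = 1  # MsgPack values still required to complete the message
        self.payload_skip = 0   # raw payload bytes still to skip for the current value

    def completes_message(self, chunk):
        """Return the index just past the end of the current message within `chunk`,
        or None if more data is needed. On completion the detector resets itself;
        any bytes in `chunk` after the returned index belong to the next message
        and must be fed again."""
        i = 0
        while True:
            if self.payload_skip == 0 and self.values_needed == 0:
                self.values_needed = 1  # ready for the next message
                return i
            if i >= len(chunk):
                return None
            if self.payload_skip:
                # Still skipping the payload of the current value; take what this chunk offers.
                step = min(self.payload_skip, len(chunk) - i)
                self.payload_skip -= step
                i += step
                continue
            b = chunk[i]
            i += 1
            self.values_needed -= 1  # this header byte starts (and may finish) one value
            if b <= 0x7F or b >= 0xE0 or b in (0xC0, 0xC2, 0xC3):
                pass  # positive/negative fixint, nil, bool: a single byte, no payload
            elif 0x80 <= b <= 0x8F:
                self.values_needed += 2 * (b & 0x0F)  # fixmap: N key/value pairs follow
            elif 0x90 <= b <= 0x9F:
                self.values_needed += b & 0x0F        # fixarray: N elements follow
            elif 0xA0 <= b <= 0xBF:
                self.payload_skip = b & 0x1F          # fixstr: N payload bytes follow
            else:
                raise NotImplementedError(f"type code {b:#x} not handled in this sketch")
```

A complete detector would also need branches for the remaining type codes (str/bin/ext with explicit length fields, the fixed-width numeric codes, array16/32, map16/32), plus a little extra state for header length fields that straddle chunk boundaries, but each of those follows the same pattern of reading a known number of header bytes and bumping the counters.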
My familiarity with the source code here is extremely limited, but I think I saw
I see that; in any case, I think it's probably out of scope of this PR. The interface introduced in this PR can be efficiently used when the entire sequence of messages is already available in memory. This is the same use case that the standard library chooses to support in `json.JSONDecoder.raw_decode`.

Appendix: if you want to measure the cost of skipping unknown fields, the following forces msgspec to invoke `mpack_skip` over a large value:

```python
from typing import Any

import msgspec


class A(msgspec.Struct):
    a: Any


class B(msgspec.Struct):
    pass


data = msgspec.msgpack.encode(A([0] * 1_000_000))
# 'data' at the top level is a mapping with a key 'a',
# but type B has no field 'a',
# so msgspec will invoke mpack_skip to skip over the value of key 'a'.
msgspec.msgpack.decode(data, type=B)
```
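For reference, one way to time that skip with the standard library's `timeit`, reusing the `data` and `B` from the snippet above:

```python
import timeit

# Decoding into B is dominated by skipping the million-element value stored
# under the unknown key 'a', so this roughly times mpack_skip itself.
elapsed = timeit.timeit(lambda: msgspec.msgpack.decode(data, type=B), number=10)
print(f"mean decode time with skip: {elapsed / 10:.6f} s")
```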