uPickle fails to parse JSON files >2gb due to integer overflow in parser (500USD Bounty) #656

@megri

From the maintainer Li Haoyi: I'm putting a 500USD bounty on this issue, payable by bank transfer on a merged PR fixing this.

The acceptance criterion is to update upickle.core.BufferingElemParser, ujson.ElemParser, and upack.MsgPackReader so that they work on files >2gb.

uPickle's BufferingElemParser infrastructure already has a dropBufferUntil API that is used to indicate when the parser reaches a point that it will never backtrack past. We can thus take advantage of these dropBufferUntil points to reset all indices to zero, accumulating the dropped offsets in a var droppedIndex: Long somewhere so we can still report the absolute (non-reset) index in parse error messages.
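
A minimal sketch of the rebasing idea, not uPickle's actual code. The class RebasingBuffer and its methods append, byteAt, size and absoluteIndex are hypothetical names made up for illustration; only dropBufferUntil and droppedIndex are taken from the description above.

```scala
// Hypothetical illustration only; this is NOT upickle.core.BufferingElemParser.
final class RebasingBuffer(initialCapacity: Int = 16) {
  private var buffer = new Array[Byte](initialCapacity)
  private var length = 0              // number of valid bytes currently buffered
  private var droppedIndex: Long = 0L // total number of bytes discarded so far

  /** Append bytes to the end of the buffer, growing the array if needed. */
  def append(bytes: Array[Byte]): Unit = {
    val needed = length + bytes.length
    if (needed > buffer.length) {
      buffer = java.util.Arrays.copyOf(buffer, math.max(buffer.length * 2, needed))
    }
    System.arraycopy(bytes, 0, buffer, length, bytes.length)
    length = needed
  }

  /** Discard everything before local index `i`; local index `i` becomes 0. */
  def dropBufferUntil(i: Int): Unit = {
    require(i >= 0 && i <= length, s"index $i out of range")
    System.arraycopy(buffer, i, buffer, 0, length - i)
    length -= i
    droppedIndex += i // the Long accumulator keeps the absolute position
  }

  /** Absolute position in the input, suitable for error messages. */
  def absoluteIndex(localIndex: Int): Long = droppedIndex + localIndex

  def byteAt(localIndex: Int): Byte = buffer(localIndex)
  def size: Int = length
}
```

Because every local index is measured from the last drop point, it stays small as long as the parser drops regularly, while droppedIndex carries the part that no longer fits in an Int.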

The various Visitor.* methods are hard-coded to take an index: Int as part of their signatures, and due to binary compatibility we cannot change that. For now, if the parse input grows beyond 2gb we can just pass in index = -1 to the downstream Visitors, which should be accustomed to taking -1 as the index, as that is the value passed when the visitor is driven by a non-indexed input (e.g. feeding a ujson.Value into the visitor, or feeding a Scala data structure via upickle.default.transform(???).to(???)).
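
One way the -1 fallback could look, as a sketch under assumptions: VisitorIndex and toVisitorIndex are hypothetical names, and the helper simply maps an absolute Long offset to the Int that would be handed to a visitor.

```scala
// Hypothetical helper; not part of uPickle's API.
object VisitorIndex {
  /** Returns the offset as an Int when it fits, otherwise -1 ("no index"). */
  def toVisitorIndex(absolute: Long): Int =
    if (absolute >= 0L && absolute <= Int.MaxValue.toLong) absolute.toInt
    else -1
}
```

Offsets that still fit in an Int are passed through unchanged, so inputs below the limit would see the same indices as before.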

This should be tested with a large unit test parsing a 5gb blob of synthetic data into a (smaller) in-memory data structure to validate that it works. The benchmark suite should also be run on JVM/JS/Native before and after the change to verify we aren't regressing anything too much.
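
The synthetic input does not need to exist on disk or in memory. A rough sketch of a generator, assuming a JSON array of identical elements is an acceptable test payload: SyntheticJsonArray and totalBytes are hypothetical names, and the element and count used in a real test would be chosen by whoever writes it.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets

// Hypothetical test helper: streams `[elem,elem,...]` with `count` copies of
// `elem`, computing each byte from its position instead of storing the data.
final class SyntheticJsonArray(elem: String, count: Long) extends InputStream {
  private val elemBytes = elem.getBytes(StandardCharsets.UTF_8)
  private val unit: Long = elemBytes.length + 1L // one element plus its separator
  val totalBytes: Long = if (count == 0L) 2L else 1L + count * unit
  private var pos: Long = 0L // a Long, so the position itself cannot overflow

  override def read(): Int = {
    if (pos >= totalBytes) -1
    else {
      val b: Int =
        if (pos == 0L) '['.toInt
        else if (pos == totalBytes - 1L) ']'.toInt
        else {
          val r = ((pos - 1L) % unit).toInt
          if (r < elemBytes.length) elemBytes(r) & 0xff else ','.toInt
        }
      pos += 1L
      b
    }
  }
}
```

For example, 700000000 copies of a 7-byte element give a stream of roughly 5.6 GB. A real test would probably also override the bulk read(Array[Byte], Int, Int) method, since reading byte by byte is slow at that size.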

See https://github.com/orgs/com-lihaoyi/discussions/6 for other bounties and the terms and conditions that bounties operate under.


Reading a JSON file of the format Seq[Map[String, T]] with ujson.read(os.read.stream(path)) fails due to an overflow in the internal read buffer.

The file is 2.7 GiB in size.

Stack trace:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -84 is negative
        at java.base/java.lang.System.arraycopy(Native Method)
        at upickle.core.ByteBuilder.appendAll(ByteBuilder.scala:38)
        at upickle.core.BufferingByteParser.appendBytesToBuilder(BufferingByteParser.scala:164)
        at upickle.core.BufferingByteParser.appendBytesToBuilder$(BufferingByteParser.scala:24)
        at ujson.ByteParser.appendBytesToBuilder(ByteParser.scala:19)
        at ujson.ByteParser.parseStringToOutputBuilder(ByteParser.scala:638)
        at ujson.ByteParser.parseStringKey(ByteParser.scala:629)
        at ujson.ByteParser.parseNested(ByteParser.scala:390)
        at ujson.ByteParser.parseTopLevel0(ByteParser.scala:338)
        at ujson.ByteParser.parseTopLevel(ByteParser.scala:323)
        at ujson.ByteParser.parse(ByteParser.scala:72)
        at ujson.InputStreamParser$.transform(InputStreamParser.scala:26)
        at ujson.ReadableLowPri.ujson$ReadableLowPri$$anon$2$$_$transform$$anonfun$1(Readable.scala:38)
        at os.read$stream$$anon$1.readBytesThrough(ReadWriteOps.scala:253)
        at ujson.ReadableLowPri$$anon$2.transform(Readable.scala:38)
        at ujson.Readable$.transform(Readable.scala:15)
        at ujson.Readable$.transform(Readable.scala:14)
        at upickle.core.BufferedValue$.maybeSortKeysTransform(BufferedValue.scala:76)
        at ujson.package$.transform(package.scala:8)
        at ujson.package$.read$$anonfun$1(package.scala:15)
        at upickle.core.TraceVisitor$.withTrace(TraceVisitor.scala:18)
        at ujson.package$.read(package.scala:15)
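
The trace does not show the exact arithmetic that produces -84, but a negative length is what one would expect once a byte offset beyond Int.MaxValue (2 GiB minus one byte) is held in an Int. A small illustration of that wraparound, independent of uPickle; IntOverflowDemo and its members are names invented for this example.

```scala
// Illustration of 32-bit wraparound; unrelated to uPickle's own code.
object IntOverflowDemo {
  // One byte past the largest offset an Int can represent.
  val pastLimit: Long = Int.MaxValue.toLong + 1L

  // Narrowing a too-large Long to Int silently yields a negative number.
  val truncated: Int = pastLimit.toInt

  // Int addition wraps around instead of failing.
  def addOne(i: Int): Int = i + 1
}
```

Here truncated equals Int.MinValue, and addOne(Int.MaxValue) wraps to the same value, so any length or offset derived from such an index can end up negative.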
