Description
From the maintainer Li Haoyi: I'm putting a 500 USD bounty on this issue, payable by bank transfer on a merged PR fixing this.
The acceptance criterion is to update upickle.core.BufferingElemParser, ujson.ElemParser, and upack.MsgPackReader to work on files larger than 2 GB.
uPickle's BufferingElemParser infrastructure already has a dropBufferUntil API that is used to indicate when the parser reaches a point past which it will never backtrack. We can thus take advantage of these dropBufferUntil points to reset all indices to zero, accumulating the dropped offsets in a var droppedIndex: Long somewhere so we can still report the original, non-reset index in parse error messages.
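To make the idea concrete, here is a minimal sketch of the rebasing scheme. It is not uPickle's actual BufferingElemParser: the class SlidingBuffer and its methods append, apply and absoluteIndex are hypothetical names for illustration, and only dropBufferUntil and droppedIndex come from the description above. It also assumes the retained window between drops stays far below 2 GB.

```scala
// Hypothetical sketch, NOT uPickle's real BufferingElemParser.
// Local indices stay small Ints; the absolute position is tracked as a Long.
final class SlidingBuffer(initialSize: Int = 16) {
  private var buffer = new Array[Byte](initialSize)
  private var length = 0        // number of valid bytes currently in `buffer`
  private var droppedIndex = 0L // total number of bytes discarded so far

  /** Append newly read bytes, growing the backing array if needed. */
  def append(bytes: Array[Byte]): Unit = {
    val needed = length + bytes.length
    if (needed > buffer.length) {
      buffer = java.util.Arrays.copyOf(buffer, math.max(buffer.length * 2, needed))
    }
    System.arraycopy(bytes, 0, buffer, length, bytes.length)
    length = needed
  }

  /** Discard everything before local index `i` and rebase local indices to zero. */
  def dropBufferUntil(i: Int): Unit = {
    require(i >= 0 && i <= length, s"index $i out of range")
    System.arraycopy(buffer, i, buffer, 0, length - i)
    length -= i
    droppedIndex += i // a Long, so it keeps counting past Int.MaxValue
  }

  /** Read the byte at a local (rebased) index. */
  def apply(i: Int): Byte = {
    require(i >= 0 && i < length, s"index $i out of range")
    buffer(i)
  }

  /** Position in the original input, suitable for error messages. */
  def absoluteIndex(i: Int): Long = droppedIndex + i
}
```

After appending "hello world" and calling dropBufferUntil(6), local index 0 refers to 'w' while absoluteIndex(0) still reports 6, which is the number a parse error message would want.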
The various Visitor.* methods are hard-coded to take an index: Int as part of their signatures, and due to binary compatibility we cannot change that. For now, if the parse input grows beyond 2 GB we can just pass index = -1 to the downstream Visitors, which should already handle -1 because that is the value passed when the visitor is driven by a non-indexed input (e.g. feeding a ujson.Value into the visitor, or feeding a Scala data structure via upickle.default.transform(???).to(???)).
This should be tested with a large unit test parsing a 5 GB blob of synthetic data into a (smaller) in-memory data structure to validate that it works, with a run of the benchmark suite on JVM/JS/Native before and after the change to verify we aren't regressing anything too much.
See https://github.com/orgs/com-lihaoyi/discussions/6 for other bounties and the terms and conditions that bounties operate under.
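A small sketch of that fallback, assuming the parser tracks a droppedIndex: Long next to its local Int index. The object VisitorIndex and the function toVisitorIndex are hypothetical names, not part of uPickle.

```scala
// Hypothetical helper, not part of uPickle's API.
object VisitorIndex {
  /** Absolute position as an Int if it fits, otherwise the "no index" sentinel -1. */
  def toVisitorIndex(droppedIndex: Long, localIndex: Int): Int = {
    val absolute = droppedIndex + localIndex // computed as a Long, so no wraparound
    if (absolute >= 0 && absolute <= Int.MaxValue) absolute.toInt else -1
  }
}
```

With this, VisitorIndex.toVisitorIndex(0L, 42) gives 42, while any position past Int.MaxValue collapses to -1 rather than a wrapped negative number.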
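One possible way to get a multi-gigabyte input without writing it to disk is to generate it on the fly. The sketch below is only an illustration: RepeatedJsonArrayStream is a hypothetical class that streams a JSON array made of the same element repeated count times. It relies on InputStream's default bulk read, which calls the single-byte read() in a loop, so a real test would probably want to override read(Array[Byte], Int, Int) for speed.

```scala
import java.io.InputStream

// Hypothetical test helper: streams `[elem,elem,...,elem]` without materialising it.
final class RepeatedJsonArrayStream(elem: String, count: Long) extends InputStream {
  require(count >= 1 && elem.nonEmpty, "need at least one non-empty element")

  private val elemBytes = elem.getBytes("UTF-8")
  private val stride = elemBytes.length + 1L // element plus its trailing separator
  /** '[' + count elements + (count - 1) commas + ']' */
  val totalBytes: Long = count * stride + 1
  private var pos = 0L // absolute position, kept as a Long

  override def read(): Int = {
    if (pos >= totalBytes) -1
    else {
      val b: Int =
        if (pos == 0) '['.toInt
        else if (pos == totalBytes - 1) ']'.toInt
        else {
          val r = ((pos - 1) % stride).toInt // offset inside the current element slot
          if (r < elemBytes.length) elemBytes(r) & 0xff else ','.toInt
        }
      pos += 1
      b
    }
  }
}
```

For example, new RepeatedJsonArrayStream("""{"k":1}""", 3) yields [{"k":1},{"k":1},{"k":1}], which is 25 bytes. Picking count so that totalBytes exceeds 5L * 1024 * 1024 * 1024 gives an input of the size described above.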
Trying to read a JSON file of the format Seq[Map[String, T]] using ujson.read(os.read.stream(path)) fails due to an internal read buffer overflow.
The file is 2.7 GiB in size.
Stack trace:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -84 is negative
at java.base/java.lang.System.arraycopy(Native Method)
at upickle.core.ByteBuilder.appendAll(ByteBuilder.scala:38)
at upickle.core.BufferingByteParser.appendBytesToBuilder(BufferingByteParser.scala:164)
at upickle.core.BufferingByteParser.appendBytesToBuilder$(BufferingByteParser.scala:24)
at ujson.ByteParser.appendBytesToBuilder(ByteParser.scala:19)
at ujson.ByteParser.parseStringToOutputBuilder(ByteParser.scala:638)
at ujson.ByteParser.parseStringKey(ByteParser.scala:629)
at ujson.ByteParser.parseNested(ByteParser.scala:390)
at ujson.ByteParser.parseTopLevel0(ByteParser.scala:338)
at ujson.ByteParser.parseTopLevel(ByteParser.scala:323)
at ujson.ByteParser.parse(ByteParser.scala:72)
at ujson.InputStreamParser$.transform(InputStreamParser.scala:26)
at ujson.ReadableLowPri.ujson$ReadableLowPri$$anon$2$$_$transform$$anonfun$1(Readable.scala:38)
at os.read$stream$$anon$1.readBytesThrough(ReadWriteOps.scala:253)
at ujson.ReadableLowPri$$anon$2.transform(Readable.scala:38)
at ujson.Readable$.transform(Readable.scala:15)
at ujson.Readable$.transform(Readable.scala:14)
at upickle.core.BufferedValue$.maybeSortKeysTransform(BufferedValue.scala:76)
at ujson.package$.transform(package.scala:8)
at ujson.package$.read$$anonfun$1(package.scala:15)
at upickle.core.TraceVisitor$.withTrace(TraceVisitor.scala:18)
at ujson.package$.read(package.scala:15)
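The negative length in the trace is consistent with a 32-bit index wrapping around once the input passes Int.MaxValue bytes, although the exact arithmetic inside ByteBuilder is not shown here. The following only illustrates that wraparound with a file size similar to the one reported; the object IntWraparound and its values are hypothetical.

```scala
// Illustration only: how a position in a ~2.7 GiB input looks as Long vs. Int.
object IntWraparound {
  val fileSize: Long = (2.7 * 1024 * 1024 * 1024).toLong // roughly the reported size
  val truncated: Int = fileSize.toInt                    // wraps to a negative number

  val lastValid: Int = Int.MaxValue
  val next: Int = lastValid + 1                          // wraps to Int.MinValue
}
```

Any length or offset derived from such a wrapped Int can come out negative, which System.arraycopy rejects with an ArrayIndexOutOfBoundsException.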