Binary data, big integers, and CBOR, TOML, XML, YAML support by 01mf02 · Pull Request #284 · 01mf02/jaq

01mf02 · 2025-05-22T09:32:33Z

This PR adds a large number of new features and improvements to jaq.

CBOR, TOML, XML, YAML reading/writing

This PR adds the --from and --to CLI arguments to jaq.
That allows you, for example, to read a YAML file, run a filter on it, and output its results as CBOR.
Furthermore, jaq now provides a few new filters, namely:

fromcbor
tocbor
fromyaml
toyaml
fromtoml
totoml
fromxml
toxml

In addition, this PR extends fromjson so that it can yield multiple (or no) output values. For example, "1 2" | fromjson yields two values in jaq, namely 1 and 2, whereas jq fails with an error message. The same holds for "" | fromjson, which yields no value in jaq and fails in jq.
This change makes the output of jaq --raw-input 'fromjson' file the same as the output of jaq --from json '.' file.
This also holds for all other formats, which can also yield an arbitrary number of output values.
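
To make the multi-value behaviour concrete, here is a minimal Python sketch of decoding zero or more whitespace-separated JSON values from a single string. It is not jaq's implementation; the helper name parse_many is made up for illustration, and it relies only on json.JSONDecoder.raw_decode from the Python standard library.

```python
import json

def parse_many(text):
    """Yield every JSON value found in `text`, in order (possibly none)."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip whitespace between values; raw_decode does not do this itself.
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
        if pos == len(text):
            return
        # raw_decode returns the value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value

print(list(parse_many("1 2")))  # [1, 2]
print(list(parse_many("")))     # []
```

Modelling the result as a generator mirrors the idea that a decoding filter produces a stream of outputs rather than exactly one value.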

Extended value type

This PR extends jaq's value type by:

binary data (see below),
non-string object keys, and
big integers.

I added these value variants originally because they are available in YAML/CBOR, but these turn out to be quite useful in general.

Non-string object keys allow using arbitrary values as object keys, e.g. {null: 0, true: 1, 2: 3, "str": 4, ["arr"]: 5, {}: 6}.
This means that you now effectively have access to a high-performance hash map in your jq programs when you run them with jaq! Note that to construct such objects in jq, you will need to surround non-string keys with parentheses, i.e. {(null): 0, (true): 1, (2): 3, "str": 4, (["arr"]): 5, ({}): 6}.

Integers that cannot be represented as machine-sized integers (isize) are now automatically represented as big integers (BigInt). For example, if you add two machine-sized integers and the result does not fit into a machine-sized integer, it is automatically calculated as a big integer.
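
The promotion rule can be sketched in Python as follows. This is only an illustration under the assumption of a 64-bit target; the function checked_add and the "isize"/"bigint" labels are hypothetical and do not correspond to jaq's actual code.

```python
ISIZE_BITS = 64  # assumption: a 64-bit target
ISIZE_MIN = -(1 << (ISIZE_BITS - 1))
ISIZE_MAX = (1 << (ISIZE_BITS - 1)) - 1

def checked_add(a, b):
    """Return ("isize", sum) if the sum fits a machine integer, else ("bigint", sum)."""
    total = a + b  # Python integers never overflow, so this is always exact
    kind = "isize" if ISIZE_MIN <= total <= ISIZE_MAX else "bigint"
    return kind, total

print(checked_add(1, 2))
print(checked_add(ISIZE_MAX, 1))
```

The second call falls outside the machine-integer range, so the sketch labels its result as a big integer instead of wrapping around.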

Binary data

This PR adds support for byte strings to jaq, following the design process in #309. Thanks a lot to @Maxdamantus for the very generous help!

In jq, all strings are UTF-8 strings. This PR generalises the notion of strings to mean "an array of bytes with an interpretation", where the interpretation states whether we want to treat the string as UTF-8 data or just as regular bytes without any meaning attached.

You can create a byte string from a UTF-8 string with tobytes. This is a constant-time operation if the input is a string.
This allows for several characteristics that are not currently available in jq/fq.

Memory-mapped string loading

This is a crucial prerequisite for handling binary data efficiently in a jq implementation.
Currently, neither jq nor fq seems to do this.

$ cargo run -- -Rs 'tobytes | .[:1]' ~/Downloads/ubuntu-22.04.5-desktop-amd64.iso
"\xeb"
0.07user 0.03system 0:00.11elapsed
$ /usr/bin/time fq -Rs 'tobytes | .[:1]' ~/Downloads/ubuntu-22.04.5-desktop-amd64.iso
Command terminated by signal 9 <---- ran out of RAM
6.69user 37.50system 0:54.94elapsed
$ /usr/bin/time jq -Rs '.[:1]' ~/Downloads/ubuntu-22.04.5-desktop-amd64.iso 
"�"
45.85user 2.50system 0:59.29elapsed

We use tobytes to interpret the raw, slurped input (-Rs) as bytes, which then allows us to use constant-time indexing by bytes (instead of linear-time indexing by characters, done by default for UTF-8 strings in jq).
We can see that jaq terminates almost instantly, whereas fq runs out of memory after about 55 seconds, and jq takes about a minute (of which about 46 seconds are user time) until it reports the first byte of the input. Note that because jq does not implement tobytes, I used regular UTF-8 string slicing as the next-best thing to get at least some result.
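
For readers unfamiliar with memory mapping, the following Python sketch shows the general technique with the standard mmap module: the file is mapped into the address space, and only the pages that are actually touched get loaded. It illustrates the idea, not jaq's implementation; first_bytes is a hypothetical helper, and the small temporary file merely stands in for a large image.

```python
import mmap
import os
import tempfile

def first_bytes(path, n=1):
    """Return the first `n` bytes of the file at `path` without reading it all."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are only loaded when touched.
        # Note: mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:n]

# Demo on a small temporary file standing in for a large image.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xeb\x63\x90" + bytes(4096))
demo_first = first_bytes(tmp.name)
os.remove(tmp.name)
print(demo_first)  # b'\xeb'
```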

Constant-time byte string slicing

In the next example, we build up a long byte string ("aa...a"), then slice it repeatedly, removing one byte at a time until only a single byte remains.

$ /usr/bin/time cargo run --release -q -- -n '[limit(100000; repeat("a" | tobytes))] | add | last(recurse(.[1:]; length > 0))'
"a"
0.23user 0.04system 0:00.29elapsed 97%CPU [...]
$ /usr/bin/time cargo run --release -q -- -n '[limit(200000; repeat("a" | tobytes))] | add | last(recurse(.[1:]; length > 0))'
"a"
0.41user 0.03system 0:00.45elapsed 99%CPU [...]

$ /usr/bin/time fq -n '[limit(100000; repeat("a" | tobytes))] | add | last(recurse(.[1:]; length > 0))'
"a"
12.60user 0.20system 0:07.05elapsed [...]
$ /usr/bin/time fq -n '[limit(200000; repeat("a" | tobytes))] | add | last(recurse(.[1:]; length > 0))'
"a"
45.63user 0.39system 0:25.77elapsed [...]

Here, we can see that jaq (0.23 seconds of user time) is significantly faster than fq (12.6 seconds), and that doubling the string size roughly doubles jaq's runtime (0.41 seconds), whereas it nearly quadruples fq's runtime (45.6 seconds), which suggests that each slice takes linear time in fq. I opened an issue at wader/fq#1168.
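
The difference between copying and sharing slices can be reproduced in Python. Slicing a bytes object copies the remaining data, so the loop below would be quadratic with plain bytes; slicing a memoryview only creates a new window onto the same buffer, which keeps the loop linear. This is a sketch of the general technique, presumably similar in spirit to constant-time slicing, and last_nonempty_suffix is a made-up name.

```python
def last_nonempty_suffix(data):
    """Repeatedly drop the first byte, like last(recurse(.[1:]; length > 0))."""
    view = memoryview(data)  # shares the buffer instead of copying it
    while len(view) > 1:
        view = view[1:]      # O(1): a new view onto the same memory
    return bytes(view)

print(last_nonempty_suffix(b"a" * 100_000))  # b'a'
```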

Byte offset calculation

jaq adds a byteoffset filter that determines the offset of a string that (at least partially) overlaps in memory with another string. In particular, if we have a string $s and an integer $n, then ($s[$n:] | byteoffset($s)) == $n.
To the best of my knowledge, such a filter is available neither in jq nor fq.
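
One way to picture how such an offset can be computed cheaply is to model a string as a window onto a shared buffer. The Python sketch below is an assumption about the general idea, not a description of jaq's internals; the ByteSlice class and its method names are hypothetical.

```python
class ByteSlice:
    """A window [start, end) onto a shared, immutable buffer (hypothetical type)."""

    def __init__(self, buf, start=0, end=None):
        self.buf = buf
        self.start = start
        self.end = len(buf) if end is None else end

    def __len__(self):
        return self.end - self.start

    def slice_from(self, n):
        """Equivalent of .[n:] for n >= 0; shares the buffer, so it is O(1)."""
        return ByteSlice(self.buf, min(self.start + n, self.end), self.end)

    def byteoffset(self, parent):
        """Offset of this slice relative to `parent`; requires a shared buffer."""
        if self.buf is not parent.buf:
            raise ValueError("slices do not share a buffer")
        return self.start - parent.start

s = ByteSlice(b"hello world")
n = 6
print(s.slice_from(n).byteoffset(s) == n)  # True
```

Because every slice remembers where it starts in the shared buffer, the offset is a single subtraction and no separate range bookkeeping is needed.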

Raw strings

In addition to UTF-8 and byte strings, this PR also introduces the concept of "raw strings". These are byte strings, but with the twist that they are output "as-is", meaning as if the --raw-output flag was used. However, --raw-output only impacts "top-level values", meaning that it prints strings contained in an array or object like normal strings. A raw string, on the other hand, is always printed "as-is", even if it is inside of an array or object. That allows users to customise the output in new ways, e.g. coloring output differently, outputting images on the terminal, or recreating fq's pretty printing for bytes.

Implications

The availability of high-performance binary data operations, in particular constant-time loading (via memory mapping) and constant-time slicing, makes it feasible to write binary format decoders in the jq language.
That's because parsing usually involves loading data once, followed by frequent input slicing. Having these operations in constant time is a prerequisite for fast parsing.
Furthermore, the new byteoffset filter allows us to quickly determine the range of a string slice with respect to a parent string. That is useful because it relieves us from tracking byte ranges separately, as is done in fq.

All this is exciting because it shows the path into a future where binary format decoders can be written once, then be reused between fq and jaq (and perhaps even jq at some point).

JSON syntax

This PR extends the value type used by jaq in order to accommodate values that appear in formats like CBOR and YAML, but cannot be represented by JSON. In particular, this includes objects with non-string keys (e.g. {1: 2}) and binary data.
Note, however, that even jq values do not map perfectly one-to-one to JSON values; the most prominent case is NaN, which jq prints as null. The idea behind jq's strategy is to map all values to JSON, at the cost of losing information.
So far, jaq has followed this path, but the appearance of many more values that cannot be losslessly encoded as JSON has made this strategy increasingly less appealing. Therefore, this PR marks the adoption of a new paradigm: jaq encodes values as valid JSON if and only if they can be encoded losslessly as JSON.

That means that if a value cannot be losslessly encoded as valid JSON, it is encoded as invalid JSON. For example, jaq -nc '[nan, infinite]' now yields [NaN,Infinity] (analogous to JavaScript). Furthermore, strings are printed as-is, even if they contain invalid UTF-8 sequences. The idea is the same: if jaq cannot encode a value losslessly as JSON, it encodes the value as invalid JSON.
This strategy allows you to find invalid JSON more easily, just by running it through a JSON validator.
(One alternative to this behaviour would be to fail with an error message when trying to print a non-JSON-encodable value. However, for this to work, we would either need to traverse a value twice --- once to validate, once to print --- or construct a JSON-encoded value during validation, resulting in a potentially large allocation. Both alternatives cost time and/or memory, which makes both of them unattractive as default behaviour.)
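
As an illustration of how such output can be detected downstream, the following Python sketch uses the standard json module. Python's encoder happens to emit the same JavaScript-style names by default, and its decoder can be made strict via the parse_constant hook. The helper names reject_constant and is_strict_json are made up for this example.

```python
import json

values = [float("nan"), float("inf")]

# Python's encoder also emits the JavaScript-style names by default.
encoded = json.dumps(values, separators=(",", ":"))
print(encoded)  # [NaN,Infinity]

def reject_constant(name):
    # Called by the decoder for NaN, Infinity and -Infinity.
    raise ValueError(f"not valid JSON: {name}")

def is_strict_json(text):
    """Return True only if `text` parses without the non-standard constants."""
    try:
        json.loads(text, parse_constant=reject_constant)
        return True
    except ValueError:
        return False

print(is_strict_json(encoded))       # False
print(is_strict_json("[null,1.5]"))  # True
```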

As a side effect of this new paradigm, jaq's JSON parser is now slightly more liberal in what input it accepts. In particular, jaq's JSON parser now accepts strings containing non-UTF-8 characters (such as the FF byte) as well as objects containing non-string keys. The reasoning behind allowing these changes is that they actually make the parser faster and simpler, while continuing to parse valid JSON as before. This is, for the time being, the general policy to determine future changes of the parser. For example, adding comment syntax (// or /* */) is not permissible under this policy, because this would make the parser more complex and slower.

Byte string syntax

Byte strings are printed by default with a syntax that is compatible with C and JavaScript, using ASCII characters, escape characters such as \n, \t, and hexadecimal escape characters such as \xFF for anything else.
This format has been chosen because it is human-readable (in contrast to base64), relatively compact, and machine-processable. For example, the sequence of bytes from 0 to 255 is printed as follows:

$ jaq -n '[range(256)] | tobytes'
"\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\x0b\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
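
The escaping rules can be read off the output above. The following Python sketch reproduces them as inferred from that output alone (short escapes for \b, \t, \n, \f, \r, the double quote and the backslash; printable ASCII as-is; lowercase \xNN for everything else). It is an approximation for illustration, not jaq's printer, and escape_bytes is a hypothetical name.

```python
SHORT_ESCAPES = {
    0x08: "\\b", 0x09: "\\t", 0x0A: "\\n", 0x0C: "\\f", 0x0D: "\\r",
    0x22: '\\"', 0x5C: "\\\\",
}

def escape_bytes(data):
    """Render `data` as a double-quoted string in the escape style shown above."""
    out = []
    for b in data:
        if b in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[b])
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))         # printable ASCII is kept as-is
        else:
            out.append(f"\\x{b:02x}")  # everything else becomes \xNN
    return '"' + "".join(out) + '"'

print(escape_bytes(bytes([0, 8, 9, 10, 11, 255])))  # "\x00\b\t\n\x0b\xff"
```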

To prevent confusion with regular strings, jaq prints byte strings with red color when output to a terminal, whereas it prints regular strings as green.

To get back actual binary data from such a string, there are several options:

Python: Write a file binary.py with the contents import sys; sys.stdout.buffer.write("\x00..\xFF".encode("latin1")) and run it with python binary.py.
Node: Write a file binary.js with the contents process.stdout.write(Buffer.from("\x00..\xff", "latin1")) and run it with node binary.js.

C: Write a file binary.c with the contents:

#include <stdio.h>
int main(void) {
    /* sizeof counts the implicit trailing NUL, so write one byte fewer */
    const char data[] = "\x00..\xff";
    fwrite(data, (sizeof data) - 1, 1, stdout);
    return 0;
}

Run it with gcc binary.c -o binary && ./binary.

You can verify the output with xxd, for example.

For other formats that support byte strings natively, such as CBOR and YAML, byte strings are printed using their native representation. For example:

$ jaq -n --to yaml '[range(32)] | tobytes'
---
!!binary AAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8=
...
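
The !!binary payload is plain base64, which can be checked with Python's standard base64 module. This sketch only verifies the encoding of the example above and says nothing about how jaq produces it.

```python
import base64

payload = bytes(range(32))
encoded = base64.b64encode(payload).decode("ascii")
print(encoded)  # AAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8=

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == payload
```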

TODO:

Document --from and --to in help.txt
Document YAML behaviour (supported tags !!int etc.)
Use new saphyr-parser version once saphyr-rs/saphyr#50 ("Using saphyr with Rust 1.65") is closed
Implement fromxml, toxml, fromyaml, toyaml
Encapsulate lexer errors in newtype to ease updates
Recover shared YAML values
Write parsing tests

wader · 2025-06-05T09:51:21Z

jaq-formats/README.md

+
+### Tags or The TAC Architecture
+
+Tags are represented by "TAC" objects. A TAC object may have the following fields:


Seems like a nice mix between friendliness and losslessness. Is it a jaq invention or standardized somewhere?

Thanks. :) It's indeed a "jaq invention". I found it by formulating a few example processing tasks and then trying to find a nice representation that makes these tasks compact to formulate in jq.

01mf02 added 25 commits February 8, 2025 18:21

Initial work on XML support.

460e55c

Read XML files in jaq.

a710b29

Add missing files.

b878103

Merge branch 'main' into xml

054321c

Make new & simpler XML parser.

a619be0

XML value printing.

666ba5a

Remove old XML parser.

9f9bd1c

Expose Map type.

a508004

Separate XML value conversion from printing.

bffafef

XML output!

6e8ce9d

Clippy.

12ef5eb

Remove quick-xml.

70d7971

Do not try to handle HTML.

aa11092

Format.

b31c13f

Error handling for XML parsing!

88a75e3

Report error for unclosed DOCTYPE.

8ee10ff

Serialise remaining kinds of JSON values to XML.

62c0ce2

More liberal invalid data errors.

83aeab1

Serialisation errors!

40a28b7

Remove jaq-core dependency.

eaa3dc5

Rename jaq-xml to jaq-formats.

a712024

Write a little README for format support.

3bbd3a6

Add missing lib.rs.

1649eee

Make a few subfunctions for XML conversion.

c38b6b8

More on XML.

6f9b2be

wader reviewed Jun 5, 2025

01mf02 added 4 commits June 5, 2025 18:03

Move extern crate alloc.

956506c

Reformat.

767ad7d

Simplify --join-output handling.

b7301f2

Add dependendy on saphyr-parser.

df8259b

01mf02 added 2 commits September 5, 2025 16:54

Make tojson/toyaml preserve invalid UTF-8.

50f1f7d

YAML tests.

8e96c2d

01mf02 force-pushed the xml branch from d1c5158 to 8e96c2d Compare September 5, 2025 14:58

01mf02 added 6 commits September 6, 2025 10:50

Move writing macros to own module, document design decisions.

7eb837b

Format.

c849d8a

Test that YAML writing preserves invalid UTF-8.

4955ed5

Make fromjson/fromcbor/... return stream of values.

474eb20

Also move named filters from jaq-json to own module.

Print special floats (e.g. NaN, Infinity) like JavaScript.

31214a7

Split format-specific filters into own module.

72ebe4e

01mf02 force-pushed the xml branch from 01c1582 to 72ebe4e Compare September 8, 2025 11:26

01mf02 added 2 commits September 8, 2025 15:35

Preserve invalid UTF-8 in "\(.)".

7bc432b

Make jaq-json depend on std unconditionally.

68c0b43

This is necessary because `"\(.)"` writes JSON via `write::write`, which uses `std::io`. Therefore, we have to depend on `std`.

01mf02 force-pushed the xml branch 2 times, most recently from 1075025 to 665687c Compare September 9, 2025 08:18

01mf02 added 2 commits September 9, 2025 10:19

Correct XML PI output, make a few types public, and create test data.

b54ef30

Add XML tests; move CBOR tests.

a00dfcb

01mf02 force-pushed the xml branch from 665687c to a00dfcb Compare September 9, 2025 08:19

01mf02 added 4 commits September 9, 2025 11:19

Make XML output preserve invalid UTF-8.

24388a5

Avoid lossy UTF-8 conversions in jaq-std.

ee41671

Fail for all non-UTF-8 strings in TOML.

21181b8

Handle TOML analogously to XML.

21d0abe

01mf02 force-pushed the xml branch from 8554c23 to aefa7c1 Compare September 10, 2025 18:46

01mf02 added 2 commits September 10, 2025 20:54

Allow indexing strings/arrays with an index > isize::MAX.

d87d216

Thanks, clippy!

57ed5ea

01mf02 force-pushed the xml branch from aefa7c1 to 57ed5ea Compare September 10, 2025 18:57

01mf02 mentioned this pull request Sep 12, 2025

Parallel processing with jaq_json #323

Closed

01mf02 merged commit 303f5d7 into main Sep 12, 2025
3 checks passed

01mf02 deleted the xml branch September 12, 2025 08:40

01mf02 mentioned this pull request Sep 25, 2025

BigInt support #249

Closed

pkoppstein mentioned this pull request Oct 1, 2025

Make byteoffset take two arguments. #337

Closed

01mf02 merged 143 commits into main from xml

4 participants

