Binary data, big integers, and CBOR, TOML, XML, YAML support #284

Merged
01mf02 merged 143 commits into main from xml
Sep 12, 2025

Owner

@01mf02 01mf02 commented May 22, 2025

This PR adds a large number of new features and improvements to jaq.

CBOR, TOML, XML, YAML reading/writing

This PR adds the --from and --to CLI arguments to jaq.
That allows you, for example, to read a YAML file, run a filter on it, and output its results as CBOR.
Furthermore, jaq now provides a few new filters, namely:

  • fromcbor
  • tocbor
  • fromyaml
  • toyaml
  • fromtoml
  • totoml
  • fromxml
  • toxml

In addition, this PR extends fromjson so that it can yield multiple (or no) output values. For example, "1 2" | fromjson yields two values in jaq, namely 1 and 2. In jq, this fails with an error message. The same holds for "" | fromjson, which yields no value in jaq and fails in jq.
This change makes the output of jaq --raw-input 'fromjson' file the same as the output of jaq --from json '.' file.
This also holds for all other formats, which can also yield an arbitrary number of output values.
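
To illustrate the multi-value semantics described above, here is a minimal Python sketch. It is not jaq's implementation; the helper name `from_json_stream` is hypothetical and only mimics what `"1 2" | fromjson` and `"" | fromjson` yield in jaq.

```python
import json


def from_json_stream(text):
    """Yield every JSON value found in `text`, in order.

    Empty or whitespace-only input yields nothing instead of failing.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip whitespace between (and after) values.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            return
        # raw_decode parses one value and returns the index where it stopped.
        value, pos = decoder.raw_decode(text, pos)
        yield value


print(list(from_json_stream("1 2")))  # two values
print(list(from_json_stream("")))     # no value
```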

Extended value type

This PR extends jaq's value type by:

  • binary data (see below),
  • non-string object keys, and
  • big integers.

I added these value variants originally because they are available in YAML/CBOR, but these turn out to be quite useful in general.

Non-string object keys allow using arbitrary values as object keys, e.g. {null: 0, true: 1, 2: 3, "str": 4, ["arr"]: 5, {}: 6}.
This means that you now effectively have access to a high-performance hash map in your jq programs when you run them with jaq! Note that to construct such objects in jq, you will need to surround non-string keys with parentheses, i.e. {(null): 0, (true): 1, (2): 3, "str": 4, (["arr"]): 5, ({}): 6}.

Integers that cannot be represented as machine-sized integers (isize) are now automatically represented as big integers (BigInt). For example, if you add two normal integers and the result does not fit into a machine-sized integer, it is automatically calculated as a big integer.
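
The promotion rule can be sketched in Python as follows. This is only a model of the behaviour described above, assuming a 64-bit isize; the function name `add_with_promotion` and the string labels are hypothetical, and jaq's actual arithmetic is implemented in Rust.

```python
ISIZE_BITS = 64  # assumption: a 64-bit platform
ISIZE_MIN = -(1 << (ISIZE_BITS - 1))
ISIZE_MAX = (1 << (ISIZE_BITS - 1)) - 1


def add_with_promotion(a, b):
    """Add two integers and report which representation the result needs."""
    total = a + b  # Python integers never overflow, so this is exact
    kind = "isize" if ISIZE_MIN <= total <= ISIZE_MAX else "BigInt"
    return kind, total


print(add_with_promotion(1, 2))          # stays machine-sized
print(add_with_promotion(ISIZE_MAX, 1))  # needs a big integer
```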

Binary data

This PR adds support for byte strings to jaq, following the design process in #309. Thanks a lot to @Maxdamantus for the very generous help!

In jq, all strings are UTF-8 strings. This PR generalises the notion of strings to mean "an array of bytes with an interpretation", where the interpretation states whether we want to treat the string as UTF-8 data or just as regular bytes without any meaning attached.

You can create a byte string from a UTF-8 string with tobytes. This is a constant-time operation if the input is a string.
This design enables several capabilities that are not currently available in jq/fq.

Memory-mapped string loading

This is a crucial prerequisite for handling binary data efficiently in a jq implementation.
Currently, neither jq nor fq seems to do this.

$ /usr/bin/time cargo run -- -Rs 'tobytes | .[:1]' ~/Downloads/ubuntu-22.04.5-desktop-amd64.iso
"\xeb"
0.07user 0.03system 0:00.11elapsed
$ /usr/bin/time fq -Rs 'tobytes | .[:1]' ~/Downloads/ubuntu-22.04.5-desktop-amd64.iso
Command terminated by signal 9 <---- ran out of RAM
6.69user 37.50system 0:54.94elapsed
$ /usr/bin/time jq -Rs '.[:1]' ~/Downloads/ubuntu-22.04.5-desktop-amd64.iso 
"�"
45.85user 2.50system 0:59.29elapsed

We use tobytes to interpret the raw, slurped input (-Rs) as bytes, which then allows us to use constant-time indexing by bytes (instead of linear-time indexing by characters, done by default for UTF-8 strings in jq).
We can see that jaq terminates almost instantly, whereas fq runs out of memory, and jq takes about a minute (roughly 46 seconds of user time) until it reports the first byte of the input. Note that because jq does not implement tobytes, I used regular UTF-8 string slicing as the next-best thing to get at least some result.
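
For readers unfamiliar with memory mapping, the following Python sketch shows the general idea with the standard `mmap` module: only the pages that are actually touched get read, so fetching the first byte does not require loading the whole file. It illustrates the technique only, not jaq's Rust implementation; the helper name `first_bytes` and the small temporary file are made up for the demo.

```python
import mmap
import os
import tempfile


def first_bytes(path, n=1):
    """Return the first `n` bytes of a file through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing copies only the requested bytes out of the mapping.
            return mapped[:n]


# Demo on a small temporary file instead of a multi-gigabyte ISO image.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xeb" + bytes(1024))
    demo_path = tmp.name

demo_first = first_bytes(demo_path)
os.remove(demo_path)
print(demo_first)
```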

Constant-time byte string slicing

In the next example, we build up a long byte string ("aa...a"), then slice it repeatedly, always removing one byte until only a single byte remains.

$ /usr/bin/time cargo run --release -q -- -n '[limit(100000; repeat("a" | tobytes))] | add | last(recurse(.[1:]; length > 0))'
"a"
0.23user 0.04system 0:00.29elapsed 97%CPU [...]
$ /usr/bin/time cargo run --release -q -- -n '[limit(200000; repeat("a" | tobytes))] | add | last(recurse(.[1:]; length > 0))'
"a"
0.41user 0.03system 0:00.45elapsed 99%CPU [...]
$ /usr/bin/time fq -n '[limit(100000; repeat("a" | tobytes))] | add | last(recurse(.[1:]; length > 0))'
"a"
12.60user 0.20system 0:07.05elapsed [...]
$ /usr/bin/time fq -n '[limit(200000; repeat("a" | tobytes))] | add | last(recurse(.[1:]; length > 0))'
"a"
45.63user 0.39system 0:25.77elapsed [...]

Here, we can see that jaq (0.23 seconds of user time) is significantly faster than fq (12.6 seconds), and that doubling the string size roughly doubles jaq's runtime (0.41 seconds), whereas it nearly quadruples fq's runtime (45.6 seconds), which suggests quadratic behaviour in fq. I opened an issue at wader/fq#1168.
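
The difference between the two scaling behaviours can be reproduced in Python, as a rough analogy only. Slicing a `memoryview` creates a new view onto the same buffer in constant time, while slicing `bytes` copies the remainder each time, which makes the whole loop quadratic. The function name `shrink` is hypothetical.

```python
def shrink(view):
    """Drop one leading byte at a time until a single byte remains.

    Returns the number of slicing steps and the final one-byte value.
    """
    steps = 0
    while len(view) > 1:
        view = view[1:]  # memoryview: new view, no copy; bytes: full copy
        steps += 1
    return steps, view


data = b"a" * 100_000
steps, last = shrink(memoryview(data))  # linear overall
print(steps, bytes(last))
```

Passing `data` directly instead of `memoryview(data)` gives the same result, but every step copies the remaining bytes.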

Byte offset calculation

jaq adds a byteoffset filter that determines the offset of a string that (at least partially) overlaps in memory with another string. In particular, if we have a string $s and an integer $n, then ($s[$n:] | byteoffset($s)) == $n.
To the best of my knowledge, such a filter is available neither in jq nor fq.
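
A small Python model may make the invariant concrete. It assumes that a slice is a (buffer, start, end) triple sharing its parent's buffer; the class `ByteSlice` and its methods are hypothetical, and raising an error for slices that do not share a buffer is an assumption rather than documented jaq behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteSlice:
    buf: bytes  # backing buffer, shared between a slice and its parent
    start: int
    end: int

    @classmethod
    def of(cls, data):
        return cls(data, 0, len(data))

    def slice_from(self, n):
        """Model of `.[n:]` for n >= 0; clamps n to the slice length."""
        n = min(max(n, 0), self.end - self.start)
        return ByteSlice(self.buf, self.start + n, self.end)

    def byteoffset(self, parent):
        """Offset of this slice relative to the start of `parent`."""
        if self.buf is not parent.buf:
            raise ValueError("slices do not share a buffer")
        return self.start - parent.start


s = ByteSlice.of(b"hello world")
print(s.slice_from(6).byteoffset(s))  # 6, matching ($s[$n:] | byteoffset($s)) == $n
```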

Raw strings

In addition to UTF-8 and byte strings, this PR also introduces the concept of "raw strings". These are byte strings, but with the twist that they are output "as-is", meaning as if the --raw-output flag was used. However, --raw-output only affects top-level values, meaning that a string contained in an array or object is still printed like a normal string. A raw string, on the other hand, is always printed "as-is", even if it is inside an array or object. That allows users to customise the output in new ways, e.g. colouring output differently, outputting images on the terminal, or recreating fq's pretty printing for bytes.

Implications

The availability of high-performance binary data operations, in particular constant-time loading (via memory mapping) and constant-time slicing, makes it feasible to write binary format decoders in the jq language.
That's because parsing usually involves loading data once, followed by frequent input slicing. Having these operations in constant time is a prerequisite for fast parsing.
Furthermore, the new byteoffset filter allows us to quickly determine the range of a string slice with respect to a parent string. That is useful because it relieves us from tracking byte ranges separately, as is done in fq.

All this is exciting because it shows the path into a future where binary format decoders can be written once, then be reused between fq and jaq (and perhaps even jq at some point).

JSON syntax

This PR extends the value type used by jaq in order to accommodate values that appear in formats like CBOR and YAML, but cannot be represented by JSON. In particular, this includes objects with non-string keys (e.g. {1: 2}) and binary data.
Note, however, that even jq values do not map perfectly one-to-one to JSON values; the most prominent case is NaN, which jq prints as null. The idea behind jq's strategy is to map all values to JSON, at the cost of losing information.
So far, jaq has followed this path, but the appearance of many more values that cannot be losslessly encoded as JSON has made this strategy increasingly less appealing. Therefore, this PR marks the adoption of a new paradigm: jaq encodes values as valid JSON if and only if they can be encoded losslessly as JSON.

That means that if a value cannot be losslessly encoded as valid JSON, it is encoded as invalid JSON. For example, jaq -nc '[nan, infinite]' now yields [NaN,Infinity] (analogous to JavaScript). Furthermore, strings are printed as-is, even if they contain invalid UTF-8 sequences. The idea is the same: if jaq cannot encode a value losslessly as JSON, it encodes the value as invalid JSON.
This strategy allows you to find invalid JSON more easily, just by running it through a JSON validator.
(One alternative to this behaviour would be to fail with an error message when trying to print a non-JSON-encodable value. However, for this to work, we would either need to traverse a value twice --- once to validate, once to print --- or construct a JSON-encoded value during validation, resulting in a potentially large allocation. Both alternatives cost time and/or memory, which makes both of them unattractive as default behaviour.)
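
The "run it through a JSON validator" idea can be tried with Python's standard `json` module, which happens to follow the same convention when writing non-finite floats. This sketch is independent of jaq; the helper names `reject_constant` and `is_strict_json` are hypothetical.

```python
import json


def reject_constant(name):
    """Called by json.loads for NaN, Infinity and -Infinity."""
    raise ValueError(f"not valid JSON: {name}")


def is_strict_json(text):
    """Return True only if `text` parses as JSON without non-standard constants."""
    try:
        json.loads(text, parse_constant=reject_constant)
        return True
    except ValueError:
        return False


# Python, like jaq after this PR, keeps the information and emits invalid JSON.
encoded = json.dumps([float("nan"), float("inf")], separators=(",", ":"))
print(encoded)                     # [NaN,Infinity]
print(is_strict_json(encoded))     # False: the validator flags the lossless output
print(is_strict_json("[null,1]"))  # True
```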

As a side effect of this new paradigm, jaq's JSON parser is now slightly more liberal in what input it accepts. In particular, jaq's JSON parser now accepts strings containing non-UTF-8 characters (such as the FF byte) as well as objects containing non-string keys. The reasoning behind allowing these changes is that they actually make the parser faster and simpler, while continuing to parse valid JSON as before. This is, for the time being, the general policy to determine future changes of the parser. For example, adding comment syntax (// or /* */) is not permissible under this policy, because this would make the parser more complex and slower.

Byte string syntax

Byte strings are printed by default with a syntax that is compatible with C and JavaScript, using ASCII characters, escape characters such as \n, \t, and hexadecimal escape characters such as \xFF for anything else.
This format has been chosen because it is human-readable (in contrast to base64), relatively compact, and machine-processable. For example, the sequence of bytes from 0 to 255 is printed as follows:

$ jaq -n '[range(256)] | tobytes'
"\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\x0b\f\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"

To prevent confusion with regular strings, jaq prints byte strings in red when writing to a terminal, whereas it prints regular strings in green.

To get back actual binary data from such a string, there are several options:

  • Python: Write a file binary.py with the contents import sys; sys.stdout.buffer.write("\x00..\xFF".encode("latin1")) and run it with python binary.py.

  • Node: Write a file binary.js with the contents process.stdout.write(Buffer.from("\x00..\xff", "latin1")) and run it with node binary.js.

  • C: Write a file binary.c with the contents:

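    #include <stdio.h>
    int main(void) {
        /* "\x00..\xff" stands for the full escaped string printed by jaq */
        const char data[] = "\x00..\xff";
        /* sizeof counts the terminating NUL byte, so subtract 1 */
        fwrite(data, (sizeof data) - 1, 1, stdout);
        return 0;
    }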

    Run it with gcc binary.c -o binary && ./binary.

You can verify the output with xxd, for example.
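
As a further illustration, the escape syntax shown above can be modelled in a few lines of Python. This is a sketch inferred from the printed example, not jaq's actual printer; `escape_bytes` and `unescape_bytes` are hypothetical names. Decoding relies on the fact that Python bytes literals understand the same escapes (`\b`, `\t`, `\n`, `\f`, `\r`, `\"`, `\\` and `\xNN`).

```python
import ast

# Bytes that get a short escape instead of \xNN (inferred from the output above).
SHORT = {0x08: "\\b", 0x09: "\\t", 0x0A: "\\n", 0x0C: "\\f", 0x0D: "\\r",
         0x22: '\\"', 0x5C: "\\\\"}


def escape_bytes(data):
    """Render bytes in the C/JavaScript-compatible syntax shown above."""
    parts = []
    for b in data:
        if b in SHORT:
            parts.append(SHORT[b])
        elif 0x20 <= b < 0x7F:
            parts.append(chr(b))          # printable ASCII stays as-is
        else:
            parts.append(f"\\x{b:02x}")   # everything else as two hex digits
    return '"' + "".join(parts) + '"'


def unescape_bytes(text):
    """Turn the quoted, escaped form back into the original bytes."""
    return ast.literal_eval("b" + text)


text = escape_bytes(bytes(range(256)))
print(text[:40])
print(unescape_bytes(text) == bytes(range(256)))
```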

For other formats that support byte strings natively, such as CBOR and YAML, byte strings are printed using their native representation. For example:

$ jaq -n --to yaml '[range(32)] | tobytes'
---
!!binary AAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8=
...
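
The !!binary payload is standard base64, so it can be checked independently of jaq. A minimal Python check that the payload above decodes to the bytes 0 to 31:

```python
import base64

payload = "AAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8="
decoded = base64.b64decode(payload)

print(decoded == bytes(range(32)))  # True
```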

TODO:

  • Document --from and --to in help.txt
  • Document YAML behaviour (supported tags !!int etc.)
  • Use new saphyr-parser version once saphyr-rs/saphyr#50 ("Using saphyr with Rust 1.65") is closed
  • Implement fromxml, toxml, fromyaml, toyaml
  • Encapsulate lexer errors in newtype to ease updates
  • Recover shared YAML values
  • Write parsing tests


### Tags or The TAC Architecture

Tags are represented by "TAC" objects. A TAC object may have the following fields:
Contributor

Seems like a nice mix between friendliness and losslessness. Is it a jaq invention or standardized somewhere?

Owner Author

Thanks. :) It's indeed a "jaq invention". I found this by formulating a few example processing tasks and then trying to find a nice representation that makes these tasks compact to formulate in jq.

This is necessary because `"\(.)"` writes JSON via `write::write`,
which uses `std::io`. Therefore, we have to depend on `std`.
@01mf02 01mf02 force-pushed the xml branch 2 times, most recently from 1075025 to 665687c Compare September 9, 2025 08:18
@01mf02 01mf02 merged commit 303f5d7 into main Sep 12, 2025
3 checks passed
@01mf02 01mf02 deleted the xml branch September 12, 2025 08:40
@01mf02 01mf02 mentioned this pull request Sep 25, 2025

4 participants