wader
reviewed
Jun 5, 2025
### Tags or The TAC Architecture
Tags are represented by "TAC" objects. A TAC object may have the following fields:
Contributor
Seems like a nice mix between friendly and lossless. Is it a jaq invention, or is it standardized somewhere?
Owner
Author
Thanks. :) It's indeed a "jaq invention". I found this by formulating a few example processing tasks, and then tried to find a nice representation that makes these tasks compact to formulate in jq.
Also move named filters from jaq-json to own module.
This is necessary because `"\(.)"` writes JSON via `write::write`, which uses `std::io`. Therefore, we have to depend on `std`.
1075025 to
665687c
This PR adds a large number of new features and improvements to jaq.
CBOR, TOML, XML, YAML reading/writing
This PR adds the `--from` and `--to` CLI arguments to jaq. That allows you, for example, to read a YAML file, run a filter on it, and output its results as CBOR.
Furthermore, jaq now provides a few new filters, namely: `fromcbor`, `tocbor`, `fromyaml`, `toyaml`, `fromtoml`, `totoml`, `fromxml`, `toxml`.
In addition, this PR also extends `fromjson` to be able to yield multiple (or no) output values. For example, `"1 2" | fromjson` yields two values in jaq, namely `1` and `2`. In jq, this would fail with an error message. The same goes for `"" | fromjson`, which yields no value in jaq and fails in jq.
This change makes the output of `jaq --raw-input 'fromjson' file` the same as the output of `jaq --from json '.' file`. This also holds for all other formats, which can also yield an arbitrary number of output values.
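To illustrate what "yielding multiple (or no) values" means for a parser, here is a minimal Python sketch. It is not jaq's implementation; the function name `from_json_stream` is made up for this example, and it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json

JSON_WHITESPACE = " \t\n\r"


def from_json_stream(text):
    """Yield every JSON value in `text`; values may be separated by whitespace.

    Hypothetical helper: it mimics the multi-value behaviour described for
    jaq's `fromjson`, but is not taken from jaq.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip whitespace between (or after) values.
        while pos < len(text) and text[pos] in JSON_WHITESPACE:
            pos += 1
        if pos == len(text):
            return  # end of input: zero or more values were yielded
        # raw_decode parses one value starting at `pos` and returns
        # the value together with the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value
```

With this sketch, the input `"1 2"` produces two values and the empty string produces none, rather than raising an error.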
Extended value type
This PR extends jaq's value type by:
I added these value variants originally because they are available in YAML/CBOR, but these turn out to be quite useful in general.
Non-string object keys allow using arbitrary values as object keys, e.g. `{null: 0, true: 1, 2: 3, "str": 4, ["arr"]: 5, {}: 6}`. This means that you now effectively have access to a high-performance hash map in your jq programs when you run them with jaq! Note that to construct such objects in jq, you will need to surround non-string keys with parentheses, i.e. `{(null): 0, (true): 1, (2): 3, "str": 4, (["arr"]): 5, ({}): 6}`.
Integers that cannot be represented as machine-sized integers (`isize`) are now automatically represented as big integers (`BigInt`). For example, if you add two normal integers and the result does not fit into a machine-sized integer, it gets calculated as a big integer automatically.
Binary data
This PR adds support for byte strings to jaq, following the design process in #309. Thanks a lot to @Maxdamantus for the very generous help!
In jq, all strings are UTF-8 strings. This PR generalises the notion of strings to mean "an array of bytes with an interpretation", where the interpretation states whether we want to treat the string as UTF-8 data or just as regular bytes without any meaning attached.
You can create a byte string from a UTF-8 string with `tobytes`. This is a constant-time operation if the input is a string.
This allows for several characteristics that are not currently available in jq/fq.
Memory-mapped string loading
This is a crucial prerequisite for handling binary data efficiently in a jq implementation.
Currently, neither jq nor fq seems to do this.
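As a rough illustration of why memory mapping helps, the following Python sketch maps a file and reads single bytes from it without loading the whole file first. It uses the standard `mmap` module; the function name `first_and_last_byte` is hypothetical, and nothing here reflects how jaq does it internally.

```python
import mmap
import os
import tempfile


def first_and_last_byte(path):
    """Return the first and last byte of a non-empty file via a memory map.

    Hypothetical helper: the file is mapped, not read, so only the pages
    that are actually touched need to be loaded by the operating system.
    """
    with open(path, "rb") as f:
        # Length 0 means "map the whole file". Mapping an empty file fails.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Indexing a mapping is a constant-time byte lookup.
            return m[0], m[len(m) - 1]
```

Indexing the mapping returns an integer byte value, independent of how large the file is.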
We use `tobytes` to interpret the raw, slurped input (`-Rs`) as bytes, which then allows us to use constant-time indexing by bytes (instead of linear-time indexing by characters, done by default for UTF-8 strings in jq).
We can see that jaq terminates almost instantly, whereas fq runs out of memory, and jq takes about 45 seconds until it reports the first byte of the input. Note that because jq does not implement `tobytes`, I used regular UTF-8 string slicing as the next-best thing to get at least some result.
Constant-time byte string slicing
In the next example, we build up a long byte string ("aa...a"), then slice it repeatedly, always removing one element until the string is nearly empty.
Here, we can see that jaq (0.23 seconds) performs significantly faster than fq (12.6 seconds), and doubling the string size roughly doubles jaq's runtime (0.41 seconds), whereas it roughly quadruples fq's runtime (45.6 seconds), which indicates quadratic behaviour. I opened an issue at wader/fq#1168.
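The difference between linear and quadratic behaviour in such a loop comes down to whether each slice copies the remaining data. The Python sketch below shows the copy-free variant with `memoryview`, whose slices share the underlying buffer. It is an analogy only; the function name `drain_by_slicing` is made up, and it says nothing about how jaq or fq represent byte strings.

```python
def drain_by_slicing(data):
    """Drop the first byte repeatedly until nothing is left; return the step count.

    Hypothetical helper: every step creates a new view over the same buffer,
    so one step costs O(1) and the whole loop costs O(n). Slicing a `bytes`
    object instead would copy the remainder each time, giving O(n^2) overall.
    """
    view = memoryview(data)
    steps = 0
    while len(view) > 0:
        view = view[1:]  # no copy: only the view's start position changes
        steps += 1
    return steps
```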
Byte offset calculation
jaq adds a `byteoffset` filter that determines the offset of a string that (at least partially) overlaps in memory with another string. In particular, if we have a string `$s` and an integer `$n`, then `($s[$n:] | byteoffset($s)) == $n`.
To the best of my knowledge, such a filter is available in neither jq nor fq.
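One way to picture this is a slice type that remembers where it starts inside a shared buffer. The Python sketch below is a hypothetical model (the class `ByteSlice` and its methods are invented for this example and are not jaq's API); it only demonstrates the identity stated above for byte strings and in-range offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteSlice:
    """Hypothetical model of a byte string that shares storage with its parent."""

    buffer: bytes  # shared backing storage, never copied by slicing
    start: int
    end: int

    @classmethod
    def of(cls, data):
        return cls(data, 0, len(data))

    def slice_from(self, n):
        """Model of `$s[$n:]`: a new slice over the same buffer, in O(1)."""
        n = min(max(n, 0), self.end - self.start)  # clamp to the slice bounds
        return ByteSlice(self.buffer, self.start + n, self.end)

    def byteoffset(self, parent):
        """Offset of this slice relative to `parent`, if they share a buffer."""
        if self.buffer is not parent.buffer:
            raise ValueError("slices do not share a buffer")
        return self.start - parent.start
```

Because the offset is recoverable from the slice itself, a decoder written against such a type would not need to carry byte ranges alongside the data.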
Raw strings
In addition to UTF-8 and byte strings, this PR also introduces the concept of "raw strings". These are byte strings, but with the twist that they are output "as-is", meaning as if the `--raw-output` flag was used. However, `--raw-output` only impacts "top-level values", meaning that it prints strings contained in an array or object like normal strings. A raw string, on the other hand, is always printed "as-is", even if it is inside an array or object. That allows users to customise the output in new ways, e.g. coloring output differently, outputting images on the terminal, or recreating fq's pretty printing for bytes.
Implications
The availability of high-performance binary data operations, in particular constant-time loading (via memory mapping) and constant-time slicing makes it feasible to write binary format decoders in the jq language.
That's because parsing usually involves loading data once, followed by frequent input slicing. Having these operations run in constant time is a prerequisite for fast parsing.
Furthermore, the new `byteoffset` filter allows us to quickly determine the range of a string slice with respect to a parent string. That is useful because it relieves us from tracking byte ranges separately, as is done in fq.
All this is exciting because it shows the path into a future where binary format decoders can be written once, then be reused between fq and jaq (and perhaps even jq at some point).
JSON syntax
This PR extends the value type used by jaq in order to accommodate values that appear in formats like CBOR and YAML, but cannot be represented by JSON. In particular, this includes objects with non-string keys (e.g. `{1: 2}`) and binary data.
Note, however, that even jq values do not map perfectly one-to-one to JSON values; the most prominent case is `NaN`, which jq prints as `null`. The idea behind jq's strategy is to map all values to JSON, at the cost of losing information.
So far, jaq has followed this path, but the appearance of many more values that cannot be losslessly encoded as JSON has made this strategy increasingly unappealing. Therefore, this PR marks the adoption of a new paradigm: jaq encodes values as valid JSON if and only if they can be encoded losslessly as JSON.
That means that if a value cannot be losslessly encoded as valid JSON, it is encoded as invalid JSON. For example, `jaq -nc '[nan, infinite]'` now yields `[NaN,Infinity]` (analogous to JavaScript). Furthermore, strings are printed as-is, even if they contain invalid UTF-8 sequences. The idea is the same: if jaq cannot encode a value losslessly as JSON, it encodes the value as invalid JSON.
This strategy allows you to detect output that is not valid JSON more easily, just by running it through a JSON validator.
(One alternative to this behaviour would be to fail with an error message when trying to print a non-JSON-encodable value. However, for this to work, we would either need to traverse a value twice --- once to validate, once to print --- or construct a JSON-encoded value during validation, resulting in a potentially large allocation. Both alternatives cost time and/or memory, which makes both of them unattractive as default behaviour.)
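Python's standard `json` module happens to offer both behaviours discussed here, which makes for a compact comparison. The sketch below is only an analogy to jaq's output strategy; `strict_dumps` is a name made up for this example.

```python
import json

values = [float("nan"), float("inf")]

# Default behaviour: values without a JSON representation are written
# as NaN / Infinity, so the output is not valid JSON but loses nothing.
lenient = json.dumps(values, separators=(",", ":"))


def strict_dumps(value):
    """Alternative behaviour: refuse to print values that are not valid JSON."""
    # allow_nan=False makes the encoder raise ValueError on NaN and infinities.
    return json.dumps(value, allow_nan=False)
```

Here `lenient` is the string `[NaN,Infinity]`, whereas `strict_dumps(values)` raises a `ValueError`.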
As a side effect of this new paradigm, jaq's JSON parser is now slightly more liberal in what input it accepts. In particular, jaq's JSON parser now accepts strings containing invalid UTF-8 sequences (such as the `FF` byte) as well as objects containing non-string keys. The reasoning behind allowing these changes is that they actually make the parser faster and simpler, while continuing to parse valid JSON as before. This is, for the time being, the general policy to determine future changes of the parser. For example, adding comment syntax (`//` or `/* */`) is not permissible under this policy, because this would make the parser more complex and slower.
Byte string syntax
Byte strings are printed by default with a syntax that is compatible with C and JavaScript, using ASCII characters, escape sequences such as `\n` and `\t`, and hexadecimal escape sequences such as `\xFF` for anything else. This format has been chosen because it is human-readable (in contrast to base64), relatively compact, and machine-processable. For example, the sequence of bytes from 0 to 255 is printed as follows:
To prevent confusion with regular strings, jaq prints byte strings with red color when output to a terminal, whereas it prints regular strings as green.
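The escaping scheme described above can be sketched in a few lines of Python. This is an approximation under stated assumptions, not jaq's printer: the exact set of short escapes (here `\n`, `\t`, `\r`, `\\` and `\"`) and the uppercase hex digits are guesses, and the names `escape_bytes` and `roundtrips` are made up.

```python
import ast

# Assumed set of short escapes; jaq's actual choice may differ.
_SHORT_ESCAPES = {0x0A: "\\n", 0x09: "\\t", 0x0D: "\\r", 0x5C: "\\\\", 0x22: '\\"'}


def escape_bytes(data):
    """Render bytes as printable ASCII with C-style escapes (hypothetical)."""
    out = []
    for b in data:
        if b in _SHORT_ESCAPES:
            out.append(_SHORT_ESCAPES[b])
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))  # printable ASCII is kept as-is
        else:
            out.append("\\x%02X" % b)  # everything else as a hex escape
    return "".join(out)


def roundtrips(data):
    """Check that the escaped form decodes back to the original bytes."""
    literal = 'b"' + escape_bytes(data) + '"'
    return ast.literal_eval(literal) == data
```

The round-trip check reads the escaped text back as a Python bytes literal, which accepts the same escapes used in this sketch.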
To get back actual binary data from such a string, there are several options:
Python: Write a file `binary.py` with the contents `import sys; sys.stdout.buffer.write("\x00..\xFF".encode("latin1"))` and run it with `python binary.py`.
Node: Write a file `binary.js` with the contents `process.stdout.write(Buffer.from("\x00..\xff", "latin1"))` and run it with `node binary.js`.
C: Write a file `binary.c` with the contents:
Run it with `gcc binary.c -o binary && ./binary`.
You can verify the output with `xxd`, for example.
For other formats that support byte strings natively, such as CBOR and YAML, byte strings are printed using their native representation. For example:
TODO:
`--from` and `--to` in `help.txt`
`!!int` etc.)
`saphyr-parser` version once "Using saphyr with Rust 1.65" (saphyr-rs/saphyr#50) is closed
`fromxml`, `toxml`, `fromyaml`, `toyaml`