
epic: builds.json is large (somewhat by design) #138

@tianon


At the current moment, our production builds.json file is sitting at 55.2MiB, which is kind of large for a JSON file (and large enough that GitHub complains about it on every push, although not large enough that they block it -- that limit is 100MiB). 🤏

So while this isn't an urgent problem, it's certainly something we should be looking at and thinking about how to scale further, especially as the system grows and contains more data. 👀

low hanging fruit (change almost nothing)

tabs (--tab)

If we do nothing but switch from (jq's default) two-space indent to tabs, we drop to 49MiB, which is a large enough drop that it's probably worth doing even though it's not a silver bullet. ➡️

(all further serious ideas will assume a baseline of swapping to tabs for size comparison, if keeping a monolithic file)
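For concreteness, the tab numbers above come from nothing more than re-serializing with jq's --tab flag; a minimal way to reproduce the measurement (assuming wc -c for byte counts) looks like:

```console
# default (two-space) indentation vs tab indentation; wc -c counts bytes
$ jq . builds.json | wc -c
$ jq --tab . builds.json | wc -c
```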

At the current number of entries in builds.json (7708 builds), that works out to ~6.5KiB per build object.

If that ratio holds without change, we can fit ~15737 of these inside GitHub's 100MiB limit (that's slightly more than twice as many builds total as we currently have).

With a work-in-progress prototype of some future work that I have (which needs to store more data inside builds.json), I've apparently increased this to ~7.5KiB, which takes that "GitHub maximum" number down to ~13619 (still ~1.8x our current count).
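For anyone who wants to sanity-check the headroom math, here's the rough back-of-the-envelope version (the numbers above presumably differ slightly because they use the exact byte counts rather than rounded MiB/KiB figures):

```console
# ~49MiB across 7708 builds ≈ 6.5KiB per build
$ echo 'scale=1; (49 * 1024) / 7708' | bc
6.5
# 100MiB limit / ~6.5KiB per build ≈ 15.7k builds; / ~7.5KiB ≈ 13.6k builds
$ echo '(100 * 1024) / 6.5' | bc
15753
$ echo '(100 * 1024) / 7.5' | bc
13653
```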

zero indentation

If we were willing to sacrifice readability, we could strip all the leading whitespace and get down to 42.7MiB, but IMO that's a pretty hefty sacrifice (I frequently open the raw file and look at it, especially while debugging, but I can absolutely adjust my workflows if I need to).
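(One quick-and-dirty way to approximate that measurement, just stripping the leading tabs back off of the tab-indented output, would be something like:)

```console
# strip leading indentation from the tab-indented output, then count bytes
$ jq --tab . builds.json | sed 's/^\t*//' | wc -c
```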

non-Git-friendly presentations

To keep this file in Git, we do need to stick to a line-based presentation, and so we might as well stick to the formatted display, but just for completeness:

--compact-output

If we use --compact-output, we get all the way down to 40.8MiB.
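(Same measurement style as above, for completeness:)

```console
# no pretty-printing at all (one line per top-level value)
$ jq --compact-output . builds.json | wc -c
```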

"worse is better" ? (gzip)

If we get worse and gzip it (requiring even more work to do anything with it because jq can't handle a gzip'd JSON file directly), we can get down to 4.4MiB (tabs) or 4.2MiB (compact), but that's honestly an insane thought -- just giving the ol' gzip a shout-out because this is literally a use case it's really, really good at. 🫶
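For completeness, those gzip numbers are just the same outputs piped through gzip (at the default compression level); actually consuming the result then means an extra decompression step in front of every jq invocation, something like:

```console
# producing the compressed variants
$ jq --tab . builds.json | gzip > builds.json.gz
$ jq --compact-output . builds.json | gzip > builds-compact.json.gz

# consuming them requires decompressing first, since jq can't read gzip directly
$ zcat builds.json.gz | jq length
```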

adjusting other workflows / "breaking" changes

strip out sources.json duplication

The current structure of builds.json was designed so that for a given build, essentially all the information we need exists directly in the one "build" object (ie, classic NoSQL schema design - design your data to look like how you'll need to query it). This means the sources.json content (8.3MiB total, 7.1MiB if we apply the same --tab treatment to it) is all duplicated inside builds.json, and in many cases that content gets repeated several times because builds are intentionally exploded by architecture.

If we remove that source duplication and require loading the source data separately (--slurpfile sources sources.json, for example, assuming that file doesn't grow extremely large over time too), and change/flatten nothing else (del(.[].source)), that takes us to 38.7MiB. If we also then flatten the build objects (remove the then-unnecessary .build indirection; map_values(. * .build | del(.source, .build))), we get down to 37.9MiB.
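Concretely, those two experiments are roughly the following (assuming the tab-indented builds.json as input; the --slurpfile part only matters once a consumer actually needs the source data back, and "builds-stripped.json" is just an illustrative name for the de-duplicated file):

```console
# drop the embedded source data entirely
$ jq --tab 'del(.[].source)' builds.json | wc -c

# additionally flatten away the then-unnecessary .build indirection
$ jq --tab 'map_values(. * .build | del(.source, .build))' builds.json | wc -c

# consumers would then load sources.json alongside, e.g.
$ jq --slurpfile sources sources.json '...' builds-stripped.json
```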

(This result surprises me a little, because I thought we'd get more bang than 49 - 37.9 = 11.1MiB out of removing the duplication, but I guess the duplication is pretty slim; only ~156% of sources.json's 7.1MiB was actually in there.)

For comparison, that drops us to ~5KiB per build object (from ~6.5KiB above), and ~6.3KiB per build in my new PoC (down from ~7.5KiB), so this essentially gives us a little bit of space to play with for adding more data while still keeping 2x headroom over our current build count.

multiple files

A seemingly obvious answer (and actually how the first designs of this system were structured!) is to split builds into separate files. This is essentially trading the size issue for a discovery/traversal issue instead.

There are various ways we can mitigate the traversal issues, like maintaining symlink farms, but IMO the gains of doing all that work are pretty marginal.

In general, reading multiple files via jq is mostly easy enough, even if a little awkward (jq '...' *.json, --slurp, etc), but it does complicate any workflows that need to write data, because they can no longer just be jq '...' builds.json > some-new-file.json without potentially reintroducing this problem (if we solve it via multiple files).

For example, this means that ./cmd/builds (builds.sh) that's responsible for creating builds.json no longer simply has one canonical output, but now has to write individual JSON documents to separate files (or we have to have a shell script wrapper that does that, which is kind of heinous).
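To make that trade-off concrete, here's a rough sketch of what read/write workflows might look like in a hypothetical builds/<id>.json layout (the directory layout, file names, and keys-as-build-ids are purely illustrative, not a proposal):

```console
# reading across many files is mostly fine, if a little awkward:
$ jq '...' builds/*.json
$ jq --slurp '...' builds/*.json

# writing is the painful part: instead of `jq '...' builds.json > some-new-file.json`,
# something has to split the result back out into one file per build, e.g.
$ jq -r 'keys[]' builds.json | while read -r id; do
    jq --arg id "$id" '{ ($id): .[$id] }' builds.json > "builds/$id.json"
  done
```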

closing thoughts

After compiling all this data, I think my own opinion is that we should definitely swap to tabs, as I've noted above.

If we feel like that's not enough (for now?), I think we should consider dropping the sources.json data from builds.json and/or dropping more of that data if there are parts of it that aren't necessary or could be represented in a better way (but both of those are changes that require more careful coordination, because we might have other things consuming this data in various ways outside of this repository).
