
epic: builds.json is large (somewhat by design) #138

@tianon


At the current moment, our production builds.json file is sitting at 55.2MiB, which is kind of large for a JSON file (and large enough that GitHub complains about it on every push, although not large enough that they block it -- that limit is 100MiB). 🤏

So while this isn't an urgent problem, it's certainly something we should be looking at and thinking about how to scale further, especially as the system grows and contains more data. 👀

low hanging fruit (change almost nothing)

tabs (--tab)

If we do nothing but switch from (jq's default) two-space indent to tabs, we drop to 49MiB, which is a large enough drop that it's probably worth doing even though it's not a silver bullet. ➡️

(all further serious ideas will assume a baseline of swapping to tabs for size comparison, if keeping a monolithic file)
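For concreteness, the tab numbers above come from nothing more than re-serializing with jq's --tab flag; a minimal way to reproduce the measurement (assuming wc -c for byte counts) looks like:

```console
# default (two-space) indentation vs tab indentation; wc -c counts bytes
$ jq . builds.json | wc -c
$ jq --tab . builds.json | wc -c
```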

At the current number of entries in builds.json (7708 builds), that works out to ~6.5KiB per build object.

If that ratio holds without change, we can fit ~15737 of these inside GitHub's 100MiB limit (that's slightly more than twice as many builds total as we currently have).

With a work-in-progress prototype of some future work that I have (which needs to store more data inside builds.json), I've apparently increased this to ~7.5KiB, which takes that "GitHub maximum" number down to ~13619 (still ~1.8x our current count).
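For anyone who wants to sanity-check the headroom math, here's the rough back-of-the-envelope version (the numbers above presumably differ slightly because they use the exact byte counts rather than rounded MiB/KiB figures):

```console
# ~49MiB across 7708 builds ≈ 6.5KiB per build
$ echo 'scale=1; (49 * 1024) / 7708' | bc
6.5
# 100MiB limit / ~6.5KiB per build ≈ 15.7k builds; / ~7.5KiB ≈ 13.6k builds
$ echo '(100 * 1024) / 6.5' | bc
15753
$ echo '(100 * 1024) / 7.5' | bc
13653
```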

zero indentation

If we were willing to sacrifice readability, we could strip all the leading whitespace and get down to 42.7MiB, but IMO that's a pretty hefty sacrifice (I frequently open the raw file and look at it, especially while debugging, but I can absolutely adjust my workflows if I need to).
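(One quick-and-dirty way to approximate that measurement, just stripping the leading tabs back off of the tab-indented output, would be something like:)

```console
# strip leading indentation from the tab-indented output, then count bytes
$ jq --tab . builds.json | sed 's/^\t*//' | wc -c
```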

non-Git-friendly presentations

To keep this file in Git, we do need to stick to a line-based presentation, and so we might as well stick to the formatted display, but just for completeness:

--compact-output

If we use --compact-output, we get all the way down to 40.8MiB.
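(Same measurement style as above, for completeness:)

```console
# no pretty-printing at all (one line per top-level value)
$ jq --compact-output . builds.json | wc -c
```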

"worse is better" ? (gzip)

If we get worse and gzip it (requiring even more work to do anything with it because jq can't handle a gzip'd JSON file directly), we can get down to 4.4MiB (tabs) or 4.2MiB (compact), but that's honestly an insane thought -- just giving the ol' gzip a shout-out because this is literally a use case it's really, really good at. 🫶
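For completeness, those gzip numbers are just the same outputs piped through gzip (at the default compression level); actually consuming the result then means an extra decompression step in front of every jq invocation, something like:

```console
# producing the compressed variants
$ jq --tab . builds.json | gzip > builds.json.gz
$ jq --compact-output . builds.json | gzip > builds-compact.json.gz

# consuming them requires decompressing first, since jq can't read gzip directly
$ zcat builds.json.gz | jq length
```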

adjusting other workflows / "breaking" changes

strip out sources.json duplication

The current structure of builds.json was designed so that for a given build, essentially all the information we need exists directly in the one "build" object (ie, classic NoSQL schema design - design your data to look like how you'll need to query it). This means the sources.json content (8.3MiB total, 7.1MiB if we apply the same --tab treatment to it) is all duplicated inside builds.json, and in many cases that content gets repeated several times because builds are intentionally exploded by architecture.

If we remove that source duplication and require loading the source data separately (--slurpfile sources sources.json, for example, assuming that file doesn't grow extremely large over time too), and change/flatten nothing else (del(.[].source)), that takes us to 38.7MiB. If we also then flatten the build objects (remove the then-unnecessary .build indirection; map_values(. * .build | del(.source, .build))), we get down to 37.9MiB.
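Concretely, those two experiments are roughly the following (assuming the tab-indented builds.json as input; the --slurpfile part only matters once a consumer actually needs the source data back, and "builds-stripped.json" is just an illustrative name for the de-duplicated file):

```console
# drop the embedded source data entirely
$ jq --tab 'del(.[].source)' builds.json | wc -c

# additionally flatten away the then-unnecessary .build indirection
$ jq --tab 'map_values(. * .build | del(.source, .build))' builds.json | wc -c

# consumers would then load sources.json alongside, e.g.
$ jq --slurpfile sources sources.json '...' builds-stripped.json
```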

(This result surprises me a little, because I thought we'd get more bang than 49 - 37.9 = 11.1MiB out of removing the duplication, but I guess the duplication is pretty slim; only ~156% of sources.json's 7.1MiB was actually in there.)

For comparison, that drops us to ~5KiB per build object (from ~6.5KiB above), and ~6.3KiB per build in my new PoC (down from ~7.5KiB), so this essentially gives us a little bit of space to play with for adding more data while still keeping 2x headroom over our current build count.

multiple files

A seemingly obvious answer (and actually how the first designs of this system were structured!) is to split builds into separate files. This is essentially trading the size issue for a discovery/traversal issue instead.

There are various ways we can mitigate the traversal issues, like maintaining symlink farms, but IMO the gains of doing all that work are pretty marginal.

In general, reading multiple files via jq is mostly easy enough, even if a little awkward (jq '...' *.json, --slurp, etc), but it does complicate any workflows that need to write data, because they can no longer just be jq '...' builds.json > some-new-file.json without potentially reintroducing this problem (if we solve it via multiple files).

For example, this means that ./cmd/builds (builds.sh) that's responsible for creating builds.json no longer simply has one canonical output, but now has to write individual JSON documents to separate files (or we have to have a shell script wrapper that does that, which is kind of heinous).
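To make that trade-off concrete, here's a rough sketch of what read/write workflows might look like in a hypothetical builds/<id>.json layout (the directory layout, file names, and keys-as-build-ids are purely illustrative, not a proposal):

```console
# reading across many files is mostly fine, if a little awkward:
$ jq '...' builds/*.json
$ jq --slurp '...' builds/*.json

# writing is the painful part: instead of `jq '...' builds.json > some-new-file.json`,
# something has to split the result back out into one file per build, e.g.
$ jq -r 'keys[]' builds.json | while read -r id; do
    jq --arg id "$id" '{ ($id): .[$id] }' builds.json > "builds/$id.json"
  done
```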

closing thoughts

After compiling all this data, I think my own opinion is that we should definitely swap to tabs, as I've noted above.

If we feel like that's not enough (for now?), I think we should consider dropping the sources.json data from builds.json and/or dropping more of that data if there are parts of it that aren't necessary or could be represented in a better way (but both of those are changes that require more careful coordination, because we might have other things consuming this data in various ways outside of this repository).
