Skip to content

gen-manifest: optimize checksum calculation#2382

Open
lzap wants to merge 5 commits into
osbuild:mainfrom
lzap:faster-genmf1
Open

gen-manifest: optimize checksum calculation#2382
lzap wants to merge 5 commits into
osbuild:mainfrom
lzap:faster-genmf1

Conversation

@lzap
Copy link
Copy Markdown
Contributor

@lzap lzap commented May 28, 2026

Edit: The initial version was more complex, I rewrote the PR description to reflect a new version.

This patch series speeds up tools/gen-manifest-checksums.sh by hashing manifests inside cmd/gen-manifests instead of writing a full temp tree and running sha1sum per file. Checksum files are updated incrementally (no rm -rf on the output directory).

Problem statement

Manifest generation via tools/gen-manifest-checksums.sh was slow because gen-manifests wrote every mock manifest into a temporary directory, then the shell ran sha1sum once per file. The script also deleted the entire checksum directory with rm -rf on every run even when only a few digests changed.

Solution

With -checksums-only, workers marshal each manifest in memory, compute a digest, and write one hex line per image under -output when the content changes. A sync.Map named processed records what was written; after all jobs succeed, stale files in the output directory are removed. save(w, indent, …) is shared between writing JSON manifests and checksum input. The shell script only runs mkdir -p and invokes gen-manifests with the same mock flags as before.

Roughly 2× faster end-to-end on a warm rpmmd cache (about 10s → about 4s). Checksums use SHA-256 over compact JSON (indent: false).

Commits

gen-manifests: default workers to NumCPU+1

Use runtime.NumCPU()+1 as the default -workers value.

gen-manifests: refactor save() to write via io.Writer

Add w io.Writer and indent bool parameters to save(); open the output file at the callsite and pass the file writer with indent true.

gen-manifests: compute manifest SHA-1 checksums in Go

Add -checksums-only to hash manifest JSON via save() into a buffer, record digests in a sync.Map, write checksum files from workers, and delete stale files after all jobs succeed. Update gen-manifest-checksums.sh to use the flag and drop the temporary manifest tree and sha1sum loop.

Roughly 2× faster end-to-end (about 10s → about 4s).

tools: add test-manifest-checksums.sh stability checker

Run gen-manifests N times (default 10) into /tmp/test-manifest-checksums/<n>, compare each run to the previous as soon as it finishes, print per-run timing, and remove the tree when all outputs match. Uses the same mock flags as gen-manifest-checksums.sh but writes full manifest JSON files.

gen-manifests: use SHA-256 and remove indentation

Switch manifestChecksum from SHA-1 to SHA-256.

Also, flip the indentation flag for the checksum files. Indentation for checksumming is not needed.


I am not regenerating manifests for the last commit for easier GitHub UI review, will be added after reviews.

Copy link
Copy Markdown
Member

@thozza thozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the improvements.

The incremental sync machinery (loadExisting, syncToDisk, the existing map/mutex, and the .manifests marker file) feels like it's over-engineering the problem. Let me explain what I mean.

The current approach reads all ~2,200 existing checksum files into memory, compares them with the newly generated ones, skips writes for unchanged files, and conditionally deletes stale ones -- but only if the .manifests marker is present, otherwise it just warns.

I think we can replace all of that with a simple rm -rf "${checksums_dir}" && mkdir -p "${checksums_dir}" in gen-manifest-checksums.sh before calling gen-manifests, and then just unconditionally write all new checksums. Here's why:

  • The measured benefit is marginal. The commit messages show that the incremental sync saves ~0.8s (5.8s → 5.0s), and even that number conflates the sync logic with the removal of rm -rf. The real win in this series is computing checksums in Go instead of shelling out to sha1sum (10s → 5.8s) -- that's great and uncontroversial. But ~0.8s doesn't justify three commits of new code with two mutex-guarded maps, a marker file, and a directory diffing algorithm.
  • The .manifests marker guards against a scenario that doesn't really exist. The shell script hardcodes the output path -- nobody is accidentally pointing --output at /usr. And if they did, rm -rf would be equally dangerous, so the marker doesn't add real safety.
  • Failure atomicity isn't a real concern here. If manifest generation fails, we have a bug somewhere and CI should catch it through other checks. And locally, git checkout or git reset --hard trivially restores the checksums directory. We don't need code-level protection against a scenario that has a one-command recovery.
  • The git stat cache argument is theoretical. Yes, rewriting all files updates their mtime and forces git to re-hash them on the next git status. But we're talking about 2,200 tiny files (one hex string each) -- git re-hashing those is near-instant. I'd want to see a measurement showing this actually matters before adding complexity to optimize for it.
  • Stale file cleanup becomes trivial. Instead of diffing two maps and deleting orphans, you just start with a clean directory. Checksums that no longer exist simply don't get written.

This would let us drop loadExisting(), syncToDisk(), checksumMarkerPresent(), the existing map and its mutex, and the .manifests marker file entirely. The Checksums type would just need new and a simple writeAll() method. Significantly less code to maintain for the same end result.

Comment thread cmd/gen-manifests/checksums.go Outdated
return nil
}

func recordManifestChecksum(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could have been a Checksum type method.

@thozza thozza requested review from achilleas-k and supakeen May 28, 2026 18:47
@thozza
Copy link
Copy Markdown
Member

thozza commented May 28, 2026

I forgot to say that while the manifests didn't change functionally, they technically changed due to using the compact JSON. This means that you could have moved to sha256 from sha1 while at it...

lzap and others added 5 commits May 29, 2026 15:03
Use runtime.NumCPU()+1 as the default -workers value.

Co-authored-by: Cursor <cursoragent@cursor.com>
Run gen-manifests N times (default 10) into /tmp/test-manifest-checksums/<n>,
compare each run to the previous as soon as it finishes, print per-run timing,
and remove the tree when all outputs match. Uses the same mock flags as
gen-manifest-checksums.sh but writes full manifest JSON files.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add w io.Writer and indent bool parameters to save(); open the output
file at the callsite and pass the file writer with indent true.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add -checksums-only to hash manifest JSON via save() into a buffer,
record digests in a sync.Map, write checksum files from workers, and
delete stale files after all jobs succeed. Update
gen-manifest-checksums.sh to use the flag and drop the temporary
manifest tree and sha1sum loop.

Roughly 2× faster end-to-end (about 10s → about 4s).
Switch manifestChecksum from SHA-1 to SHA-256.

Also, flip the indentation flag for the checksum files. Indentation for
checksuming is not needed.
@lzap lzap force-pushed the faster-genmf1 branch from 695e619 to 51a9979 Compare May 29, 2026 13:03
@lzap
Copy link
Copy Markdown
Contributor Author

lzap commented May 29, 2026

Vastly simplified the solution, I agree with your points but I would really like get rid of rm -rf ... I often do rebase of many commits and I simply do not want thousand and thousands of files being removed and written over and over again for no reason. I really focused on the simplicity of the solution - it just overwrites existing files, marks them in a synced map and deletes leftovers, no goroutines, no mutexes (technically there has to be one or a channel) and no complex logic (no markers).

I edited the PR description, but essentially is just summary of commit messages.

Copy link
Copy Markdown
Contributor

@avitova avitova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why we are using SHA-256 in the last commit all of a sudden? Could you explain and maybe also add that information to the comment?

Comment thread cmd/gen-manifests/main.go
var metadata, skipNoconfig, skipNorepos, buildconfigAllowUnknown bool
flag.StringVar(&outputDir, "output", "test/data/manifests/", "manifest store directory")
flag.IntVar(&nWorkers, "workers", 16, "number of workers to run concurrently")
flag.IntVar(&nWorkers, "workers", runtime.NumCPU()+1, "number of workers to run concurrently")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the +1 here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants