gen-manifest: optimize checksum calculation#2382
Conversation
thozza
left a comment
There was a problem hiding this comment.
Thank you for the improvements.
The incremental sync machinery (loadExisting, syncToDisk, the existing map/mutex, and the .manifests marker file) feels like it's over-engineering the problem. Let me explain what I mean.
The current approach reads all ~2,200 existing checksum files into memory, compares them with the newly generated ones, skips writes for unchanged files, and conditionally deletes stale ones -- but only if the .manifests marker is present, otherwise it just warns.
I think we can replace all of that with a simple rm -rf "${checksums_dir}" && mkdir -p "${checksums_dir}" in gen-manifest-checksums.sh before calling gen-manifests, and then just unconditionally write all new checksums. Here's why:
- The measured benefit is marginal. The commit messages show that the incremental sync saves ~0.8s (5.8s → 5.0s), and even that number conflates the sync logic with the removal of
rm -rf. The real win in this series is computing checksums in Go instead of shelling out tosha1sum(10s → 5.8s) -- that's great and uncontroversial. But ~0.8s doesn't justify three commits of new code with two mutex-guarded maps, a marker file, and a directory diffing algorithm. - The
.manifestsmarker guards against a scenario that doesn't really exist. The shell script hardcodes the output path -- nobody is accidentally pointing--outputat/usr. And if they did,rm -rfwould be equally dangerous, so the marker doesn't add real safety. - Failure atomicity isn't a real concern here. If manifest generation fails, we have a bug somewhere and CI should catch it through other checks. And locally,
git checkoutorgit reset --hardtrivially restores the checksums directory. We don't need code-level protection against a scenario that has a one-command recovery. - The git stat cache argument is theoretical. Yes, rewriting all files updates their mtime and forces git to re-hash them on the next
git status. But we're talking about 2,200 tiny files (one hex string each) -- git re-hashing those is near-instant. I'd want to see a measurement showing this actually matters before adding complexity to optimize for it. - Stale file cleanup becomes trivial. Instead of diffing two maps and deleting orphans, you just start with a clean directory. Checksums that no longer exist simply don't get written.
This would let us drop loadExisting(), syncToDisk(), checksumMarkerPresent(), the existing map and its mutex, and the .manifests marker file entirely. The Checksums type would just need new and a simple writeAll() method. Significantly less code to maintain for the same end result.
| return nil | ||
| } | ||
|
|
||
| func recordManifestChecksum( |
There was a problem hiding this comment.
This could have been a Checksum type method.
|
I forgot to say that while the manifests didn't change functionally, they technically changed due to using the compact JSON. This means that you could have moved to sha256 from sha1 while at it... |
Use runtime.NumCPU()+1 as the default -workers value. Co-authored-by: Cursor <cursoragent@cursor.com>
Run gen-manifests N times (default 10) into /tmp/test-manifest-checksums/<n>, compare each run to the previous as soon as it finishes, print per-run timing, and remove the tree when all outputs match. Uses the same mock flags as gen-manifest-checksums.sh but writes full manifest JSON files. Co-authored-by: Cursor <cursoragent@cursor.com>
Add w io.Writer and indent bool parameters to save(); open the output file at the callsite and pass the file writer with indent true. Co-authored-by: Cursor <cursoragent@cursor.com>
Add -checksums-only to hash manifest JSON via save() into a buffer, record digests in a sync.Map, write checksum files from workers, and delete stale files after all jobs succeed. Update gen-manifest-checksums.sh to use the flag and drop the temporary manifest tree and sha1sum loop. Roughly 2× faster end-to-end (about 10s → about 4s).
Switch manifestChecksum from SHA-1 to SHA-256. Also, flip the indentation flag for the checksum files. Indentation for checksuming is not needed.
|
Vastly simplified the solution, I agree with your points but I would really like get rid of I edited the PR description, but essentially is just summary of commit messages. |
avitova
left a comment
There was a problem hiding this comment.
I am not sure why we are using SHA-256 in the last commit all of a sudden? Could you explain and maybe also add that information to the comment?
| var metadata, skipNoconfig, skipNorepos, buildconfigAllowUnknown bool | ||
| flag.StringVar(&outputDir, "output", "test/data/manifests/", "manifest store directory") | ||
| flag.IntVar(&nWorkers, "workers", 16, "number of workers to run concurrently") | ||
| flag.IntVar(&nWorkers, "workers", runtime.NumCPU()+1, "number of workers to run concurrently") |
Edit: The initial version was more complex, I rewrote the PR description to reflect a new version.
This patch series speeds up
tools/gen-manifest-checksums.shby hashing manifests insidecmd/gen-manifestsinstead of writing a full temp tree and runningsha1sumper file. Checksum files are updated incrementally (norm -rfon the output directory).Problem statement
Manifest generation via
tools/gen-manifest-checksums.shwas slow becausegen-manifestswrote every mock manifest into a temporary directory, then the shell ransha1sumonce per file. The script also deleted the entire checksum directory withrm -rfon every run even when only a few digests changed.Solution
With
-checksums-only, workers marshal each manifest in memory, compute a digest, and write one hex line per image under-outputwhen the content changes. Async.Mapnamedprocessedrecords what was written; after all jobs succeed, stale files in the output directory are removed.save(w, indent, …)is shared between writing JSON manifests and checksum input. The shell script only runsmkdir -pand invokesgen-manifestswith the same mock flags as before.Roughly 2× faster end-to-end on a warm rpmmd cache (about 10s → about 4s). Checksums use SHA-256 over compact JSON (
indent: false).Commits
gen-manifests: default workers to NumCPU+1
Use
runtime.NumCPU()+1as the default-workersvalue.gen-manifests: refactor save() to write via io.Writer
Add
w io.Writerandindent boolparameters tosave(); open the output file at the callsite and pass the file writer withindenttrue.gen-manifests: compute manifest SHA-1 checksums in Go
Add
-checksums-onlyto hash manifest JSON viasave()into a buffer, record digests in async.Map, write checksum files from workers, and delete stale files after all jobs succeed. Updategen-manifest-checksums.shto use the flag and drop the temporary manifest tree andsha1sumloop.Roughly 2× faster end-to-end (about 10s → about 4s).
tools: add test-manifest-checksums.sh stability checker
Run
gen-manifestsN times (default 10) into/tmp/test-manifest-checksums/<n>, compare each run to the previous as soon as it finishes, print per-run timing, and remove the tree when all outputs match. Uses the same mock flags asgen-manifest-checksums.shbut writes full manifest JSON files.gen-manifests: use SHA-256 and remove indentation
Switch
manifestChecksumfrom SHA-1 to SHA-256.Also, flip the indentation flag for the checksum files. Indentation for checksumming is not needed.
I am not regenerating manifests for the last commit for easier GitHub UI review, will be added after reviews.