MILAB-6319: opt-in deterministic report input paths (dedup efficiency) #2118
Draft
PaulNewling wants to merge 1 commit into
The align report can record input file names instead of absolute paths, so output bytes such as .clns no longer depend on the working directory. Off by default.
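As a rough illustration of the behavior summarized above, the sketch below models the report's recorded input names in Python. It is not MiXCR's code (the actual change lives in the align command); the `inputFiles` key, the helper names, the example paths, and the rule that only the exact string `true` enables the option are all assumptions made for the example. Only the env var name comes from this PR.

```python
import hashlib
import json
import os.path

ENV_VAR = "MI_REPORT_RELATIVE_INPUT_FILES"  # name taken from this PR


def recorded_input_files(paths, workdir, env):
    """Return the input file names as a report might store them.

    Assumption: only the exact string "true" turns the opt-in behavior on.
    """
    if env.get(ENV_VAR) == "true":
        # Opt-in: keep only the file name, dropping any directory part.
        return [os.path.basename(p) for p in paths]
    # Default: resolve each path against the working directory.
    return [os.path.normpath(os.path.join(workdir, p)) for p in paths]


def report_digest(paths, workdir, env):
    """Hash a toy report; stands in for the bytes embedded in the output."""
    # "inputFiles" is a hypothetical key, not the real report schema.
    report = {"inputFiles": recorded_input_files(paths, workdir, env)}
    payload = json.dumps(report, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


inputs = ["R1.fastq.gz", "R2.fastq.gz"]
off = {}
on = {ENV_VAR: "true"}

# Same inputs, two different working directories.
default_differs = (
    report_digest(inputs, "/work/run-a", off)
    != report_digest(inputs, "/work/run-b", off)
)
opt_in_matches = (
    report_digest(inputs, "/work/run-a", on)
    == report_digest(inputs, "/work/run-b", on)
)
print(default_differs, opt_in_matches)  # True True
```

With the variable unset, the working directory leaks into the hashed bytes, so the two digests differ. With it set, only the basenames are hashed and the digests match.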
What
Opt-in env var MI_REPORT_RELATIVE_INPUT_FILES. When true, the align command records input file basenames instead of absolute paths, so a .clns produced from identical input is byte-identical regardless of the working directory it ran in. Off by default (current absolute-path behavior preserved).
Why: deduplication efficiency
MiXCR records absolute input paths in the report embedded in .vdjca/.clns. Those paths include the per-invocation working directory, so the same analysis produces byte-different output files across runs / machines. For content-addressed caching (e.g. Platforma), a re-produced .clns then fails to dedup against an identical earlier one, so it is re-uploaded / re-processed. Recording basenames makes the bytes working-directory-independent, so identical clone data dedups.
What this is NOT
This is a dedup-efficiency improvement, not a fix for the Platforma CIDConflictError originally chased under MILAB-6319. Investigation showed that conflict's root is non-deterministic json.encode map ordering inside the Platforma SDK, fixed separately on the Platforma side. This MiXCR change is complementary: it helps genuine cache-miss / cross-machine re-runs dedup; it is not required to resolve the conflict.
Change
One call site, CommandAlign (where the report's input files are set). File reading is unaffected; only the recorded report value changes. The same wrap applies to the other command report builders (assemble, …) for full coverage.
Verification
Not built/tested locally (no private Maven creds in my environment); relying on CI to compile. To confirm: run mixcr analyze <preset> R1 R2 out twice in different working directories with MI_REPORT_RELATIVE_INPUT_FILES=true; the two out.clns should be byte-identical.
Context
Part of Platforma mixcr-amplicon-alignment CID work (MILAB-6319).
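The byte-identity check described under Verification could be scripted along these lines. This is a minimal sketch, not part of the PR: the helper names are hypothetical, and the demo uses stand-in temporary files instead of real MiXCR output, since nothing here was run against an actual build.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def byte_identical(first, second):
    """True when both files hash to the same SHA-256 digest."""
    return sha256_of(first) == sha256_of(second)


# Demo with stand-in files; in a real check, pass the out.clns
# written by each of the two runs instead.
with tempfile.TemporaryDirectory() as tmp:
    first_out = Path(tmp) / "run-a-out.clns"
    second_out = Path(tmp) / "run-b-out.clns"
    first_out.write_bytes(b"same clone data")
    second_out.write_bytes(b"same clone data")
    identical = byte_identical(first_out, second_out)

print(identical)  # True
```

Comparing digests mirrors how a content-addressed cache would see the two files, which is the property this change is meant to restore.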