MILAB-6319: opt-in deterministic report input paths (dedup efficiency) #2118

Draft
PaulNewling wants to merge 1 commit into develop from MILAB-6319_mixcr-deterministic-input-paths

PaulNewling commented May 29, 2026

What

Opt-in env var MI_REPORT_RELATIVE_INPUT_FILES. When true, the align command records input file basenames instead of absolute paths, so a .clns produced from identical input is byte-identical regardless of the working directory it ran in. Off by default (current absolute-path behavior preserved).

Why — deduplication efficiency

MiXCR records absolute input paths in the report embedded in .vdjca / .clns. Those paths include the per-invocation working directory, so the same analysis produces byte-different output files across runs / machines. For content-addressed caching (e.g. Platforma), a re-produced .clns then fails to dedup against an otherwise identical earlier one: it is re-uploaded / re-processed. Recording basenames makes the bytes working-directory-independent, so identical clone data dedups.
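To illustrate the effect on a content-addressed cache, here is a toy sketch. It assumes the cache keys on a SHA-256 of the file bytes and uses a made-up one-line JSON header as a stand-in for the embedded report. Neither the hash choice nor the header layout comes from MiXCR or Platforma; `DedupDemo`, `fakeOutput` and the paths are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Toy illustration only: not MiXCR's file format and not Platforma's hashing.
final class DedupDemo {
    private DedupDemo() {}

    // Hex-encoded SHA-256 of the UTF-8 bytes of `content`.
    static String sha256Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical stand-in for an output file: a report header with the
    // recorded input path, followed by the clone data.
    static String fakeOutput(String recordedInput, String cloneData) {
        return "{\"inputFiles\":[\"" + recordedInput + "\"]}\n" + cloneData;
    }

    public static void main(String[] args) {
        String clones = "CLONE-DATA";
        // Absolute paths: the working directory leaks into the bytes.
        System.out.println(sha256Hex(fakeOutput("/work/run-1/s_R1.fastq.gz", clones)));
        System.out.println(sha256Hex(fakeOutput("/work/run-2/s_R1.fastq.gz", clones)));
        // Basenames: both runs serialize to the same bytes.
        System.out.println(sha256Hex(fakeOutput("s_R1.fastq.gz", clones)));
        System.out.println(sha256Hex(fakeOutput("s_R1.fastq.gz", clones)));
    }
}
```

The first two digests differ even though the clone data is the same; the last two are equal, which is the property a content-addressed store needs in order to dedup.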

What this is NOT

This is a dedup-efficiency improvement, not a fix for the Platforma CIDConflictError originally chased under MILAB-6319. Investigation showed that conflict's root is non-deterministic json.encode map ordering inside the Platforma SDK, fixed separately on the Platforma side. This MiXCR change is complementary: it helps genuine cache-miss / cross-machine re-runs dedup; it is not required to resolve the conflict.

Change

One call site: CommandAlign, where the report's input files are set. File reading is unaffected; only the recorded report value changes. The same wrap could be applied to the other command report builders (assemble, …) for full coverage.
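A minimal sketch of what such a wrap might look like, for readers without the diff in front of them. This is not the code in this PR: `ReportInputPaths`, `enabledFromEnv` and `forReport` are hypothetical names, and the assumption that the variable is parsed as a case-insensitive "true" is mine. Only the env var name and the basename-vs-absolute behavior come from the description above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not the actual MiXCR implementation.
final class ReportInputPaths {
    static final String ENV_VAR = "MI_REPORT_RELATIVE_INPUT_FILES";

    private ReportInputPaths() {}

    // Assumption: anything other than "true" (case-insensitive), including an
    // unset variable, keeps the default absolute-path behavior.
    static boolean enabledFromEnv() {
        return Boolean.parseBoolean(System.getenv(ENV_VAR));
    }

    // Returns the strings to record in the report. The paths used for reading
    // the files are untouched; only the recorded value changes.
    static List<String> forReport(List<Path> inputs, boolean relative) {
        return inputs.stream()
                .map(p -> {
                    if (!relative) {
                        return p.toAbsolutePath().toString();
                    }
                    Path name = p.getFileName(); // null for a root path
                    return name == null ? p.toString() : name.toString();
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Path> inputs = List.of(
                Paths.get("/work/run-1/sample_R1.fastq.gz"),
                Paths.get("/work/run-1/sample_R2.fastq.gz"));
        System.out.println(forReport(inputs, enabledFromEnv()));
    }
}
```

One trade-off of recording basenames: two inputs from different directories that share a file name become indistinguishable in the report, which is a reason to keep the behavior opt-in.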

Verification

Not built/tested locally (no private Maven creds in my environment) — relying on CI to compile. To confirm: run mixcr analyze <preset> R1 R2 out twice in different working directories with MI_REPORT_RELATIVE_INPUT_FILES=true; the two out.clns should be byte-identical.
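The byte-identity check above can be done with any file comparison tool. As a sketch, a small standalone Java helper (Java 12+ for `Files.mismatch`) is shown here. The class name `ByteIdentical` and the example paths in the usage note are hypothetical; nothing here is part of MiXCR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical verification helper, not part of MiXCR.
final class ByteIdentical {
    private ByteIdentical() {}

    // True when both files have exactly the same bytes.
    // Files.mismatch returns -1 when no differing position exists.
    static boolean sameBytes(Path a, Path b) {
        try {
            return Files.mismatch(a, b) == -1L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java ByteIdentical.java <file1> <file2>");
            System.exit(2);
        }
        boolean same = sameBytes(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(same ? "byte-identical" : "files differ");
        System.exit(same ? 0 : 1);
    }
}
```

For example, after the two runs, something like `java ByteIdentical.java run1/out.clns run2/out.clns` (directory names are placeholders) should report byte-identical output when the variable is set to true, and would be expected to report a difference when it is unset.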

Context

Part of Platforma mixcr-amplicon-alignment CID work (MILAB-6319).

The align report can record input file names instead of absolute paths, so output bytes such as .clns no longer depend on the working directory. Off by default.

PaulNewling changed the title from "MILAB-6319: opt-in env var for deterministic report input paths" to "MILAB-6319: opt-in deterministic report input paths (dedup efficiency)" on May 30, 2026