Description
TL;DR
TES doesn’t define how to safely mount input/output paths, which can cause file overrides in Kubernetes (e.g., mounting /data can break containers).
TES defines inputs and outputs, which indicate the what — i.e., which files are required or produced by the task. However, I believe the how — especially around file path handling and mounting semantics — is underspecified, particularly in Kubernetes environments. Happy to be proven wrong if I’ve missed something.
Context: Kubernetes & File Mounting
In Kubernetes, mounting files for TES into a container is more nuanced than it seems. Suppose we want to mount /data in a pod, but the container image already has essential runtime files at /data. In that case, the mount shadows the entire directory, hiding those runtime files and potentially breaking the executor logic, which leads to task failure.
TES allows arbitrary paths for inputs and outputs; nothing prevents users from declaring something like /root/.bashrc as an output. But to extract that file post-execution, the entire / would have to be mounted, which Kubernetes' security model disallows. Interestingly, input files can be mounted at the root (e.g., an input tes.config at /tes.config) using mountPath and subPath.
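To make the input case concrete, here is a minimal sketch of the volumeMount entry such a single-file mount could use. The fields name, mountPath, subPath, and readOnly are standard Kubernetes volumeMount fields; the helper name single_file_mount, the volume name tes-inputs, and the read-only choice are assumptions made for illustration, not something TES or any TES implementation prescribes.

```python
def single_file_mount(volume_name: str, container_path: str, path_in_volume: str) -> dict:
    """Build a Kubernetes volumeMount entry that exposes one file from a volume."""
    return {
        "name": volume_name,          # must match a volume declared in the pod spec
        "mountPath": container_path,  # where the file appears inside the container
        "subPath": path_in_volume,    # path of the file inside the volume
        "readOnly": True,             # assumption: inputs are never written back
    }


# Hypothetical input: tes.config, stored at the top of the "tes-inputs" volume.
mount = single_file_mount("tes-inputs", "/tes.config", "tes.config")
print(mount)
```

Because subPath mounts only that one entry, the rest of / in the container is left untouched, which is why this works for inputs but does not help with collecting outputs that are created during execution.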
The key insight is that inputs and outputs behave very differently, especially regarding mount collisions and overrides. TES currently treats them symmetrically in the spec, but their implementation requires different handling.
Mounting Complexity in Practice
When mounting paths for inputs, outputs, and volumes, collisions and overrides must be carefully managed. Here's what I've noticed:
- Volumes and inputs are generally safe to mount unless they collide with essential image directories.
- Outputs are riskier — if not handled carefully, they can override critical paths in the container filesystem.
Mount Strategy
I’ve experimented with several strategies. Here's what worked best:
- Naive approach
  Mount the top-level parent directory for every declared path. This is too coarse and risks overriding system files.
- Improved approach
  A more refined, collision-safe strategy:
  - Mount volumes first
    - We begin by adding all volume paths to a trie structure. These are usually shared directories, and we create mount paths for them first.
  - Handle outputs next
    - We insert all output file paths into the trie and walk it to determine the minimal set of non-overlapping parent directories to mount.
    - Example:
      - Output 1: /output/raw/files/imp_file.txt
      - Output 2: /data/raw/configs/tes.config
      - These paths don’t conflict and can be mounted separately.
      - If another output requests /data/raw/raw.bam, we collapse the mount to /data/raw/ to avoid collisions.
  - Then handle inputs
    - We insert input file paths into the trie, checking for overlaps with existing mounts (volumes and outputs).
    - For any paths not already covered, we add new mounts.
    - If individual input files remain unmounted (e.g., files within an already-mounted directory), we mount them directly using subPath.
  This approach results in a clean, minimal, and conflict-free list of mountPaths that avoids overwriting runtime files.
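The output step of the improved approach can be sketched roughly as follows. This is an illustration rather than the actual implementation: the names Node and minimal_output_mounts are hypothetical, the sketch covers outputs only (volumes and inputs are left out), and it assumes absolute POSIX paths. It rejects root-level outputs because covering them would mean mounting / itself.

```python
from pathlib import PurePosixPath


class Node:
    """One path component in the trie."""

    def __init__(self):
        self.children = {}
        self.is_mount = False  # True if an output file lives directly in this directory


def minimal_output_mounts(output_files):
    """Return the shallowest non-overlapping directories covering every output file."""
    root = Node()
    for raw in output_files:
        path = PurePosixPath(raw)
        if not path.is_absolute():
            raise ValueError(f"expected an absolute path: {raw}")
        parent = path.parent
        if parent == PurePosixPath("/"):
            # Covering this file would require mounting "/" itself.
            raise ValueError(f"root-level output cannot be covered: {raw}")
        node = root
        for part in parent.parts[1:]:  # parts[0] is the leading "/"
            node = node.children.setdefault(part, Node())
        node.is_mount = True

    def walk(node, prefix):
        if node.is_mount:
            return [prefix]  # this directory already covers everything below it
        mounts = []
        for name in sorted(node.children):
            mounts.extend(walk(node.children[name], f"{prefix}/{name}"))
        return mounts

    return walk(root, "")


outputs = ["/output/raw/files/imp_file.txt", "/data/raw/configs/tes.config"]
print(minimal_output_mounts(outputs))
# ['/data/raw/configs', '/output/raw/files']

print(minimal_output_mounts(outputs + ["/data/raw/raw.bam"]))
# ['/data/raw', '/output/raw/files']
```

The walk stops at the first directory that directly holds an output, so a deeper mount such as /data/raw/configs is absorbed once /data/raw is needed, which matches the collapse described in the example.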
Suggestion
Ideally, executors should operate in isolated working directories, such as a dedicated workdir, where all files are written and collected from. Volumes can then be used explicitly for inter-executor file sharing. This simplifies mounting logic and avoids unintended file system conflicts.
Impact
For consumers who treat TES as a black box and assume mounting “just works,” these subtle path-based issues can result in broken containers or data loss — without clear feedback. These behaviors should be explicitly documented.
Proposal
Update the specification to clarify path-mounting behavior, especially:
- Must: Outputs must never include a root-level file.
- Recommendation: Outputs should be scoped to a workdir (e.g., /tes/output) unless explicitly overridden.
- Warning: Implementations must ensure output paths don’t collide with image directories or input/volume paths.
- Note: Inputs and outputs should not be treated symmetrically from a mounting perspective in implementations.
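As a rough illustration of how an implementation might check the first three points, here is a small validation sketch. Everything in it is hypothetical: the function name check_outputs, the finding labels, and the default workdir /tes/output (taken from the example above). It also assumes a narrow definition of "collision": the directory that would be mounted for an output equals or contains an input or volume path. Image directories are not checked, since they cannot be derived from the task definition alone.

```python
from pathlib import PurePosixPath


def check_outputs(outputs, inputs=(), volumes=(), workdir="/tes/output"):
    """Return (level, message) findings for the declared output paths."""
    findings = []
    root = PurePosixPath("/")
    work = PurePosixPath(workdir)
    reserved = [PurePosixPath(p) for p in (*inputs, *volumes)]

    for raw in outputs:
        out = PurePosixPath(raw)
        out_dir = out.parent  # the directory a mount for this output would cover

        if out_dir == root:
            # "Must": collecting this file would require mounting "/".
            findings.append(("error", f"{raw}: root-level output"))
            continue

        if work not in out.parents:
            # "Recommendation": outputs should live under the workdir.
            findings.append(("warning", f"{raw}: outside workdir {workdir}"))

        for other in reserved:
            # Assumed collision rule: mounting out_dir would shadow `other`.
            if other == out_dir or out_dir in other.parents:
                findings.append(("collision", f"{raw}: mount of {out_dir} covers {other}"))

    return findings


findings = check_outputs(
    ["/result.txt", "/tes/output/a.txt", "/data/raw/out.bam"],
    inputs=["/data/raw/in.bam"],
)
for level, message in findings:
    print(level, message)
```

Whether a collision should be a hard error or only a warning is a choice the spec would need to make; the sketch only reports it.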
Would love to hear thoughts from other implementers, especially those using TES in Kubernetes contexts, on how they manage path mounting safely. Maybe I'm missing something.