Clarify Output Mounting Semantics in TES to Avoid Runtime Collisions

TL;DR
TES doesn’t define how to safely mount input/output paths, which can cause file overrides in Kubernetes (e.g., mounting /data can break containers).

---

TES defines **inputs** and **outputs**, which indicate the *what* — i.e., which files are required or produced by the task. However, I believe the *how* — especially around file path handling and mounting semantics — is underspecified, particularly in Kubernetes environments. Happy to be proven wrong if I’ve missed something.

### Context: Kubernetes & File Mounting

In Kubernetes, mounting files for TES into a container is more nuanced than it seems. Suppose we want to mount `/data` in a pod, but the container image already has essential runtime files at `/data`. The volume mount then shadows the entire directory: the image's files under `/data` are hidden for the lifetime of the container, which can break the executor logic and cause the task to fail.

TES allows arbitrary paths for inputs and outputs — nothing prevents users from declaring something like `/root/.bashrc` as an output. But to extract that file post-execution, the entire `/` would have to be mounted, which Kubernetes' security model disallows. Interestingly, input files *can* be mounted at the root (e.g., an input `tes.config` at `/tes.config`) using `mountPath` and `subPath`.
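To illustrate the `mountPath` + `subPath` case for a single input file, here is a minimal sketch in Python that builds one `volumeMounts` entry as a plain dict. The helper name `input_file_mount`, the volume name `tes-inputs`, and the staged location `task-1/tes.config` are all hypothetical; only the field names `name`, `mountPath`, `subPath`, and `readOnly` come from the Kubernetes container spec.

```python
from pathlib import PurePosixPath


def input_file_mount(volume_name, container_path, staged_path):
    """Build one volumeMounts entry that exposes a single staged file.

    volume_name    -- name of a volume declared in the pod spec (assumed)
    container_path -- where the task expects the file, e.g. "/tes.config"
    staged_path    -- location of the file inside the volume (assumed layout)
    """
    return {
        "name": volume_name,
        # Mounting at the file's own path leaves the rest of the
        # parent directory (here "/") untouched.
        "mountPath": str(PurePosixPath(container_path)),
        # subPath has to be relative to the volume root.
        "subPath": staged_path.lstrip("/"),
        "readOnly": True,
    }


mount = input_file_mount("tes-inputs", "/tes.config", "task-1/tes.config")
print(mount)
```

This works for inputs because the file already exists in the volume before the container starts. An output file does not exist yet, which is one reason the same trick does not carry over directly.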

The key insight is that **inputs and outputs behave very differently**, especially regarding mount collisions and overrides. TES currently treats them symmetrically in the spec, but their implementation requires different handling.

### Mounting Complexity in Practice

When mounting paths for `inputs`, `outputs`, and `volumes`, collisions and overrides must be carefully managed. Here's what I've noticed:

* **Volumes and inputs** are generally safe to mount unless they collide with essential image directories.
* **Outputs** are riskier — if not handled carefully, they can override critical paths in the container filesystem.

#### Mount Strategy

I’ve experimented with several strategies. Here's what worked best:

1. **Naive approach**
   Mount the top-level parent directory for every declared path. This is too coarse and risks overriding system files.

2. **Improved approach**
   A more refined, collision-safe strategy:
   1. **Mount volumes first**
      * We begin by adding all volume paths to a **trie** structure. These are usually shared directories, and we create mount paths for them first.
   2. **Handle outputs next**
      * We insert all output file paths into the trie and walk it to determine the minimal set of non-overlapping parent directories to mount.
      * **Example**:
        * Output 1: `/output/raw/files/imp_file.txt`
        * Output 2: `/data/raw/configs/tes.config`
        * These paths don’t conflict and can be mounted separately.
        * If another output requests `/data/raw/raw.bam`, we collapse the mount to `/data/raw/` to avoid collisions.
   3. **Then handle inputs**
      * We insert input file paths into the trie, checking for overlaps with existing mounts (volumes and outputs).
      * For any paths not already covered, we add new mounts.
      * If individual input files remain unmounted (e.g., files within an already-mounted directory), we mount them directly using `subPath`.

This approach results in a clean, minimal, and conflict-free list of `mountPaths` that avoids overwriting runtime files.
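
The output-collapsing step and the input overlap check can be sketched roughly as below. This is not the implementation described above: it swaps the trie for a simpler sorted pass over parent directories, which gives the same result for the example paths. The function names `minimal_mount_dirs` and `uncovered_inputs` are hypothetical, and the input path `/data/raw/ref.fa` is made up for illustration.

```python
from pathlib import PurePosixPath


def _covered(path, mounts):
    """True if path is one of the mounts or lies underneath one."""
    return any(m == path or m in path.parents for m in mounts)


def minimal_mount_dirs(file_paths):
    """Return the smallest set of non-overlapping parent directories."""
    # Visit shallow directories first so an ancestor is always chosen
    # before any of its descendants are considered.
    parents = sorted(
        {PurePosixPath(p).parent for p in file_paths},
        key=lambda d: len(d.parts),
    )
    chosen = []
    for d in parents:
        if not _covered(d, chosen):
            chosen.append(d)
    return sorted(str(d) for d in chosen)


def uncovered_inputs(input_paths, mount_dirs):
    """Return the inputs that no existing directory mount covers."""
    mounts = [PurePosixPath(m) for m in mount_dirs]
    return [p for p in input_paths if not _covered(PurePosixPath(p), mounts)]


outputs = [
    "/output/raw/files/imp_file.txt",
    "/data/raw/configs/tes.config",
    "/data/raw/raw.bam",
]
mounts = minimal_mount_dirs(outputs)
print(mounts)  # ['/data/raw', '/output/raw/files']
print(uncovered_inputs(["/data/raw/ref.fa", "/tes.config"], mounts))  # ['/tes.config']
```

Note that a root-level output such as `/tes.config` would make this sketch choose `/` as a mount directory, which is exactly the case the proposal below asks the spec to forbid.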

### Suggestion

Ideally, executors should operate in isolated working directories — like a dedicated `workdir` — where all files are written and collected from. Volumes can then be used explicitly for inter-executor file sharing. This simplifies mounting logic and avoids unintended file system conflicts.

### Impact

For consumers who treat TES as a black box and assume mounting “just works,” these subtle path-based issues can result in broken containers or data loss — without clear feedback. These behaviors should be explicitly documented.

### Proposal

Update the specification to clarify path-mounting behavior, especially:

* **Must**: An output path must never point to a file directly under the root (`/`), since collecting it would require mounting `/` itself.
* **Recommendation**: Outputs should be scoped to a `workdir` (e.g., `/tes/output`) unless explicitly overridden.
* **Warning**: Implementations must ensure output paths don’t collide with image directories or input/volume paths.
* **Note**: Inputs and outputs should not be treated symmetrically from a mounting perspective in implementations.
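
As a rough idea of how an implementation might enforce the first two points, here is a small Python sketch. The function name `check_output_path` and the message wording are hypothetical, and `/tes/output` is just the example `workdir` from the list above, not something the spec defines today.

```python
from pathlib import PurePosixPath


def check_output_path(path, workdir="/tes/output"):
    """Return a list of problems found for one declared output path."""
    p = PurePosixPath(path)
    problems = []
    # "Must": a file directly under "/" could only be collected
    # by mounting the root itself.
    if p.parent == PurePosixPath("/"):
        problems.append("error: output is a root-level file")
    # "Recommendation": outputs should live somewhere under the workdir.
    if PurePosixPath(workdir) not in p.parents:
        problems.append("warning: output is outside " + workdir)
    return problems


for candidate in ["/tes/output/result.txt", "/root/.bashrc", "/tes.config"]:
    print(candidate, check_output_path(candidate))
```

A check like this could run when the task is submitted, so users get feedback up front instead of a broken container later.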

Would love to hear thoughts from other implementers, especially those using TES in Kubernetes contexts, on how they manage path mounting safely. Maybe I'm missing something.

