High-Level Design Document
DistFS is a distributed, end-to-end encrypted file system designed for zero-knowledge privacy. It separates metadata management (strongly consistent via Raft) from data storage (scalable via chunked distribution). The system is designed to provide fs.FS compatibility for Go clients while ensuring that the storage providers (nodes) cannot read the user's data or metadata.
Technical Specification: For the exhaustive Client<->Server protocol contract, refer to SERVER-API.md. For the high-level Go Client API, refer to CLIENT-API.md.
Implementation Plan: For the detailed, step-by-step execution strategy of this design, refer to DISTFS-PLAN.md.
The system uses a unified node architecture to simplify deployment and management.
- Client: The entry point. Handles encryption, chunking, and file tree logic.
- Storage Node: A unified binary that performs two distinct roles:
- Metadata Role: Participates in the Raft consensus group (typically 3-5 nodes) to manage the namespace.
- Data Role: Stores encrypted binary blobs. All nodes in the cluster perform this role, allowing storage capacity to scale horizontally independent of the consensus group.
- SECURITY: End-to-End Encryption (E2EE). Server trusts no one. Access control via cryptography.
- RELIABILITY: Metadata replicated via Raft logs. Data replicated via chunk copying (RepFactor=3).
- SCALABILITY: Metadata separated from data. Data writes go directly to Data Roles.
- LOW LATENCY: Clients cache metadata and stream chunk reads in parallel.
The system is designed to scale horizontally, with the following soft limits:
- File Size: Up to 100 GB (100,000 chunks @ 1MB).
- Design Implication: Large files result in large Inode structures (100k UUIDs ~ 1.6 MB). The Metadata Layer splits `ChunkManifests` into multiple pages to keep individual Raft log entries small.
- File Count: Up to 1 Million Files per cluster.
- Design Implication: This results in ~1M keys in the Raft FSM. The `LinkSnapshotStore` and `NoSnapshotRestoreOnStart` optimizations are critical for O(1) restarts. Metadata is grouped by "Directory" to allow LRU eviction of inactive trees from memory.
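The manifest-pagination arithmetic above can be made concrete with a short sketch. This is an illustration, not implementation: the page capacity (`idsPerPage`) is an assumption chosen for the example; only the 1 MB chunk size and 16-byte UUID size come from this document.

```go
package main

import "fmt"

// Illustrates the sizing math: a 100 GB file at 1 MB per chunk yields
// ~100k chunk IDs; at 16 bytes per UUID the flat manifest is ~1.6 MB,
// which is why it is split into pages. idsPerPage is an assumption.
const (
	chunkSize  = 1 << 20 // 1 MB per chunk
	uuidSize   = 16      // bytes per chunk ID
	idsPerPage = 4096    // assumed manifest page capacity
)

func manifestPages(fileSize int64) (chunks, pages int64) {
	chunks = (fileSize + chunkSize - 1) / chunkSize
	pages = (chunks + idsPerPage - 1) / idsPerPage
	return
}

func main() {
	chunks, pages := manifestPages(100 << 30) // 100 GB
	fmt.Printf("chunks=%d manifest=%d bytes pages=%d\n",
		chunks, chunks*uuidSize, pages)
}
```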
- Data Privacy: All file content is encrypted on the Client side using AES-256-GCM before it is sent to any node.
- Metadata Privacy: Directory names and file names are encrypted. The MetaNode sees the file system as a graph of opaque IDs, not paths like `/home/alice/docs`.
- Network Privacy: By default, all HTTP clients utilize Encrypted Client Hello (ECH) and DNS-over-HTTPS (DoH) via `github.com/c2FmZQ/ech`. This encrypts the SNI during the TLS handshake and the DNS resolution queries, preventing network observers from identifying the specific DistFS clusters or nodes a client is communicating with. (This can be disabled via `--disable-doh` for internal environments.)
- Node Security: Leveraging `github.com/c2FmZQ/storage`, all data stored on MetaNodes and DataNodes (Raft logs, snapshots, chunk files, keys) is encrypted at rest.
- Root Secret: A `DISTFS_MASTER_KEY` environment variable provides the master passphrase.
- Hardware Binding (Optional): If the `--use-tpm` flag is provided, the node will use a local Trusted Platform Module (TPM) to compute an HMAC over the `DISTFS_MASTER_KEY`. This ensures the local storage backend cannot be decrypted without physical access to the specific hardware that initialized it.
- Master Key: A `crypto.MasterKey` is derived from this passphrase (or its TPM HMAC) to decrypt the node-local key store (`data/master.key`).
- ClusterSecret Vault: Each node maintains a local encrypted vault containing the shared ClusterSecret. This vault is protected by the node's unique Master Key.
- Isolation: Encryption keys are node-local and never shared across the network.
- Key Rotation:
- Raft Logs: The encryption key for Raft logs MUST be rotated after every snapshot.
- FSM Metadata: Metadata values in BoltDB are encrypted using a cluster-wide FSM KeyRing. The active key is used for new writes, while old keys are retained for decryption.
- Root of Trust (FSM): Critical metadata anchors (the `FSM KeyRing` and `ClusterSignKey`) are stored in the BoltDB `system` bucket, encrypted with a key derived from the ClusterSecret. This ensures that the rotating KeyRing can be safely stored within the FSM itself.
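The AES-256-GCM content encryption described in this section can be sketched as follows. This is a minimal, hedged illustration of the primitive only — real File Key generation and distribution go through the Lockbox machinery, and the function names here are illustrative, not DistFS APIs.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealChunk encrypts one chunk with AES-256-GCM under a 32-byte file
// key, prepending the random nonce so the reader needs no extra state.
func sealChunk(fileKey, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(fileKey) // 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openChunk reverses sealChunk; GCM authentication fails on tampering.
func openChunk(fileKey, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(fileKey)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // placeholder key for the demo
	ct, _ := sealChunk(key, []byte("chunk data"))
	pt, _ := openChunk(key, ct)
	fmt.Println(string(pt))
}
```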
Users control their identity via an asymmetric key pair using Post-Quantum Cryptography (PQC) algorithms (ML-KEM/CRYSTALS-Kyber for encapsulation, ML-DSA/CRYSTALS-Dilithium for signatures) to future-proof against quantum threats.
- User Identity Key: Public key maps to the User ID. Private key signs requests.
- User ID: Derived from the user's `sub` (subject) claim using a cluster-wide HMAC to ensure privacy.
- Cluster Identity (Epoch Keys): The cluster maintains a rotating set of shared PQC KEM keys ("Epoch Keys") stored in the Raft FSM.
- Shared: All nodes use the same keys to decrypt client requests, enabling stateless load balancing.
- Rotating: Keys rotate periodically (e.g., daily) to provide Forward Secrecy. Old keys are securely erased from memory and disk.
- Group Identity Key: A rotating Public/Private signing key pair (ML-DSA) representing a Group. The Private Signing Key, along with a symmetric Epoch Key (for file encryption), is shared among group members via the Group Lockbox.
- File Key (FK): A random symmetric key generated for each file. Encrypts the file content.
- Lockbox:
- File Lockbox: Stores the `File Key` encrypted for the Owner and/or the assigned Group (using the Group's current Epoch Key).
- Group Lockbox: Stores the `Group Private Signing Key` and current `Epoch Key` encrypted for each named member's Public Key.
- Anonymous Lockbox: An unordered array of ciphertexts containing the group keys, encrypted for anonymous members.
- World Lockbox: Special entry for ID `world`. Allows all registered users to retrieve the World Private Key (encrypted for them) to decrypt or modify "world-accessible" files.
To minimize PII exposure, the metadata layer operates on opaque identifiers.
- Transient PII: The server processes user identifiers (e.g., `sub` claims during OIDC registration) only momentarily in memory.
- Hashed Identifiers: The persistent User ID is `HMAC-SHA256(sub, ClusterSecret)`.
- Cluster Secret: A high-entropy random key generated at cluster bootstrap, stored securely in the Raft FSM, and shared among nodes via mTLS. It never leaves the cluster.
- Implication: Logs, snapshots, and disk storage contain no emails, sub claims, or names.
- No Names: The FSM does not store user names (e.g., "Alice"). Users who wish to share their display name must store it in an encrypted file (e.g., `/.profile`) within the file system itself.
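The `HMAC-SHA256(sub, ClusterSecret)` derivation above is straightforward to sketch in Go; the function name and secret value are illustrative only.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deriveUserID computes the persistent, privacy-preserving User ID
// described above: HMAC-SHA256 keyed by the ClusterSecret over the
// OIDC sub claim. The plaintext sub never needs to be persisted.
func deriveUserID(sub string, clusterSecret []byte) string {
	mac := hmac.New(sha256.New, clusterSecret)
	mac.Write([]byte(sub))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	secret := []byte("high-entropy-cluster-secret") // placeholder
	fmt.Println(deriveUserID("oidc-subject-1234", secret))
}
```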
While TLS (Layer 4) protects the connection, DistFS implements Layer 7 End-to-End Encryption for all metadata operations to ensure that infrastructure components cannot observe or tamper with the file system structure.
- Unified Endpoint: All authenticated operations are routed through a single, non-descript route (`POST /v1/invoke`) to preserve user privacy and prevent traffic analysis.
- Sealed Requests: All mutation and sensitive query requests are wrapped in a `SealedRequest` envelope. The decrypted payload is a `SealedEnvelope` containing the specific `Action` (e.g. `GetInode`, `Batch`) and its parameters.
- Sealed Responses: Responses are wrapped in a `SealedResponse` envelope, encrypted for the specific Client.
- Replay Protection: Each sealed envelope includes a high-resolution timestamp and is subject to sliding-window nonce verification.
To maximize user privacy and minimize the server's knowledge of the filesystem content, all metadata that is not strictly required for server-side enforcement (e.g., filenames, timestamps, fine-grained ACLs) is consolidated into a single encrypted blob.
- ClientBlob Envelope: An AES-256-GCM encrypted structure stored within the `Inode` and `Group` objects.
- Encryption Keys:
- Inode: Encrypted with the File Key.
- Group: Encrypted with the Group Encryption Key.
- Encapsulated Fields:
- Inodes: Filenames (`Name`), symbolic link targets, modification times (`MTime`), small file content (`InlineData`), and POSIX ownership (UID/GID).
- Groups: Human-readable group names.
- Integrity & Attribution: The `SignerID` is stored in the public `Inode` struct to allow all readers to verify the `ManifestHash` integrity before attempting decryption. The `ClientBlob` is included in this signed hash.
To support seamless multi-device usage without compromising the "Trust No One" model, DistFS provides a unified onboarding flow that combines identity initialization, registration, and cloud-backed recovery.
- Unified Onboarding (`init` command):
- New Account (`--new`): The client generates PQC identity keys, executes the OAuth2 Device Flow to authenticate via OIDC, registers the keys with the server, encrypts the local configuration, and automatically pushes a synchronization blob to the server.
- Existing Account: On a new device, the user runs `init` without the `--new` flag. The client authenticates via OIDC, retrieves the encrypted synchronization blob from the server, and restores the local configuration after prompting for the passphrase.
- Client-Side Preparation: The client encrypts its `config.json` (containing the PQC Identity and Encryption keys) using a user-provided passphrase and Argon2id KDF.
- Passphrase-Encrypted Blob: The server only ever sees the opaque ciphertext (`KeySyncBlob`).
- Security Enforcement: To prevent unauthorized overwrites, storing or updating a sync blob requires a valid `Session-Token` and mandatory Layer 7 E2EE (Sealing).
To enhance security during passphrase entry, DistFS supports the Assuan protocol via the pinentry suite of tools.
- Standard Protocol: The client communicates with `pinentry` binaries (e.g., `pinentry-curses`, `pinentry-qt`, `pinentry-mac`) to securely capture user passphrases.
- Environment Integration: Supports `GPG_TTY` for terminal-based entry and respects `~/.gnupg/gpg-agent.conf` configurations.
- Opt-in Usage: Enabled via the `--use-pinentry` flag in CLI and FUSE tools.
- Hardened Implementation: Validates input environments and avoids insecure logging of captured passphrases.
To obtain an inode's file key, a client performs a uniform asymmetric trial decryption. Every entry in an Inode Lockbox is encrypted using an asymmetric Post-Quantum algorithm (ML-KEM) for a specific User, Group, or the World.
- Iterate Recipients: The client iterates through all `recipientID` keys in the inode's `Lockbox` map.
- Personal Access: If `recipientID` matches the client's `UserID`, the client decapsulates the entry using its personal private `DecKey`.
- World Access: If `recipientID` is `world`, the client fetches the cluster's `WorldPrivateKey` (authorized by its user identity) and decapsulates the entry.
- Group Access (Hierarchical): If the `recipientID` is a `GroupID`, the client attempts to derive the Group's private key:
- Membership Lookup: The client fetches the Group metadata and computes its stable privacy-preserving identifier: `target = HMAC(GroupID, UserID)`.
- Named Membership: If `target` exists in the Group's lockbox, the client decapsulates it using its personal `DecKey` to retrieve the `EpochSeed`.
- Anonymous Membership: If not found, the client performs trial decryption against the Group's `AnonymousLockbox` (an unordered array of ciphertexts) using its personal `DecKey`.
- Key Derivation: Once the `EpochSeed` is retrieved, the client derives the Group's private key for the requested epoch.
- Inode Unlocking: The client uses the derived Group private key to decapsulate the entry in the Inode's lockbox for that `recipientID` (the GroupID).
- Verification: Once a potential file key is obtained, it is used to decrypt the `ClientBlob`. Decryption success and a valid signature check confirm the key is correct.
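The personal/world/group control flow above can be sketched as a simplified, runnable loop. This is not the DistFS implementation: real entries are ML-KEM ciphertexts, while here a toy `Decapsulator` interface (with an XOR stand-in) keeps the example self-contained, and the final ClientBlob verification step is omitted.

```go
package main

import (
	"bytes"
	"fmt"
)

// Decapsulator is a stand-in for ML-KEM private-key decapsulation.
type Decapsulator interface {
	Decapsulate(ct []byte) ([]byte, bool)
}

// xorKEM is a toy Decapsulator used only to make the sketch runnable.
type xorKEM struct{ priv byte }

func (k xorKEM) Decapsulate(ct []byte) ([]byte, bool) {
	out := make([]byte, len(ct))
	for i, b := range ct {
		out[i] = b ^ k.priv
	}
	return out, true
}

// unlockInode walks the lockbox map, trying the personal, world, and
// group access paths until one yields a candidate file key. In the
// real flow the candidate is then confirmed by decrypting ClientBlob.
func unlockInode(lockbox map[string][]byte, userID string,
	personal, world Decapsulator,
	groupKey func(groupID string) (Decapsulator, bool)) ([]byte, bool) {
	for recipientID, entry := range lockbox {
		switch {
		case recipientID == userID: // personal access
			if fk, ok := personal.Decapsulate(entry); ok {
				return fk, true
			}
		case recipientID == "world": // world access
			if world != nil {
				if fk, ok := world.Decapsulate(entry); ok {
					return fk, true
				}
			}
		default: // assume a GroupID: derive the group key first
			if g, ok := groupKey(recipientID); ok {
				if fk, ok := g.Decapsulate(entry); ok {
					return fk, true
				}
			}
		}
	}
	return nil, false
}

func main() {
	me := xorKEM{priv: 0x42}
	fk := []byte{1, 2, 3}
	ct := make([]byte, len(fk))
	for i, b := range fk {
		ct[i] = b ^ 0x42
	}
	got, ok := unlockInode(map[string][]byte{"user-1": ct}, "user-1",
		me, nil, func(string) (Decapsulator, bool) { return nil, false })
	fmt.Println(ok, bytes.Equal(got, fk))
}
```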
To improve performance and enable access during network partitions, DistFS implements a persistent, platform-aware caching layer that adheres to the "Trust No One" security model.
- Unified Storage Interface (`KVStore`): The client utilizes a common storage abstraction to manage persistent data, ensuring consistent behavior across all platforms.
- Cross-Platform Implementations:
- Native (CLI/FUSE): Utilizes BoltDB for structured metadata (inodes, groups, users) and a sharded filesystem structure for encrypted data chunks.
- WASM (Browser): Leverages browser-native IndexedDB to provide persistent storage without requiring a local filesystem.
- Local Encryption at Rest: To maintain the zero-knowledge boundary even for local caches, all data stored in the `KVStore` is encrypted using AES-256-GCM. The encryption key is derived from the user's passphrase (provided during `init` or `login`) via Argon2id, ensuring that an attacker with local disk access cannot inspect the cached file system structure or content.
- Tiered Caching (L1/L2):
- L1 (In-Memory): High-speed bounded LRU caches for active sessions.
- L2 (Persistent): Encrypted disk/IndexedDB storage for long-term persistence across restarts.
- Read-Only Offline Mode:
- Fallback: If the metadata server is unreachable or the user explicitly toggles offline mode, the client automatically falls back to the L2 persistent cache.
- Integrity: Cached Inodes, Groups, and Users retain their cryptographic signatures, allowing the client to verify data integrity even when disconnected.
- Write Protection: To prevent split-brain conflicts, all mutation operations are strictly prohibited while in offline mode.
- Strong Metadata Re-validation: When online, the client can efficiently cross-check the freshness of cached items by sending a batch of `(ID, Version)` pairs to the server. The server identifies which items have newer versions, allowing the client to perform surgical cache invalidation.
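The `(ID, Version)` exchange above reduces to a small comparison on either side; a minimal sketch, where the type and function names are assumptions and `serverVersions` stands in for the server's authoritative view:

```go
package main

import "fmt"

// CacheCheck is one (ID, Version) pair sent by the client.
type CacheCheck struct {
	ID      string
	Version uint64
}

// staleIDs returns the cached items whose version lags the server,
// i.e. the set the client should surgically invalidate.
func staleIDs(cached []CacheCheck, serverVersions map[string]uint64) []string {
	var stale []string
	for _, c := range cached {
		if v, ok := serverVersions[c.ID]; ok && v > c.Version {
			stale = append(stale, c.ID)
		}
	}
	return stale
}

func main() {
	cached := []CacheCheck{{"inode-a", 3}, {"inode-b", 7}}
	server := map[string]uint64{"inode-a": 5, "inode-b": 7}
	fmt.Println(staleIDs(cached, server)) // only inode-a is stale
}
```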
This layer implements a distributed consensus architecture using the Raft protocol.
The Raft FSM stores the "Inode" table and Directory Structure.
- User Structure:
- `HMAC(sub) -> {UID, ML-KEM PK, ML-DSA PK, Usage, Quota}`.
- Group Structure:
- `UUID -> {ID, OwnerID, GID, ML-KEM PK, ML-DSA PK, ClientBlob, Lockbox, AnonymousLockbox, RegistryLockbox, EncryptedRegistry, AnonymousRegistry, Usage, Quota, Version, SignerID, Signature, QuotaEnabled, Nonce}`.
- OwnerID: The immutable identity authorized to manage the group. Can be a `UserID` or another `GroupID`.
- Nonce: A mandatory 16-byte random value. The Group ID is cryptographically bound to the OwnerID and this Nonce during creation (`GenerateGroupID`).
- ClientBlob: AES-GCM encrypted metadata (e.g., Group Name).
- Lockbox: Shares Group Private Signing Keys and Epoch Keys among all named members.
- AnonymousLockbox: An unordered array of ciphertexts containing the Group keys, encrypted for anonymous members.
- RegistryLockbox: Shares a symmetric Registry Key only among authorized managers (`OwnerID`).
- EncryptedRegistry: An opaque blob containing named member UserIDs, encrypted with the Registry Key.
- AnonymousRegistry: An opaque blob containing anonymous member Public Keys, encrypted with the Registry Key.
- Usage: Tracks inodes and bytes used by files assigned to this group.
- Quota: Resource limits for the group (Effective only if `QuotaEnabled` is true).
- QuotaEnabled: An immutable boolean decided at group creation. If true, the group is the primary debtor for all its files. If false, the individual file owners are charged.
- Version: Incremental counter for optimistic concurrency control.
- Signature: ML-DSA signature over the group metadata, signed by the `SignerID`.
- Membership Indices:
- `UserID -> List[GroupID]` (Direct Membership Index).
- `OwnerID -> List[GroupID]` (Ownership/Management Index).
- Inode Structure:
- `UUID -> {OwnerID, GroupID, GroupSignerID, Mode, Nonce, Manifest, Lockbox, ClientBlob, UserSig, GroupSig, OwnerDelegationSig, IsRoot}`.
- GroupSignerID: The ID of the group whose signing key was used for the `GroupSig` (enables attribution).
- Nonce: A mandatory 16-byte random value committed at creation to cryptographically bind the `OwnerID` to the Inode ID.
- IsRoot: A boolean flag enforcing that the root directory is never group or world writable, maintaining strict namespace integrity.
- ClientBlob: AES-GCM encrypted metadata (SymlinkTarget, MTime, ACLs, InlineData). Note: Primary filenames are NOT stored in the Inode.
- Directory Structure: The Metadata Layer MUST know the file system hierarchy to enforce permissions and perform Garbage Collection.
- Directory Inodes: Store a list of children: `HMAC(Name) -> ChildEntry{InodeID, EncryptedName, Nonce}`. This allows traversal and O(1) directory listings without the server knowing plaintext names.
- File Inodes: Store `ChunkManifest` (List of Chunk IDs + DataNode locations).
- Garbage Collection: Orphaned Inodes and Chunks (not referenced by any live Inode) are garbage collected.
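The `HMAC(Name) -> ChildEntry` lookup above can be sketched as follows. This is an assumption-laden illustration: the per-directory HMAC key handling and all names are invented for the example; only the keyed-hash lookup idea comes from the design.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// ChildEntry mirrors the directory entry shape described above.
type ChildEntry struct {
	InodeID       string
	EncryptedName []byte
	Nonce         []byte
}

// nameKey computes the opaque map key for a plaintext child name.
// The directory-scoped HMAC key would be held client-side.
func nameKey(dirKey []byte, name string) string {
	mac := hmac.New(sha256.New, dirKey)
	mac.Write([]byte(name))
	return fmt.Sprintf("%x", mac.Sum(nil))
}

// lookup resolves a plaintext name to a child in O(1) without the
// server ever seeing the name itself.
func lookup(children map[string]ChildEntry, dirKey []byte, name string) (ChildEntry, bool) {
	e, ok := children[nameKey(dirKey, name)]
	return e, ok
}

func main() {
	dirKey := []byte("per-directory-hmac-key") // placeholder
	children := map[string]ChildEntry{
		nameKey(dirKey, "notes.txt"): {InodeID: "uuid-1"},
	}
	e, ok := lookup(children, dirKey, "notes.txt")
	fmt.Println(ok, e.InodeID)
}
```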
DistFS ensures the integrity of file metadata (chunk manifests) using Dual-Signature Authorization. This prevents a compromised Metadata Server from silently modifying file contents or rolling back to old versions.
- Signer Attribution: Every manifest update includes a public `SignerID` and a corresponding `UserSig` signed by that user's PQC Identity Key (ML-DSA). This allows any reader to verify the integrity of the manifest before attempting decryption.
- Ownership Immutability: To prevent "Quota Hijacking" and maintain non-repudiable attribution, the `OwnerID` of an inode is immutable once the inode is created. Ownership cannot be transferred between users.
- Group Authorization (GroupSig): If a file is assigned to a group, it must be signed with the Group Signing Key. Furthermore, the FSM enforces that any update changing the `GroupID` or modifying a group-owned file must be signed by a user who is an authorized member of that group.
- Verification: Readers verify all signatures against the manifest hash. If the signatures do not match, or if the `SignerID` lacks the required authority (Owner or Group Member), the client rejects the file as tampered.
DistFS follows a strict subset of POSIX permissions designed for Zero-Knowledge security:
- Owner: Full `rwx` support.
- Group: Full `rwx` support via shared cryptographic keys.
- Other (World): Strictly Read-Only or None.
- Prohibition: The "Write" bit for 'Other' (0002) is strictly prohibited. The Metadata Server will reject any `chmod` or `mkdir` request that attempts to grant world-write access. Verifiable integrity cannot be maintained for anonymous writers.
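The world-write prohibition above amounts to one bit test on every submitted mode; a minimal sketch, with the function name and error text as illustrative assumptions:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

var errWorldWritable = errors.New("world-writable modes are prohibited")

// validateMode rejects any mode that sets the 'Other' write bit
// (0002), mirroring the server-side chmod/mkdir check described above.
func validateMode(mode fs.FileMode) error {
	if mode.Perm()&0o002 != 0 {
		return errWorldWritable
	}
	return nil
}

func main() {
	fmt.Println(validateMode(0o755)) // accepted
	fmt.Println(validateMode(0o777)) // rejected: sets the 0002 bit
}
```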
- Engine: Hashicorp Raft with BoltDB.
- Snapshot Strategy: Use `MetadataSnapshot` (Streaming BoltDB).
- Fast Startup: `NoSnapshotRestoreOnStart = true`. MetaNodes rely on disk persistence and only replay trailing logs on startup to ensure fast recovery.
Metadata operations in DistFS follow an Optimistic Concurrency Control (OCC) model with client-side authority over versioning.
- Strict Sequentiality: The server enforces that every update to an Inode or Group must increment its `Version` field by exactly one.
- Client-Side Authority: The client is responsible for fetching the latest state, applying mutations, and signing the new version.
- Lease Enforcement: Linearizability is guaranteed via server-side lease enforcement (see Section 5 of SERVER-API.md).
- Atomic Merge Pattern: The client library provides mutation callbacks that automatically re-fetch and retry on version conflicts (HTTP 409).
- Structural Validation: To maintain namespace integrity, the Metadata Server performs batch-wide structural checks:
- NLink Consistency: The server validates that changes to a directory's `Children` map are matched by corresponding changes to the child Inode's `NLink` count within the same atomic batch.
- Link Bidirectionality: For every addition to a directory's `Children` map, the child Inode MUST include a reciprocal entry in its `Links` map.
- Type Integrity: `DirType` inodes cannot be hard-linked (maximum `NLink` of 1). `FileType` and `SymlinkType` inodes MUST have an empty `Children` map.
- Root Protection: Inodes with no parent links (Filesystem Roots) are protected from deletion and unlinking operations.
- Empty Directory Protection: A request to delete a directory Inode (NLink=0) is rejected if the directory still contains children in its metadata.
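The Atomic Merge Pattern in this section (fetch latest, mutate, submit with Version+1, retry on conflict) can be sketched with an in-memory stand-in for the metadata API. The `Store` type, the retry limit, and all names are assumptions for illustration; only the strict +1 versioning rule comes from the design.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("version conflict (409)")

type Inode struct {
	ID      string
	Version uint64
	Data    string
}

// Store simulates the server's optimistic-concurrency check.
type Store struct{ current Inode }

func (s *Store) Get(id string) Inode { return s.current }

func (s *Store) Put(in Inode) error {
	if in.Version != s.current.Version+1 { // strict +1 sequentiality
		return errConflict
	}
	s.current = in
	return nil
}

// mutate re-fetches and retries the callback until the write lands,
// mirroring the client library's conflict-retry behavior.
func mutate(s *Store, id string, fn func(*Inode)) error {
	for attempt := 0; attempt < 5; attempt++ {
		in := s.Get(id)
		fn(&in)
		in.Version++ // client-side authority over versioning
		if err := s.Put(in); err == nil {
			return nil
		} else if !errors.Is(err, errConflict) {
			return err
		}
	}
	return errors.New("too many conflicts")
}

func main() {
	s := &Store{current: Inode{ID: "a", Version: 3}}
	_ = mutate(s, "a", func(in *Inode) { in.Data = "updated" })
	fmt.Println(s.current.Version, s.current.Data)
}
```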
DistFS employs a "Registry-Backed" model where the definitive source of truth for a group's identity and keys is a signed attestation file in the /registry directory.
Architectural Boundary: The "Registry" is entirely a Client-Side concept. To the server's FSM, the /registry directory and its contents are simply opaque Inodes and encrypted ClientBlob data. The metadata server never parses or verifies a GroupDirectoryEntry or DirectoryEntry. All attestation verification is performed locally by the client.
- Ownership Model (Immutable):
- User-Owned: If `OwnerID` matches a `UserID`, only that user can sign updates.
- Group-Owned: If `OwnerID` matches a `GroupID`, any authorized member of the owning group can manage the group. This includes Self-Management (Group A owning itself).
- Immutability: The `OwnerID` is set at creation by an Administrator and cannot be changed.
- The Registry Attestation: Every group is represented by a `/registry/<name>.group` file.
- Content: A signed `GroupDirectoryEntry` containing the GroupID and its public keys.
- Permissions: Visibility is granted to the `users` group via POSIX ACLs (`r-x` on directory, `r--` on file).
- Trust Anchor: Clients verify that the keys and OwnerID stored in the FSM match the signed attestation in the Registry.
- Privacy (Cryptographic Capabilities): To prevent membership leaks, the FSM no longer stores membership lists for authorization. The server simply verifies the `GroupSig` on the request against the Group's Public Signing Key. The server knows the requester is an authorized member (because they possess the signing capability), but it does not know which member.
- Management Authorization (The One-Hop Rule): Management checks are limited to a single level. If Group A is owned by Group B, and Group B is owned by Group C, a member of Group C cannot manage Group A unless they are also an explicit member of Group B.
- Signature Requirement: All group updates must be signed by the requester's ML-DSA Identity Key. The server verifies this signature and confirms the signer is an authorized manager based on the ownership model.
- Decoupled Registry Anchoring: Groups are created in the FSM by any user. However, for a group to be widely discoverable and verifiable via the Sovereign Chain of Trust, an Administrator must explicitly "anchor" it by creating the `/registry/<name>.group` attestation file using the standard filesystem API.
- Client-Side Verification (Dual-Cache Architecture):
- Verified Cache: Operations requiring trust (e.g., encrypting new files, verifying delegations) strictly require the group to pass the full `VerifyGroup` flow (cross-checking the registry attestation). These are stored in a `verifiedGroupCache`.
- Unverified Cache (Read-Only): To break circular dependencies during path resolution (e.g., reading the `/registry` itself) and to optimize read performance, clients use an `unverifiedGroupCache`. The client extracts the decryption key from the unverified group's Lockbox. If the server spoofed the key, the subsequent AEAD decryption of the target `Inode` inherently fails, preserving the Zero-Knowledge boundary without requiring full registry verification for every read operation.
To support collaboration without a central directory, the metadata layer provides authenticated users with a way to discover groups they are involved in.
- Group List API: An authenticated user can query for a list of groups where they have a defined role.
- Role Resolution: The server identifies the user's role for each group:
- Owner: The user is the direct `OwnerID`.
- Manager: The user is a member of a group that is the `OwnerID`.
- Member: The user is explicitly listed in the named `Lockbox` or `EncryptedRegistry` (Anonymous members cannot use server-side discovery).
- Privacy Preservation: The server returns only the `GroupID`, the encrypted `ClientBlob`, and the resolved `Role`. The MetaNode does not know the plaintext names; the client must use its local keys to decrypt and display the group names to the user.
DistFS enforces multi-tenant resource limits at both the User and Group levels to ensure fair resource allocation and prevent accidental or malicious exhaustion of cluster storage.
- Quota Metrics: The system tracks two primary metrics:
- Inodes: The total number of files and directories owned by the entity.
- Bytes: The total logical size of all data chunks referenced by the entity's inodes.
- Enforcement Hierarchy (Debtor Resolution): When an operation (e.g., file creation, write, or group assignment) occurs, the server identifies the primary debtor based on the target Inode's `GroupID`:
- Group Debt: If the Inode belongs to a group with `QuotaEnabled: true`, the Group is charged exclusively. The Group's quota is enforced, and the User's personal quota is ignored.
- User Debt (Fallback): If the group has `QuotaEnabled: false` (or the Inode has no `GroupID`), the individual `OwnerID` (User) is charged.
- Security & Immutability: The `QuotaEnabled` flag and the Inode `OwnerID` are immutable. This prevents users from maliciously shifting storage costs to other users. Assignment to a group is only permitted if the signer is a member of that group.
- Atomic Accounting: Usage counters are updated atomically within the same Raft transaction as the metadata mutation.
- Admin Management: Resource limits are managed by cluster administrators via the Admin CLI. Limits can be updated dynamically without affecting existing data availability.
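The debtor-resolution rule above is a two-branch decision; a minimal sketch, where the types and function name are illustrative assumptions:

```go
package main

import "fmt"

// Group carries only the fields relevant to debtor resolution here.
type Group struct {
	ID           string
	QuotaEnabled bool
}

// resolveDebtor charges a QuotaEnabled group exclusively; otherwise
// the individual file owner pays, matching the hierarchy above.
func resolveDebtor(ownerID string, group *Group) string {
	if group != nil && group.QuotaEnabled {
		return group.ID
	}
	return ownerID
}

func main() {
	g := &Group{ID: "grp-1", QuotaEnabled: true}
	fmt.Println(resolveDebtor("user-1", g))   // the group is charged
	fmt.Println(resolveDebtor("user-1", nil)) // falls back to the owner
}
```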
To support multi-tenancy and specialized organizational structures, DistFS supports the creation of multiple independent filesystem roots on a single cluster.
- Independent Hierarchies: While the system provides a default root (`metadata.RootID`), administrators can initialize any number of independent directory trees. Each root is a fully functional, self-contained filesystem with its own encryption keys and lockbox.
- Explicit Initialization: Roots must be explicitly initialized using the `admin-create-root` command. This ensures that the cluster does not automatically create namespace structures unless directed by an authorized administrator.
- Client-Side Chroot: The client library and FUSE mount tool support "chrooting" to any authorized Inode ID. When a client is rooted at a specific Inode, all path resolutions (starting from `/`) are relative to that Inode.
- Configuration: The client configuration file (`config.json`) uses a `Roots` map, allowing users to configure and easily switch between multiple named roots without configuration corruption or manual editing.
- Isolation: A chrooted client has no visibility or access to the original global root or other siblings in the hierarchy, providing a robust mechanism for namespace isolation.
To support Out-Of-Band (OOB) identity verification without centralizing trust on the metadata server, DistFS implements a Distributed Directory Service (conceptually similar to /etc/passwd). It is important to note that participation in this registry is entirely optional. The registry acts as a UX overlay to facilitate human-readable discovery; the core cryptographic operations of the filesystem rely exclusively on the UserID and underlying keys, not the registry itself.
- The Registry Structure: A registry is stored as a standard DistFS directory (e.g., `/registry`). Access to manage this directory is governed by standard group permissions (e.g., the `registry` group).
- Individual Attestations: Each verified user is represented by an individual file (e.g., `alice.user`) within the registry directory. This file contains a signed `DirectoryEntry` JSON blob, which includes:
- The user's human-friendly Username and Full Name.
- The user's PQC Public Keys (`ek` and `sk`).
- The VerifierID: The User ID of the administrator or trusted member who performed the OOB check.
- An Attestation Signature generated by the Verifier over all the above fields.
- Transitive Trust: By maintaining a shared group address book, organizations can implement "Trust Once, Share Everywhere". If an authorized verifier adds a signed attestation to the registry, all other users in the cluster can inherit that trust.
DistFS employs a strict "Zero-Trust" posture for new registrations, preventing unauthorized data access and resource exhaustion.
- Open Registration: Users authenticate and register their hardware keys via an OIDC flow (e.g., `distfs init`). The server creates a `User` record in the FSM.
- Locked State: Upon creation, all new user accounts are explicitly marked as `Locked: true`.
- A locked user cannot read or write any metadata, traverse directories, or allocate storage chunks.
- Crucially, a locked user cannot retrieve the `WorldIdentity` private key, preventing them from accessing world-readable files before they are formally vetted.
- A locked user's default storage and inode quota is strictly Zero.
- Administrative Onboarding: To gain cluster access, a new user must undergo a guided onboarding flow (`distfs registry-add --unlock`):
- OOB Verification: An admin verifies the user's PQC key fingerprint via an external channel using a 3-byte hex security code (e.g., `4A-B2-CF`).
- Attestation: The admin creates the user's entry in the canonical `/registry`.
- Unlock & Quota: The admin issues an FSM command to set `Locked: false` and provisions an initial quota.
- Workspace: The admin provisions a home directory (`/users/<username>`) and grants the user traversal rights by adding them to the `users` group.
DistFS uses opaque UUIDs for Inodes rather than a strict Merkle Tree (which causes severe concurrency bottlenecks). To prevent a compromised server from modifying an Inode's metadata to swap its OwnerID or GroupID (and thus allowing an attacker to self-sign malicious payloads), DistFS enforces strict cryptographic provenance.
- Cryptographic ID Commitment: When a client creates a new Inode, the `Inode.ID` is generated as a cryptographic hash of the creator's `UserID` and a random nonce (`ID = Hash(OwnerID || Nonce)`). The `Nonce` is stored in the Inode. During `VerifyInode`, the client independently verifies this hash. If a compromised server changes the `OwnerID` in the database, the hash verification will fail, guaranteeing that the `OwnerID` is mathematically immutable and bound to the ID referenced by the parent directory.
- Owner Delegation Signature: If the `OwnerID` grants write access to a `GroupID` or a specific user via an ACL, they must cryptographically sign that delegation. The `Inode` struct includes an `OwnerDelegationSig`. When evaluating an Inode signed by someone other than the `OwnerID`, the client first verifies the `OwnerDelegationSig` using the true Owner's public key. If valid, it proves the Owner explicitly authorized the current ACLs/Group assignments, closing the "Self-Signed Bypass" vulnerability.
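The `ID = Hash(OwnerID || Nonce)` commitment above can be sketched directly with SHA-256. This is an illustration of the binding property only — the real hash construction and nonce handling may differ, and the function names are assumptions.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// generateInodeID commits the creator's identity into the ID:
// ID = SHA-256(OwnerID || Nonce), with the nonce stored alongside.
func generateInodeID(ownerID string) (id [32]byte, nonce [16]byte) {
	rand.Read(nonce[:])
	id = sha256.Sum256(append([]byte(ownerID), nonce[:]...))
	return
}

// verifyInodeID recomputes the hash; a server-side OwnerID swap makes
// this check fail, so ownership is mathematically immutable.
func verifyInodeID(id [32]byte, ownerID string, nonce [16]byte) bool {
	return id == sha256.Sum256(append([]byte(ownerID), nonce[:]...))
}

func main() {
	id, nonce := generateInodeID("user-alice")
	fmt.Println(verifyInodeID(id, "user-alice", nonce))   // true
	fmt.Println(verifyInodeID(id, "user-mallory", nonce)) // false
}
```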
DistFS implements POSIX.1e draft standard Access Control Lists natively within the metadata layer. This allows fine-grained, user-level access delegations without requiring the creation of administrative groups.
- Schema Mapping: ACLs are stored natively in the `Inode` struct as `AccessACL` and `DefaultACL`. They adhere strictly to the POSIX algorithm, evaluating permissions in the order Owner -> Named Users -> Primary Group -> Named Groups -> Other, intersected with the `Mask` entry.
- Cryptographic Expansion (The Lockbox Cost): To maintain end-to-end encryption, the FSM guarantees that any user or group granted effective read permission via an ACL is included in the cryptographic Lockbox. If an ACL grants 10 specific users read access, the client must fetch 10 public keys and encapsulate the file key 10 times, resulting in a larger metadata footprint (~1 KB per recipient).
- Default ACLs (Directory Inheritance): DistFS supports `DefaultACL` entries on directories. Any file or directory created within inherits these permissions.
  - Cost Acknowledgment: Unlike local filesystems, where inheritance is merely a bitwise copy, DistFS inheritance triggers cryptographic operations. When a client creates a file in a directory with Default ACLs, it must proactively build the expanded Lockbox before the file can be committed to the cluster.
- FUSE Integration: The FUSE client exposes these ACLs via the standard `system.posix_acl_access` and `system.posix_acl_default` extended attributes (xattrs). This allows standard Linux utilities like `setfacl` and `getfacl` to work seamlessly within the mount.
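The evaluation order above can be sketched as a Go function. This is a simplified model — field names and the rwx bit encoding are illustrative, not the actual `Inode` schema:

```go
package main

import "fmt"

// Perm is a 3-bit rwx mask (4 = read, 2 = write, 1 = execute).
type Perm uint8

// ACL is a simplified POSIX.1e access ACL; field names are assumptions.
type ACL struct {
	Owner       string
	OwnerPerm   Perm
	NamedUsers  map[string]Perm
	Group       string
	GroupPerm   Perm
	NamedGroups map[string]Perm
	Mask        Perm
	Other       Perm
}

// Check follows the POSIX.1e order: owner, named users, owning/named groups
// (best match), then other. Every entry except the owner and "other" is
// intersected with the mask.
func (a ACL) Check(user string, groups []string, want Perm) bool {
	if user == a.Owner {
		return a.OwnerPerm&want == want // owner entry is not masked
	}
	if p, ok := a.NamedUsers[user]; ok {
		return p&a.Mask&want == want
	}
	inGroup := false
	for _, g := range groups {
		if g == a.Group {
			inGroup = true
			if a.GroupPerm&a.Mask&want == want {
				return true
			}
		}
		if p, ok := a.NamedGroups[g]; ok {
			inGroup = true
			if p&a.Mask&want == want {
				return true
			}
		}
	}
	if inGroup {
		return false // matched a group-class entry but lacked permission
	}
	return a.Other&want == want
}

func main() {
	acl := ACL{
		Owner: "alice", OwnerPerm: 7,
		NamedUsers: map[string]Perm{"bob": 6}, // bob granted rw-
		Mask:       4,                         // mask downgrades bob to r--
	}
	fmt.Println(acl.Check("bob", nil, 4)) // true: read survives the mask
	fmt.Println(acl.Check("bob", nil, 2)) // false: write is masked out
}
```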
Files are split into fixed-size chunks of 1 MB. The client library handles padding (hiding exact file size) and encryption.
- Goal: 3 copies of each chunk (RepFactor=3).
- Constraint: A node must never hold more than one copy of the same chunk.
- Distribution: Chunks are distributed using Consistent Hashing weighted by available disk space.
- Fallback: If `Nodes < 3`, redundancy is `min(3, NodeCount)`.
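One ring-free way to realize weighted placement is weighted rendezvous (HRW) hashing, sketched below. This is an illustrative stand-in for the design's weighted consistent hashing — the scoring formula and hash choice are assumptions — but it satisfies both constraints above: replicas are always distinct nodes, and the fallback is `min(3, NodeCount)`:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
	"sort"
)

type node struct {
	ID        string
	FreeBytes float64 // weight: available disk space
}

// score implements weighted rendezvous hashing: every node scores the chunk
// independently, so placement needs no shared ring state.
func score(n node, chunkID string) float64 {
	sum := sha256.Sum256([]byte(n.ID + "/" + chunkID))
	// Map the first hash bytes to a float in (0, 0.5], keeping log(h) < 0.
	h := (float64(binary.BigEndian.Uint64(sum[:8])>>11) + 1) / float64(1<<54)
	return -n.FreeBytes / math.Log(h)
}

// placeChunk returns up to repFactor distinct nodes (highest scores win),
// so a node can never hold two copies of the same chunk.
func placeChunk(nodes []node, chunkID string, repFactor int) []string {
	sorted := append([]node(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool {
		return score(sorted[i], chunkID) > score(sorted[j], chunkID)
	})
	if repFactor > len(sorted) {
		repFactor = len(sorted) // fallback: min(RepFactor, NodeCount)
	}
	ids := make([]string, repFactor)
	for i := range ids {
		ids[i] = sorted[i].ID
	}
	return ids
}

func main() {
	nodes := []node{
		{"node-a", 500e9}, {"node-b", 250e9}, {"node-c", 250e9},
		{"node-d", 100e9}, {"node-e", 50e9},
	}
	fmt.Println(placeChunk(nodes, "chunk-123", 3))
}
```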
- Prepare: Client encrypts the chunk and calculates its hash (`ChunkID`).
- Allocate: Client requests allocation for `ChunkID` via the Metadata API (`POST /v1/meta/allocate`). The Leader selects 3 target nodes.
- Push: Client pushes the encrypted chunk to the Primary Node with `replicas=Secondary,Tertiary`.
- Replicate: Primary forwards the data to Secondary, which forwards to Tertiary (pipelined).
- Ack: Once all 3 acknowledge, Primary acks the Client.
- Commit: Client updates the file metadata (Chunk Manifest) on the Raft Leader.
- Replication Monitor: The Leader periodically scans chunk manifests.
  - Under-Replicated: If a node is missing for > `TBD` minutes, the Leader triggers a replication job to copy the chunk to a new healthy node.
  - Over-Replicated: Extra copies (e.g., a node returns after a temporary partition) are garbage collected to reclaim space.
- Node Draining: An admin API (`POST /v1/node/{id}/drain`) triggers a proactive replication of all chunks on a specific node to the rest of the cluster, allowing safe removal.
- Integrity Checks: Each node runs a background "Scrubber" process. It periodically reads all local chunks and verifies their checksums against the filename (Content-Addressable Storage). Corrupt chunks are quarantined and reported to the Leader for repair.
- Nodes store chunks as flat files: `data/chunks/{shard}/{chunk_id}`.
- Self-Validation: Chunks are content-addressed (hash of the encrypted content).
- Chunk Level: Writes to DataNodes are atomic. A chunk is either fully written and validated or rejected. Replacements use new Chunk IDs or versioned writes; existing chunks are immutable.
- Path Level (Atomic Swap): High-level mutation APIs (`OpenManyForUpdate`, `OpenBlobWrite`, `SaveDataFile`) utilize an Atomic Path Swap pattern.
  - The client acquires an Exclusive Lease on the filename (or path) to prevent concurrent atomic updates.
  - The client writes data to a New Inode. Existing readers continue to see the old Inode.
  - On `Close()` or commit, the client performs a batch metadata update that atomically points the directory entry to the New Inode and decrements the old Inode's link count.
  - Active readers of the old Inode are unaffected, as they hold leases on the Inode ID, not the path.
- POSIX Level: Standard FUSE operations (e.g., `write`, `truncate`) follow traditional POSIX semantics, potentially mutating an existing Inode in-place if not unlinked.
Data Nodes enforce permissions using Capability Tokens issued by the Metadata Leader.
- Flow:
- Client requests access to File X from Metadata Leader.
- Leader checks permissions (ACL/Group).
- Leader issues a time-bound Signed Token granting READ/WRITE access to the specific Chunk IDs associated with File X.
- Client presents Token to Data Node.
- Data Node verifies signature and expiry before serving data.
To achieve high-fidelity POSIX compliance, DistFS ensures that unlinked files (where NLink == 0) persist on storage nodes as long as they are being actively read or written by a client.
- Usage Leases: When a client opens a file, it acquires a Shared Usage Lease on the Inode. This lease acts as a signal to the cluster that the file is in use.
- Deferred Deletion: If a file is deleted (e.g., via `unlink`), the Metadata Server decrements its link count. If `NLink` becomes zero:
  - The Inode is removed from the directory namespace (it can no longer be "found" by new `Open` requests).
  - If active leases exist, the Inode is marked as Unlinked (Pending Delete).
  - The Inode and its associated chunks are not enqueued for Garbage Collection yet.
- Lease Heartbeat: Clients periodically renew their usage leases as long as the file handle is open. If a client crashes, the lease will naturally expire.
- Final Cleanup: The Metadata Server triggers the final deletion (quota reclamation and chunk GC enqueuing) only when the link count is zero and all usage leases have expired or been explicitly released.
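The deferred-deletion rules reduce to a small state machine, sketched here in Go with illustrative names:

```go
package main

import "fmt"

// inodeState mirrors the deferred-deletion rules above.
type inodeState struct {
	NLink        int
	ActiveLeases int
	Unlinked     bool // "Pending Delete": gone from the namespace, chunks retained
}

// unlink drops one link; at NLink==0 the inode leaves the namespace but is
// only GC-eligible once every usage lease has expired or been released.
func (s *inodeState) unlink() {
	if s.NLink > 0 {
		s.NLink--
	}
	if s.NLink == 0 {
		s.Unlinked = true
	}
}

// releaseLease models an explicit Close() or a lease expiring after a crash.
func (s *inodeState) releaseLease() {
	if s.ActiveLeases > 0 {
		s.ActiveLeases--
	}
}

// gcEligible is the Final Cleanup condition: zero links AND zero leases.
func (s *inodeState) gcEligible() bool {
	return s.NLink == 0 && s.ActiveLeases == 0
}

func main() {
	s := inodeState{NLink: 1, ActiveLeases: 1} // one open reader
	s.unlink()
	fmt.Println("pending delete:", s.Unlinked, "gc:", s.gcEligible()) // true, false
	s.releaseLease()
	fmt.Println("gc after lease release:", s.gcEligible()) // true
}
```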
The client library implements the standard library's `fs.FS` and `fs.File` interfaces (package `io/fs`).
- `Open(name string)`:
  - Resolve the path by traversing Directory Inodes (fetching `Children`).
  - Decrypt the `ClientBlob` from the parent directory to find component IDs.
  - Fetch file metadata (`Lockbox` + `ChunkManifest` + `ClientBlob`).
  - Decrypt the `ClientBlob` using the File Key from the `Lockbox`.
  - Return a `File` handle.
- `Read(b []byte)`:
  - Calculate which chunk(s) correspond to the requested byte range.
  - Look up Chunk IDs and their associated public URLs in the Inode's `ChunkManifest`.
  - Execute Staggered Parallel Fetches (Hedged Requests):
    - Initiate a download from the primary node.
    - If the download hasn't finished within a 1-second threshold, initiate a parallel fetch from the next replica.
    - Repeat until all replicas are exhausted or a download succeeds.
    - Upon the first successful download, cancel all remaining parallel requests for that chunk.
  - Decrypt the chunk in memory and copy it to `b`.
- Data File API (`ReadDataFile` / `SaveDataFile`):
  - `ReadDataFile(ctx, name, data any)`: Reads and unmarshals a passphrase-encrypted JSON/Gob file from the namespace.
  - `SaveDataFile(ctx, name, data any)`: Marshals and writes a file using the Atomic Swap Protocol (exclusive filename lease + new inode creation).
- Atomic Multi-File Operations:
  - `OpenManyForUpdate(ctx, paths []string, targets []any) (commit func(bool), error)`: Provides transactional write semantics across multiple files.
  - `ReadDataFiles(ctx, paths []string, targets []any) error`: Provides a point-in-time consistent snapshot of multiple files by using shared filename-based leases during the path-resolution phase.
Communication between the Client and Cluster uses JSON over HTTP/2. The Metadata Server requires Layer 7 End-to-End Encryption (Sealing) for all mutations.
Full API Catalog: For exhaustive documentation of every endpoint, request/response schema, and error code, refer to SERVER-API.md.
- Identity Registry:
  - Users: `HMAC(sub) -> Public Keys`. No PII (names/emails) stored.
  - Groups: `UUID -> Public Keys`.
- User Registration:
  - Federated Identity: Users must register via `POST /v1/user/register`, providing a valid OIDC ID Token (JWT). The server calculates `HMAC(sub)` using the internal Cluster Secret and registers the keys against this hash. The subject claim is discarded immediately.
  - Automated Configuration: The cluster leader is configured with an OIDC Discovery URL. It exposes the necessary authorization and token endpoints to clients via the `/v1/auth/config` endpoint, enabling zero-config onboarding.
- Authentication:
- Client Auth: Client authenticates with Metadata Server via Sealed Tokens (signed/encrypted) proving identity.
- Chunk Access: Client authenticates with Data Nodes via Signed Capability Tokens issued by Metadata Server.
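The pseudonymization step can be sketched with the standard `crypto/hmac` package. The secret value and hex encoding below are illustrative:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pseudonymize derives the stored user key from the OIDC subject claim.
// Only this keyed hash is persisted; the raw `sub` is discarded, so user
// records cannot be linked back to an identity without the Cluster Secret.
func pseudonymize(clusterSecret []byte, sub string) string {
	mac := hmac.New(sha256.New, clusterSecret)
	mac.Write([]byte(sub))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	secret := []byte("cluster-secret-from-tier-1-vault") // illustrative value
	fmt.Println(pseudonymize(secret, "oidc|user-123"))
}
```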
To provide high-fidelity POSIX compatibility, DistFS implements the following specialized operations:
- `Fsync`: Ensures that all dirty data for a file is committed to the data nodes and the inode metadata is updated on the Raft leader before returning.
- `Statfs`: Reports cluster-wide storage capacity and user-specific remaining quota (`MaxBytes` / `MaxInodes`).
- `Forget`: Handles kernel-level node eviction to prevent memory leaks in the client during long-running mounts.
- Incremental `ReadDir`: Uses streaming directory entries to support large directories without blocking on a single massive metadata fetch.
Out of Scope: `CopyFileRange`
Server-side copying is currently not supported because DistFS maintains Zero-Knowledge privacy. Since every file is encrypted with a unique symmetric key, copying data between files would require the server to decrypt and re-encrypt the content (or reuse keys, which weakens the security model), violating the core security mandate. All copies must be performed client-side.
DistFS supports compiling the core client (pkg/client) to WebAssembly, enabling a Zero-Knowledge "file manager" UI directly within the browser sandbox.
- Browser as the Trusted Boundary: All PQC math, metadata parsing, and AES-256-GCM encryption/decryption happen inside the client's browser. The server never receives plaintext or unsealed RPCs from the web client.
- Service Worker Streaming: To support multi-gigabyte file downloads without exhausting browser RAM (Out-Of-Memory errors), the web client utilizes a Service Worker. When a user requests a file, the Service Worker intercepts a synthetic request and responds with a `ReadableStream`. A dedicated WASM Web Worker fetches, verifies, and decrypts 1MB chunks just-in-time, passing them to the Service Worker via Transferable Objects, which streams them directly to the user's local disk.
- CORS & Fallback Transports: The storage nodes expose configurable CORS headers, and the WASM client uses standard `fetch` API fallbacks (bypassing custom TCP/TLS transports) to operate within browser networking constraints.
The backend utilizes three primary ports for its operations, ensuring separation of concerns:
- Public HTTP Port (`--addr`): Client-facing API port (default: `8080`).
- Internal HTTP Port (`--cluster-addr`): Dedicated mTLS-secured API for inter-node communication (default: `9090`).
- Raft Port (`--raft-bind`): Internal TCP transport for Raft consensus traffic (default: `8081`).
Port Advertisement: To support containerized and NATed environments, nodes must explicitly advertise their public addresses:
- `--cluster-advertise`: Public `host:port` for the internal cluster API.
- `--raft-advertise`: Public `host:port` for Raft traffic.
The cluster employs a Zero-Trust security model where no node is inherently trusted.
- Node Key: Each node generates a persistent private key (`node.key`) on first startup. By default, this is a software-generated Ed25519 key. If the `--use-tpm` flag is provided, a hardware-bound ECC P-256 key is generated inside the local TPM, and only the key handle is stored on disk, ensuring the private key material never exists in system memory.
- Node ID: The unique Raft Node ID is derived from the first 8 bytes of the public key.
- Mutual TLS (mTLS): All inter-node communication (Cluster API and Raft) is secured via mTLS. Nodes exchange self-signed certificates signed by their `node.key` (or the TPM). Connections are only accepted if the peer's public key is in the authorized `NodeMeta` list.
To solve the initial trust problem, new nodes use Trust On First Use (TOFU):
- Fresh State: A node with no history enters TOFU mode.
- Temporary Trust: It temporarily accepts a connection from an unknown peer (assumed to be the Cluster Leader).
- State Acquisition: The node receives the authoritative `NodeMeta` (list of trusted public keys) from the Leader.
- Strict Mode: Upon initialization, the node permanently switches to Strict Mode, enforcing the authorized key list for all future connections.
DistFS provides a comprehensive administrative interface for cluster operators. To ensure maximal security, management is performed via an interactive Command-line User Interface (CUI) within the distfs binary.
- PQC-Powered Authorization: Access to administrative functions is controlled by individual user identities rather than a shared secret.
  - Admin Registry: The FSM maintains a persistent `admins` bucket.
  - Bootstrap: The first user to register with a new cluster is automatically granted administrative privileges.
  - Promotion: Existing admins can promote other users to admin status via signed Raft commands.
- Secure Authentication: Admins authenticate using their standard PQC Identity Keys. All admin requests are SealedRequests (Layer 7 E2EE), ensuring that actions are cryptographically signed and non-repudiable.
- Management Features:
- Overview: Real-time visibility into Raft state, leadership, and commit index.
- User Management: Monitor anonymized usage (`TotalBytes`, `InodeCount`) and adjust quotas.
- Group Management: Monitor group usage and manage group resource quotas.
- Node Operations: Monitor storage node health, join new nodes, or decommission existing ones.
- Administrative Namespace Setup:
  - `mkdir --owner`: Admins can create new empty directories owned by any user. This allows administrators to set up user home directories or shared project spaces without having access to the users' private keys or file content.
- Redaction: Administrative listing APIs (Users, Groups, Nodes) return redacted records, stripping private keys and other sensitive material to maintain the Zero-Knowledge boundary.
- Distributed Lock Visibility: Real-time monitoring of active Inode leases and lock ownership to diagnose contention.
- System Metrics: Visualize cluster performance, including Raft commit latency, I/O throughput, and disk utilization across nodes.
- Deployment: The admin console communicates with the standard API port. Because it relies on Layer 7 E2EE and PQC signatures, it does not require mTLS for client access.
- Write requests sent to Follower nodes are automatically forwarded to the Leader via the Internal Cluster API.
- Read requests can be served locally by Followers (using Read-Index for consistency).
DistFS utilizes a two-tiered trust model to resolve the circular dependency between FSM encryption and node bootstrapping.
- Tier 1: Local Node Vault:
  - On initial bootstrap, the Leader generates a high-entropy ClusterSecret and stores it in its node-local encrypted vault (protected by the node's unique `MasterKey`).
  - During the `Join` handshake, the Leader retrieves the ClusterSecret and the current FSM KeyRing. It encapsulates both for the joining node's Public Encryption Key.
  - The joining node decrypts the payload, persists the `ClusterSecret` in its local Tier 1 vault, and initializes its local BoltDB `system` bucket with the `FSM KeyRing`. This ensures the node is cryptographically ready to apply Raft logs immediately upon joining.
- Tier 2: Cluster Root of Trust (FSM):
  - The BoltDB `system` bucket contains the cluster-wide root metadata, including the FSM KeyRing.
  - Values in the `system` bucket are encrypted using a key derived from the `ClusterSecret`.
  - All other buckets (Inodes, Users, Groups) are encrypted using the rotating `FSM KeyRing`.
- Snapshot Portability:
  - When a Raft snapshot is transferred to a Follower, the `system` bucket remains encrypted with the `ClusterSecret`.
  - Since every authorized Follower has the `ClusterSecret` in its local Tier 1 vault, it can immediately decrypt the root anchors and bootstrap its local FSM state.
DistFS uses a rigorous, recursive trust model to initialize the cluster without relying on hardcoded secrets or central authorities.
The first user to register with a cluster ("Alice") becomes the sovereign anchor for the entire system.
- Identity Anchor: Alice registers her PQC keys. The server automatically grants her administrative privileges as the first user.
- Namespace Root: Alice creates the root Inode (`/`) and becomes its immutable owner.
- Backbone Provisioning: Alice creates the system namespaces (`/registry`, `/users`) and the foundational groups (`admin`, `registry`, `users`).
- Permission Delegation: Alice grants the `users` group Read-Only access to the backbone structures (`/`, `/registry`, `/users`) via POSIX ACLs.
- Self-Attestation: Alice creates her own identity file `/registry/alice.user`, signed with her private key.
When Alice registers a new user ("Bob"), she facilitates a cryptographic hand-off:
- Registration: Bob registers his public keys with the server (Locked by default).
- Backbone Access: Alice unlocks Bob and adds him to the `users` group.
- Lockbox Update: Adding Bob to the `users` group cryptographically encapsulates the `users` group private key for Bob's public key. Bob now has the mathematical means to read the backbone.
To break circular dependencies during trust bootstrapping, DistFS uses an Aggregate Optimistic Verification algorithm. Identity verification is split into two asynchronous phases:
- Optimistic Phase (Discovery):
  - VerifyInode: When an Inode is fetched, the client immediately verifies its ML-DSA signature using the signer's public key (retrieved from the server). It then adds the `SignerID`, `OwnerID`, and any `ACL` members to an aggregate Verification Queue.
  - VerifyGroup: Similarly, when a Group is fetched, its signature is verified using the server-provided signer key, and the group's identities are added to the Verification Queue.
  - Proceed: The client continues the operation (e.g., resolving the next path component) using these provisionally valid keys.
- Confirmation Phase (Anchoring):
  - Once the target object is reachable (or at logical checkpoints like the end of `ResolvePath`), the client processes the Verification Queue.
  - For each ID in the queue, it resolves the registry anchor (`/registry/<ID>.user` or `.group-id`) using the same Optimistic Phase logic (ensuring no recursion).
  - It verifies the attestation signature using the verifier's key from the server.
  - It cross-checks the keys used in the Optimistic Phase against the keys committed in the registry.
  - If the cross-check passes, the identities are promoted to the `verifiedGroupCache`.
This separation allows VerifyInode and VerifyGroup to remain fast and recursion-free, while still ensuring that every key used is eventually validated against the cluster's sovereign anchors.
To ensure the long-term stability of the trust anchor and protect against server-side "Identity Swapping" attacks, clients implement Trust On First Use (TOFU) for the Root Owner.
- Pinned Anchor: The first time a client resolves the root inode and successfully verifies the owner's identity via the registry, it saves the `RootOwnerPublicKey` in its local encrypted configuration.
- Immutability Enforcement: In all subsequent sessions, the client verifies that the Root Owner's ID and Public Key match the pinned values.
- Local Fallback: If the `/registry` becomes unavailable or is tampered with, the client can use its pinned `RootOwnerPublicKey` to verify foundational structures (like the `users` group) that were signed by the anchor.
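The pin-and-check logic can be sketched in Go. The in-memory store below is illustrative; in practice the pin lives in the client's encrypted configuration:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// pinStore holds the client's locally persisted trust anchor.
type pinStore struct {
	rootOwnerID  string
	rootOwnerKey []byte
}

// checkRootOwner implements TOFU: pin on first sight, then reject any
// server-presented root owner that deviates from the pinned identity.
func (p *pinStore) checkRootOwner(id string, key []byte) error {
	if p.rootOwnerKey == nil {
		p.rootOwnerID = id
		p.rootOwnerKey = append([]byte(nil), key...)
		return nil // first use: trust and pin
	}
	if id != p.rootOwnerID || !bytes.Equal(key, p.rootOwnerKey) {
		return errors.New("root owner changed: possible identity-swapping attack")
	}
	return nil
}

func main() {
	var pins pinStore
	fmt.Println(pins.checkRootOwner("alice", []byte{1, 2, 3})) // pins on first use
	fmt.Println(pins.checkRootOwner("alice", []byte{1, 2, 3})) // matches the pin
	fmt.Println(pins.checkRootOwner("mallory", []byte{9}))     // rejected
}
```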