After the chunker scatters a file across hundreds of segments, you need something to keep track of where everything is. That's FileStore. It scans the archive, reads manifests, and handles the "give me this file" and "fix this corrupted segment" logic.
No chunking happens here, no encoding - just discovery, reconstruction, and repair. I built it to be stateful (remembers the archive path) because passing that path to every function call was getting annoying.
filestore/
├── mod.rs # Discovery, reconstruction, path utilities
├── health.rs # Repair functions per tier
├── models.rs # File and manifest data structures
└── tests.rs # Health check and reconstruction tests
Unlike the chunker (stateless), FileStore holds onto the archive path.
The archive manager. You point it at a directory, it figures out what's inside.
let store = FileStore::new(Path::new("archive_directory"))?;
let files = store.get_all()?; // Scans and returns all committed files
Stateful - remembers the archive path. Create once, use many times.
Represents one committed file with everything you need to work with it.
pub struct File {
pub file_name: String, // Original filename (e.g., "movie.mkv")
pub file_data: FileData, // Hash and manifest path
pub manifest: ManifestFile, // Parsed manifest.json (tier, segments, merkle tree, etc.)
}
When you call store.find("movie.mkv"), you get back a File struct. From there, you can reconstruct it, repair it, or query its metadata.
Minimal info needed to locate a file in the archive.
pub struct FileData {
pub hash: String, // BLAKE3 hash of original file (unique identifier)
pub path: String, // Path to manifest.json for this file
}
Filenames can collide (two files named data.bin with different content). The hash is the true identity. Archive directories are named filename_hash for uniqueness.
Scans the entire archive directory, finds all manifest.json files, parses them, returns a list of File objects.
let store = FileStore::new(Path::new("archive_directory"))?;
for file in store.get_all()? {
println!("{}: {} bytes, tier {}",
file.file_name,
file.manifest.size,
file.manifest.tier
);
}
How it works:
- Walk the archive directory tree
- Find every manifest.json file
- Parse the JSON → ManifestFile
- Extract filename and hash from the directory name (filename_hash)
- Build File struct, add to results
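Not the real implementation, but a minimal sketch of that scan. The helper name, the anyhow error handling, and returning bare (name, hash) pairs instead of full File structs are all my assumptions:

```rust
use std::{fs, path::Path};

// Hypothetical sketch of the get_all() walk: any directory holding a
// manifest.json is a committed file, and its name encodes "filename_hash".
// (Single directory level for brevity; the real scan walks the tree.)
fn scan_archive(archive: &Path) -> anyhow::Result<Vec<(String, String)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(archive)? {
        let dir = entry?.path();
        if !dir.join("manifest.json").is_file() {
            continue; // not a committed file
        }
        let name = dir
            .file_name()
            .map(|n| n.to_string_lossy().into_owned())
            .unwrap_or_default();
        // Split on the LAST '_' so filenames containing underscores survive.
        if let Some((file_name, hash)) = name.rsplit_once('_') {
            found.push((file_name.to_string(), hash.to_string()));
        }
    }
    Ok(found)
}
```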
Performance: Scans in O(n) where n = number of committed files. For 1000 files, takes ~100ms on HDD, ~10ms on SSD.
Locate a specific file by name. Faster than get_all() if you know what you want.
let dataset = store.find(&"my_dataset.bin".to_string())?;
How it works:
- Call get_all() (yeah, we scan everything, no index yet)
- Filter for matching filename
- Return first match or error if not found
TODO: Build an in-memory index on first scan to make subsequent find() calls O(1).
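As a sketch, that lookup is just a linear pass over get_all() (standalone function and error type are mine, for illustration only):

```rust
// Hypothetical standalone version of find(): first name match wins.
// O(n) per call until the index from the TODO above exists.
fn find_file(store: &FileStore, name: &str) -> anyhow::Result<File> {
    store
        .get_all()?
        .into_iter()
        .find(|f| f.file_name == name)
        .ok_or_else(|| anyhow::anyhow!("file not found: {name}"))
}
```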
Returns just the paths to all manifest.json files; it doesn't parse them. Useful for scripts that only need to know what exists.
let manifests = store.all_files()?;
println!("Archive contains {} files", manifests.len());Rebuilds the original file from its segments. Output goes to reconstructed/{filename}.
let file = store.find(&"movie.mkv".to_string())?;
store.reconstruct(&file)?;
// File written to: reconstructed/movie.mkv
Checks file.manifest.tier and calls the appropriate reconstruction method.
The original file is stored as data.dat. Copy it.
let data = fs::read("archive/file_hash/data.dat")?;
fs::write("reconstructed/file.txt", data)?;For segmented files, reassemble by reading segments in order and concatenating them.
Process:
- Open output file for writing
- For i in 0..num_segments:
  - Read segment_i.dat (tier 2) or block_X/segments/segment_Y.dat (tier 3)
  - Append to output file
- Verify final file hash matches manifest
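A minimal sketch of the tier-2 case of this loop, assuming the blake3 crate for hashing; the function name, parameters, and hard-coded output path are made up:

```rust
use std::{fs, io::Write, path::Path};

// Sketch: append segments in index order, hashing as we go so the final
// verification doesn't need a second pass over the output file.
fn reconstruct_t2(dir: &Path, num_segments: usize, expected_hash: &str) -> anyhow::Result<()> {
    let mut out = fs::File::create("reconstructed/movie.mkv")?;
    let mut hasher = blake3::Hasher::new();
    for i in 0..num_segments {
        let segment = fs::read(dir.join(format!("segments/segment_{i}.dat")))?;
        hasher.update(&segment);
        out.write_all(&segment)?;
    }
    anyhow::ensure!(
        hasher.finalize().to_hex().as_str() == expected_hash,
        "reconstructed hash does not match manifest"
    );
    Ok(())
}
```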
Performance: Limited by sequential disk read speed. For a 10GB file on HDD: ~60 seconds. On SSD: ~10 seconds.
When a segment corrupts, we can mathematically reconstruct it from the surviving segments and parity shards.
Entry point for self-healing. Detects corruption, fetches parity, runs Reed-Solomon recovery, writes fixed segments back to disk.
let file = store.find(&"important_data.bin".to_string())?;
store.repair(&file)?; // Auto-detects tier and repairs
Checks tier, calls repair_tiny, repair_segment, or repair_blocked.
Tier 1 files have 1 data file (data.dat) and 3 parity files. If data.dat corrupts, copy a parity file over it. No Reed-Solomon decoding needed.
Strategy:
- Read data.dat, compute BLAKE3 hash
- Compare to manifest.original_hash
- If match → file is healthy, done
- If mismatch → corruption detected, try parity files
- For each parity_N.dat:
  - Read it, compute hash
  - If hash matches manifest → copy to data.dat, done
- If no parity files match → unrecoverable (all 4 copies are corrupt)
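The whole strategy fits in a few lines. A hedged sketch (the function name and error handling are mine, not repair_tiny's actual signature):

```rust
use std::{fs, path::Path};

// Sketch of the tier-1 strategy: any of the 4 copies whose BLAKE3 hash
// matches the manifest can stand in for data.dat.
fn repair_tier1(dir: &Path, original_hash: &str) -> anyhow::Result<()> {
    let healthy = |bytes: &[u8]| blake3::hash(bytes).to_hex().as_str() == original_hash;
    let data = dir.join("data.dat");
    if fs::read(&data).map_or(false, |b| healthy(&b)) {
        return Ok(()); // data.dat matches the manifest: nothing to do
    }
    for i in 0..3 {
        if let Ok(parity) = fs::read(dir.join(format!("parity_{i}.dat"))) {
            if healthy(&parity) {
                fs::write(&data, parity)?; // promote the intact copy
                return Ok(());
            }
        }
    }
    anyhow::bail!("unrecoverable: all 4 copies are corrupt")
}
```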
RS(1,3) encoding means the 3 parity files are functionally identical to the data file. Any one of them can replace the original.
Tier 2 files have per-segment parity. Check each segment independently, recover the corrupt ones.
Strategy:
- For each segment index i:
  - Read segment_i.dat + its 3 parity files
  - Compute combined hash (segment + parity)
  - Compare to Merkle leaf for segment i
- Collect list of corrupt segment indices
- For each corrupt segment:
  - Read the 3 parity shards for that segment
  - Use Reed-Solomon RS(1,3) decoder
    - Input: 3 parity shards (data shard is missing/corrupt)
    - Output: recovered original segment
  - Verify recovered segment hash matches manifest
  - Write recovered segment to segment_i.dat
RS(1,3) can recover the 1 data shard from ANY 1 of the parity shards. Even if the segment is completely gone, we can reconstruct it perfectly.
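For illustration, here's that decode with the reed-solomon-erasure crate (an assumption; the docs above don't name the decoder the module actually uses):

```rust
use reed_solomon_erasure::galois_8::ReedSolomon;

// Sketch: slot 0 holds the missing/corrupt data shard, slots 1-3 the parity.
// reconstruct() fills in every None as long as one valid shard survives.
fn recover_segment(parity: [Option<Vec<u8>>; 3]) -> anyhow::Result<Vec<u8>> {
    let rs = ReedSolomon::new(1, 3)?; // RS(1,3): 1 data shard, 3 parity shards
    let mut shards: Vec<Option<Vec<u8>>> = vec![None]; // the corrupt segment
    shards.extend(parity);
    rs.reconstruct(&mut shards)?;
    Ok(shards[0].take().expect("reconstruct fills missing shards"))
}
```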
Limitations:
- Can recover if segment is corrupt/missing but parity is intact
- Cannot recover if segment + all 3 parity are corrupt (need at least 1 valid shard)
Performance: For a 1GB file with 10 corrupt segments out of 32 total segments:
- Detect corruption: ~2 seconds (hash all segments)
- Recover 10 segments: ~500ms (RS decoding is fast)
- Write recovered segments: ~300ms
- Total: ~3 seconds
Tier 3 uses block-level parity: 30 segments per block, 3 parity shards for the entire block. This means we can lose up to 3 segments per block and still recover.
The strategy:
- Iterate each blocks/block_N/ directory
- For this block:
  - Check all 30 segments for corruption (hash verification)
  - Identify which segments are missing/corrupt
  - If ≤3 segments are corrupt → recoverable via RS(30,3)
  - If >3 segments are corrupt → unrecoverable (not enough data)
- For recoverable blocks:
  - Read all valid segments (27 if 3 are corrupt)
  - Read the 3 block parity shards from block_N/parity/
  - Use Reed-Solomon RS(30,3) decoder
    - Input: 27 data shards + 3 parity shards = 30 total shards
  - Write recovered segments back to disk
How RS(30,3) recovery works: Reed-Solomon with 30 data shards + 3 parity shards creates 33 total shards. You need ANY 30 of those 33 to reconstruct all 30 originals. So you can lose:
- 3 data segments (use 27 data + 3 parity = 30 shards)
- 2 data + 1 parity (use 28 data + 2 parity = 30 shards)
- All 3 parity (use all 30 data = 30 shards)
Any combination works as long as you have 30 valid shards total.
Example recovery: Block 0 has 30 segments (segment_0 through segment_29). Corruption detected: segment_5, segment_12, segment_21.
Recovery process:
- Read segments 0-4, 6-11, 13-20, 22-29 (27 valid segments)
- Read parity_0.dat, parity_1.dat, parity_2.dat (3 parity shards)
- Feed all 30 shards to RS decoder
- Decoder outputs segments 5, 12, 21 (recovered)
- Verify recovered segment hashes match Merkle tree
- Write segment_5.dat, segment_12.dat, segment_21.dat
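Under the same crate assumption as the tier-2 sketch, the decode step of that process looks roughly like:

```rust
use reed_solomon_erasure::galois_8::ReedSolomon;

// Sketch: 33 slots (segments 0-29, then parity 30-32). Corrupt segments
// 5, 12, and 21 arrive as None; reconstruct_data() rebuilds data shards only.
fn recover_block(mut shards: Vec<Option<Vec<u8>>>) -> anyhow::Result<Vec<Vec<u8>>> {
    assert_eq!(shards.len(), 33, "expect 30 data + 3 parity slots");
    let rs = ReedSolomon::new(30, 3)?; // RS(30,3)
    rs.reconstruct_data(&mut shards)?; // succeeds with any 30 of the 33 shards
    Ok(shards
        .into_iter()
        .take(30)
        .map(|s| s.expect("data shards reconstructed"))
        .collect())
}
```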
Performance: For a 10GB file with 3 corrupt segments in one block:
- Detect corruption: ~5 seconds (hash all segments)
- RS decode block: ~1 second (30 × 32MB segments)
- Write 3 recovered segments: ~300ms
- Total: ~6 seconds
Why tier 3 repair is impressive: You can lose 3 out of every 30 segments (10% of the file) and still recover perfectly. Compared to tier 2, where each lost segment needs its own parity recovery, tier 3 is way more fault-tolerant for large files.
The FileStore abstracts away the messy directory structure. You don't need to remember whether parity lives in parity/ or blocks/block_N/parity/; these functions handle it.
Returns sorted list of all segment file paths for a file, regardless of tier.
let paths = store.get_segments_paths(&file)?;
// Tier 2: ["archive/file_hash/segments/segment_0.dat", "segment_1.dat", ...]
// Tier 3: ["archive/file_hash/blocks/block_0/segments/segment_0.dat", ...]
Use case: When you need to iterate all segments (e.g., for full verification or reconstruction).
Returns all parity file paths for a file.
let parity = store.get_parity_paths(&file)?;
// Tier 1: ["archive/file_hash/parity_0.dat", "parity_1.dat", "parity_2.dat"]
// Tier 2: ["archive/file_hash/parity/segment_0_parity_0.dat", ...]
// Tier 3: ["archive/file_hash/blocks/block_0/parity/parity_0.dat", ...]
Use case: When you need to verify or re-read parity data during repair.
get_segment_path(file, segment_id) -> PathBuf // Tier 2
get_block_segment_path(file, block, segment) -> PathBuf // Tier 3
get_data_path(file) -> PathBuf // Tier 1
get_parity_path_t1(file, parity_id) -> PathBuf
get_parity_path_t2(file, segment, parity_id) -> PathBuf
get_parity_path_t3(file, block, parity_id) -> PathBuf
These handle the tier-specific directory structures so you don't have to hardcode paths everywhere. (There's also a legacy helper from the Gen 1 structure that returns chunk paths within segments.)
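For example (assuming the argument order listed above; whether file is passed by reference is my guess):

```rust
// Hypothetical usage: the 2nd parity shard (index 1) of segment 5, tier 2.
let p = store.get_parity_path_t2(&file, 5, 1);
// → archive/file_hash/parity/segment_5_parity_1.dat
```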
Computes the Merkle leaf hash for a segment + its parity shards. Used during repair to verify segments.
let segment = fs::read("segment_5.dat")?;
let parity = vec![
fs::read("segment_5_parity_0.dat")?,
fs::read("segment_5_parity_1.dat")?,
fs::read("segment_5_parity_2.dat")?,
];
let combined_hash = store.hash_segment_with_parity(&segment, &parity)?;
if combined_hash != manifest.merkle_tree.leaves[5] {
println!("Segment 5 is corrupt!");
}
Why hash segment+parity together? Because we want to detect parity corruption too. If we only hashed segments, corrupt parity would go unnoticed until we tried to use it for recovery (too late).
Check and repair every file in the archive:
let store = FileStore::new(Path::new("archive_directory"))?;
for file in store.get_all()? {
match store.repair(&file) {
Ok(_) => println!("{}: healthy or repaired", file.file_name),
Err(e) => println!("{}: unrecoverable - {}", file.file_name, e),
}
}
Repair and reconstruct a single file:
let store = FileStore::new(Path::new("archive_directory"))?;
let dataset = store.find(&"critical_data.bin".to_string())?;
store.repair(&dataset)?;
store.reconstruct(&dataset)?;

Errors you might hit:

| Error | Cause | Recovery |
|---|---|---|
| File not found | Manifest missing or corrupt | Re-commit original |
| Unrecoverable | Too many shards lost | Restore from backup |
| Parse error | Malformed manifest.json | Manual intervention |
| Segment hash mismatch | Segment corrupt, parity also corrupt | Restore from backup |
| Not enough shards for recovery | Too many segments/parity corrupt | Restore from backup |
| Permission denied | File permissions, disk full, SELinux/AA | Check permissions |
For reference, the on-disk layout per tier:

Tier 1:
{archive}/
  data.dat
  parity_0.dat, parity_1.dat, parity_2.dat
  manifest.json

Tier 2:
{archive}/
  segments/
    segment_0.dat ... segment_N.dat
  parity/
    segment_0_parity_0.dat ... segment_N_parity_2.dat
  manifest.json

Tier 3:
{archive}/
  blocks/
    block_0/
      segments/segment_0.dat ... segment_29.dat
      parity/block_parity_0.dat ... block_parity_2.dat
    block_1/
      ...
  manifest.json