(Work in progress)
This is an experimental project in early development, primarily designed for:
- Understanding distributed storage system internals
- Testing novel approaches to erasure coding and data placement algorithms
- Learning S3 protocol implementation details
- Experimenting with self-healing storage architectures
This is NOT production-ready software.
- Implemented
- S3 basics: ListBuckets (/), CreateBucket (PUT /{bucket}), DeleteBucket (DELETE /{bucket})
- Objects: Put (PUT /{bucket}/{key}), Get (GET), Head (HEAD), Delete (DELETE)
- Range GET support (single range, requires seekable storage)
- ListObjectsV2 (bucket object listing with prefix, delimiter, common prefixes, pagination)
- Multipart uploads (initiate/upload-part/complete/abort)
- Multipart: streaming completion with S3-compatible ETag (MD5 of part ETags + -N)
- Config (YAML + env), structured logging, CI
- Prometheus metrics (/metrics) and HTTP instrumentation
- Tracing: OpenTelemetry scaffold (optional; OTLP gRPC/HTTP); spans include s3.error_code; optional s3.key_hash via config
- Authentication: AWS Signature V4 (optional; header and presigned URL) with clock-skew enforcement and X-Amz-Expires validation
- Local filesystem storage backend (dev/MVP), in-memory metadata store
- Admin API (optional, separate port) with optional OIDC + RBAC: /admin/health, /admin/version; multipart GC endpoint (/admin/gc/multipart)
- Repair pipeline (experimental): sealed integrity failures during GET/HEAD and scrubber scans enqueue repair items to an in-memory queue; a background repair worker runs as a no-op with admin controls
- Repair worker (single-shard rewrite): validates payload hashes, regenerates sealed headers/footers, updates manifests, and exports success/failure metrics
- Repair queue/worker can be enabled via config even when the Admin API is disabled (set repair.enabled: true or SHARDSEAL_REPAIR_ENABLED=true); storage and scrubber enqueues continue and metrics are exported
- Unit tests for buckets/objects/multipart
- Robustness fixes: streaming multipart completion, safe range handling, improved error logging, manifest fsync after atomic writes
- Not yet implemented / in progress
- Self-healing (erasure coding and background rewriter): verification-only scrubber implemented; integrity failures are enqueued for repair, but the worker is currently a no-op (no healing yet). Sealed I/O and integrity verification are available behind feature flags.
- Distributed metadata/placement
- High priority
- Extend the repair worker to multi-shard/RS layouts (streaming rewrite + backoff)
- Add repair orchestration controls (reason-aware scheduling, rate limiting, queue histograms surfaced to admin/UI)
- Expand SigV4 coverage for chunked uploads and odd canonicalization cases (e.g., duplicate headers, session tokens)
- Short term
- S3 op metrics for API (get/put/head/delete/list/multipart)
- Admin: scrubber pause/resume endpoints
- Sealed range tests for payload section reads
- Docs: capture repair queue configuration + admin host-port override tips, and document dashboard/alert wiring for queue depth metrics
- Medium term
- Real RS codec and multi-shard layout; reconstruct on read
- Placement ring across dataDirs; prep for multi-node
- Repair worker: reconstruct + rewrite with retry/backoff
- See project.md for the full, prioritized list.
- Go 1.22+ installed
make build
# Run with sample config (will ensure ./data exists)
SHARDSEAL_CONFIG=configs/local.yaml make run
# Or
# go run ./cmd/shardseal
Default address: :8080 (override with env SHARDSEAL_ADDR).
Data dirs: ./data (override with env SHARDSEAL_DATA_DIRS as comma-separated list).
Bucket naming: 3-63 chars; lowercase letters, digits, dots, hyphens; must start/end with letter or digit.
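As a quick local sanity check, the naming rule above can be approximated with a regex (a sketch only; the server's own validation is authoritative):

```shell
# Approximate the documented bucket-name rule:
# 3-63 chars, lowercase letters/digits/dots/hyphens, alphanumeric at both ends.
is_valid_bucket() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}
is_valid_bucket my-bucket && echo "my-bucket: valid"
is_valid_bucket My_Bucket || echo "My_Bucket: invalid"
```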
# List all buckets
curl -v http://localhost:8080/
# Create a bucket
curl -v -X PUT http://localhost:8080/my-bucket
# Put an object (from stdin)
printf 'Hello, ShardSeal!\n' | curl -v -X PUT http://localhost:8080/my-bucket/hello.txt --data-binary @-
# Get an object
curl -v http://localhost:8080/my-bucket/hello.txt
# Range GET (first 10 bytes)
curl -v -H 'Range: bytes=0-9' http://localhost:8080/my-bucket/hello.txt
# Head object
curl -I http://localhost:8080/my-bucket/hello.txt
# List objects in bucket
curl -s "http://localhost:8080/my-bucket?list-type=2"
# List with prefix filter
curl -s "http://localhost:8080/my-bucket?list-type=2&prefix=folder/"
# Delete object
curl -X DELETE http://localhost:8080/my-bucket/hello.txt
# Delete bucket (must be empty - excludes internal .multipart files)
curl -X DELETE http://localhost:8080/my-bucket
- Requirements: Admin API not required. Ensure the bucket exists. Example uses two parts.
- After completion, ETag equals MD5 of concatenated part MD5s with "-N" suffix.
bucket=my-bucket
object=big.bin
# 1) Initiate multipart upload
uploadId=$(curl -s -X POST "http://localhost:8080/$bucket/$object?uploads" \
| sed -n 's:.*<UploadId>\(.*\)</UploadId>.*:\1:p')
echo "UploadId=$uploadId"
# 2) Upload two parts; capture each returned ETag from response headers
part1ETag=$(printf 'A%.0s' {1..6000000} | \
curl -s -i -X PUT "http://localhost:8080/$bucket/$object?partNumber=1&uploadId=$uploadId" \
--data-binary @- | tr -d '\r' | awk -F': ' '/^ETag:/ {gsub(/\"/,"",$2); print $2}')
part2ETag=$(printf 'B%.0s' {1..6000000} | \
curl -s -i -X PUT "http://localhost:8080/$bucket/$object?partNumber=2&uploadId=$uploadId" \
--data-binary @- | tr -d '\r' | awk -F': ' '/^ETag:/ {gsub(/\"/,"",$2); print $2}')
echo "Part1 ETag=$part1ETag" ; echo "Part2 ETag=$part2ETag"
# 3) Complete using the part list; server streams parts and returns multipart ETag
cat > complete.xml <<XML
<CompleteMultipartUpload>
<Part><PartNumber>1</PartNumber><ETag>"$part1ETag"</ETag></Part>
<Part><PartNumber>2</PartNumber><ETag>"$part2ETag"</ETag></Part>
</CompleteMultipartUpload>
XML
curl -s -X POST "http://localhost:8080/$bucket/$object?uploadId=$uploadId" \
-H 'Content-Type: application/xml' --data-binary @complete.xml
# Response ETag => md5(concat(md5(part1), md5(part2)))-2 (part MD5s concatenated as raw bytes, "-2" = part count)
# 4) Verify object is retrievable
curl -I "http://localhost:8080/$bucket/$object"
go test ./...
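The multipart ETag scheme (MD5 over the concatenated binary part digests, plus a "-N" part-count suffix) can be reproduced offline; this sketch uses two small sample parts and assumes md5sum and xxd are available:

```shell
# Reproduce the multipart ETag for two sample parts entirely offline.
printf 'A%.0s' $(seq 1 100) > part1.bin
printf 'B%.0s' $(seq 1 100) > part2.bin
p1=$(md5sum part1.bin | cut -d' ' -f1)   # per-part ETag (hex MD5)
p2=$(md5sum part2.bin | cut -d' ' -f1)
# Concatenate the raw 16-byte digests, hash again, append "-<partCount>".
etag="$(printf '%s%s' "$p1" "$p2" | xxd -r -p | md5sum | cut -d' ' -f1)-2"
echo "$etag"
```

The printed value has the familiar "<32 hex chars>-2" shape returned by CompleteMultipartUpload.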
# Verbose tests for just the S3 API package
go test ./pkg/api/s3 -v
Two options are provided: a local Docker build and docker-compose. The image exposes:
- 8080: S3 data-plane (configurable via SHARDSEAL_ADDR)
- 9090: Admin API (when adminAddress is configured; docker-compose publishes this on host port ${SHARDSEAL_ADMIN_HOST_PORT:-19090} to avoid clashes with local Prometheus instances)
Build and run (Dockerfile)
# Build the image locally
docker build -t shardseal:dev .
# Run with a mounted data directory and config
# Ensure your config mounts to /home/app/config/config.yaml or set SHARDSEAL_CONFIG accordingly.
docker run --rm -p 8080:8080 -p 9090:9090 \
-v "$(pwd)/data:/home/app/data" \
-v "$(pwd)/configs:/home/app/config:ro" \
-e SHARDSEAL_CONFIG=/home/app/config/local.yaml \
--name shardseal shardseal:dev
Compose (docker-compose.yml)
# Up/Down
docker compose up --build
docker compose down
# Override env from your shell or edit docker-compose.yml as needed.
# Data is mounted at ./data, config at ./configs (read-only) by default.
Notes:
- The container user is a non-root user (app). Data and config are mounted under /home/app.
- To enable Admin API, configure adminAddress in the config or set SHARDSEAL_ADMIN_ADDR (see configs/local.yaml and cmd.shardseal.main).
- By default the compose file publishes the admin listener on host port ${SHARDSEAL_ADMIN_HOST_PORT:-19090} (the container still listens on :9090). Export SHARDSEAL_ADMIN_HOST_PORT=9090 before docker compose up if the default 9090 is free on your machine.
- Repair queue priorities: read-time integrity failures run at highest priority, scrub detections at normal priority, and admin-enqueued tasks at low priority. Metrics are tagged by reason/result for dashboards/alerts.
- Sealed mode can be enabled via:
- YAML: sealed.enabled: true
- Env: SHARDSEAL_SEALED_ENABLED=true
- Integrity scrubber (experimental verification-only) can be enabled via:
- Env: SHARDSEAL_SCRUBBER_ENABLED=true
- Optional overrides:
- SHARDSEAL_SCRUBBER_INTERVAL=1h
- SHARDSEAL_SCRUBBER_CONCURRENCY=2
- SHARDSEAL_SCRUBBER_VERIFY_PAYLOAD=true # overrides sealed.verifyOnRead inheritance
- Admin scrub endpoints (experimental, sealed integrity verification):
- GET /admin/scrub/stats (RBAC: admin.read)
- POST /admin/scrub/runonce (RBAC: admin.scrub)
- The scrubber verifies sealed headers/footers and compares footer content-hash to the manifest. Payload re-hash verification is enabled when sealed.verifyOnRead is true (or forced via SHARDSEAL_SCRUBBER_VERIFY_PAYLOAD). Protect these with OIDC/RBAC as needed (see security.oidc.rbac and cmd.shardseal.main).
- Repair pipeline (experimental): when Admin API is enabled, an in-memory repair queue is created. The storage layer enqueues items on sealed integrity failures during GET/HEAD, and the scrubber enqueues detected failures. A background repair worker starts (currently a no-op) and can be inspected/controlled via admin endpoints.
- The provided docker-compose.yml includes commented environment toggles for sealed mode, scrubber, tracing, admin OIDC, and GC; uncomment to enable as needed.
Enable Admin API (e.g., SHARDSEAL_ADMIN_ADDR=:9090). If OIDC is enabled, include a valid Bearer token; otherwise these endpoints are unauthenticated. When using the provided docker-compose file, ShardSeal publishes the admin listener on host port ${SHARDSEAL_ADMIN_HOST_PORT:-19090} (default 19090), so use http://localhost:19090/admin/health (or whichever host port you exported) when running health checks from the host.
# Queue length
curl -s http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/stats
# Enqueue a repair item (e.g., detected externally)
curl -s -X POST http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/enqueue \
-H 'Content-Type: application/json' \
-d '{
"bucket":"bkt",
"key":"dir/obj.txt",
"shardPath":"objects/bkt/dir/obj.txt/data.ss1",
"reason":"admin"
}'
# Scrubber controls
curl -s http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/scrub/stats
curl -s -X POST http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/scrub/runonce
# Repair worker controls
curl -s http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/worker/stats
curl -s -X POST http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/worker/pause
curl -s -X POST http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/worker/resume
- Enable OIDC via config (oidc.*) or env (SHARDSEAL_OIDC_*). Set issuer (or jwksURL) and the expected clientID/audience.
- Obtain a JWT from your IdP (ID token or access token) whose aud matches the configured audience.
- Pass the token in the Authorization header. Example:
curl -H "Authorization: Bearer $TOKEN" http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/stats
- Health/version exemptions: if configured, /admin/health and /admin/version can be accessed without a token.
- RBAC: endpoints require roles like admin.read, admin.scrub, admin.repair.* (see pkg/security/oidc/rbac.go).
Note: The repair queue/worker can be enabled without the Admin API via config (repair.enabled: true). In that case, the queue and worker run in the background, and metrics are exported; admin endpoints are simply unavailable.
- Exposes Prometheus metrics at /metrics on the same HTTP server.
- Default counters and histograms include:
- shardseal_http_requests_total{method,code}
- shardseal_http_request_duration_seconds_bucket/sum/count{method,code}
- shardseal_http_inflight_requests
- shardseal_storage_bytes_total{op}
- shardseal_storage_ops_total{op,result}
- shardseal_storage_op_duration_seconds_bucket/sum/count{op}
- shardseal_storage_sealed_ops_total{op,sealed,result,integrity_fail}
- shardseal_storage_sealed_op_duration_seconds_bucket/sum/count{op,sealed,integrity_fail}
- shardseal_storage_integrity_failures_total{op}
- shardseal_scrubber_scanned_total
- shardseal_scrubber_errors_total
- shardseal_scrubber_last_run_timestamp_seconds
- shardseal_scrubber_uptime_seconds
- shardseal_repair_queue_depth
- shardseal_repair_enqueued_total{reason}
- shardseal_repair_completed_total{result}
- shardseal_repair_duration_seconds_bucket/sum/count{result}
- Example:
curl -s http://localhost:8080/metrics | head -n 20
- /livez: liveness probe (always OK when the process is running)
- /readyz: readiness probe gated on initialization completion
- /metrics: Prometheus metrics endpoint
- Prometheus sample config: configs/monitoring/prometheus/prometheus.yml
- Example alert rules: configs/monitoring/prometheus/rules.yml
- Grafana dashboard (import JSON): configs/monitoring/grafana/shardseal_overview.json
- Includes sealed I/O metrics, scrubber metrics (scanned/errors/last_run/uptime), and repair metrics (queue_depth). The server polls scrubber stats and repair queue length every 10s and exports to the main registry.
Compose profile (optional monitoring stack):
# 1. Bring up shardseal as usual (uses service 'shardseal')
docker compose up --build -d
# 2. Bring up monitoring stack (Prometheus + Grafana) using the 'monitoring' profile
docker compose --profile monitoring up -d
# Access:
# - ShardSeal (S3 plane): http://localhost:8080
# - ShardSeal Admin (if enabled): http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/health
# - Prometheus: http://localhost:9091
# - Grafana: http://localhost:3000 (default admin/admin)
# Add Prometheus data source at http://prometheus:9090 and import the dashboard:
# configs/monitoring/grafana/shardseal_overview.json
Troubleshooting: to clean up stale compose state and networks, and to re-create containers, run:
# Stop and remove services/anonymous resources from previous runs
# One-liner to remove both the monitoring and base profiles:
docker compose --profile monitoring down --remove-orphans && docker compose down --remove-orphans
# remove base profile only
docker compose down --remove-orphans
# Remove dangling user-defined networks that may reference old IDs
docker network prune -f
# (Optional) If Prometheus data retention is not required, remove its anonymous volume too
# docker volume prune -f
# Rebuild and start the base service
docker compose up --build -d
# Start the monitoring profile (creates the explicit shardseal_net if missing)
docker compose --profile monitoring up -d
Validation
- ShardSeal: http://localhost:8080
- Admin (if enabled): http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/health (default 19090 when using docker compose; use the admin host port you configured otherwise)
- Prometheus: http://localhost:9091 (Targets page should show shardseal:8080 as UP)
- Grafana: http://localhost:3000 (default admin/admin). Add Prometheus datasource at URL http://prometheus:9090 and import dashboard from configs/monitoring/grafana/shardseal_overview.json
Notes:
- Explicit Docker network: docker-compose.yml defines a bridge network "shardseal_net" and attaches shardseal, prometheus, and grafana to it. This avoids stale/implicit network IDs across runs.
- Prometheus scrape target: configs/monitoring/prometheus/prometheus.yml uses "shardseal:8080" (service DNS on the Docker network), not "localhost:8080".
Also verify:
- The Prometheus target inside the container is "shardseal:8080" per configs/monitoring/prometheus/prometheus.yml.
- The Grafana Prometheus datasource URL is "http://prometheus:9090" (both services share the "shardseal_net" network defined in docker-compose.yml).
Tracing and S3 error headers
- Server spans include: http.method, http.target, http.route, http.status_code, user_agent.original, net.peer.ip, http.server_duration_ms.
- S3 attributes (low cardinality): s3.op, s3.bucket_present, s3.admin, s3.error. New: s3.error_code on failures; optional s3.key_hash when enabled.
- Enable s3.key_hash via config (tracing.keyHashEnabled: true) or env (SHARDSEAL_TRACING_KEY_HASH=true). The key hash is sha256(key) truncated to 8 bytes (16 hex chars).
- Error responses include the header X-S3-Error-Code mirroring the S3 error code for observability. This header is only set on error responses.
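For illustration, the truncated key hash can be reproduced with standard tools. This is a sketch: the exact string the server hashes (e.g., whether the bucket name is included) is an assumption to verify against the tracing code.

```shell
# s3.key_hash = first 8 bytes (16 hex chars) of sha256 over the object key.
key='hello.txt'   # hypothetical key; exact input composition is an assumption
hash=$(printf '%s' "$key" | sha256sum | cut -c1-16)
echo "$hash"
```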
Admin endpoints (optional; available when the admin server is enabled). If OIDC is enabled, these endpoints require a valid Bearer token. RBAC defaults are enforced:
- admin.read for GET endpoints
- admin.gc for POST /admin/gc/multipart
- admin.scrub for POST /admin/scrub/runonce
- admin.repair.read for GET /admin/repair/stats
- admin.repair.enqueue for POST /admin/repair/enqueue
- admin.repair.control for POST /admin/repair/worker/pause and /admin/repair/worker/resume
Endpoints:
- /admin/health: JSON status with ready/version/addresses
- /admin/version: JSON version info
- POST /admin/gc/multipart: run a single multipart GC pass (requires RBAC admin.gc; OIDC-protected if enabled)
- /admin/scrub/stats: get current scrubber stats (requires RBAC admin.read)
- POST /admin/scrub/runonce: trigger a single scrub pass (requires RBAC admin.scrub)
- /admin/repair/stats: current repair queue length (requires RBAC admin.repair.read)
- POST /admin/repair/enqueue: enqueue a repair item (requires RBAC admin.repair.enqueue). Body JSON accepts RepairItem fields {bucket, key, shardPath, reason, priority}; the discovered timestamp is auto-populated when omitted. The queue is in-memory in this release.
- /admin/repair/worker/stats: repair worker status and counters (requires RBAC admin.repair.read)
- POST /admin/repair/worker/pause: pause the repair worker (requires RBAC admin.repair.control)
- POST /admin/repair/worker/resume: resume the repair worker (requires RBAC admin.repair.control)
Example at configs/local.yaml:
address: ":8080"
# Optional admin/control plane on a separate port (read-only endpoints)
# adminAddress: ":9090"
dataDirs:
- "./data"
# Authentication (optional)
# authMode: "none" # "none" or "sigv4"
# accessKeys:
# - accessKey: "AKIAEXAMPLE"
# secretKey: "secret"
# user: "local"
# Tracing (optional - OpenTelemetry OTLP)
# tracing:
# enabled: false
# endpoint: "localhost:4317" # grpc default; or "localhost:4318" for http
# protocol: "grpc" # "grpc" or "http"
# sampleRatio: 0.0 # 0.0-1.0
# serviceName: "shardseal"
# keyHashEnabled: false # emit s3.key_hash; or set SHARDSEAL_TRACING_KEY_HASH=true
#
# Sealed mode (experimental)
# sealed:
# enabled: false
# verifyOnRead: false
#
# Integrity Scrubber (experimental - verification only)
# Verifies sealed header/footer CRCs and compares footer content-hash with manifest.
# Payload re-hash verification follows sealed.verifyOnRead (enabled when true).
# scrubber:
# enabled: false
# interval: "1h"
# concurrency: 1
# Repair pipeline (optional; can run without Admin API)
# repair:
# enabled: false # when true, create repair queue and wire storage/scrubber
# workerEnabled: true # start background repair worker (no-op in current milestone)
# workerConcurrency: 1
Additional optional request size limits:
# Request size limits (optional)
limits:
singlePutMaxBytes: 5368709120 # 5 GiB cap for single PUT
minMultipartPartSize: 5242880 # 5 MiB minimum for non-final multipart parts
Environment overrides:
- SHARDSEAL_CONFIG // path to YAML config
- SHARDSEAL_ADDR // data-plane listen address (e.g., 0.0.0.0:8080)
- SHARDSEAL_ADMIN_ADDR // admin-plane listen address (e.g., 0.0.0.0:9090) to enable admin endpoints
- SHARDSEAL_DATA_DIRS // comma-separated data directories
- SHARDSEAL_AUTH_MODE // "none" (default) or "sigv4"
- SHARDSEAL_ACCESS_KEYS // comma-separated ACCESS_KEY:SECRET_KEY[:USER]
- SHARDSEAL_TRACING_ENABLED // "true"/"false"
- SHARDSEAL_TRACING_ENDPOINT // e.g., localhost:4317 (grpc) or localhost:4318 (http)
- SHARDSEAL_TRACING_PROTOCOL // "grpc" or "http"
- SHARDSEAL_TRACING_SAMPLE // 0.0 - 1.0
- SHARDSEAL_TRACING_SERVICE // service.name override
- SHARDSEAL_TRACING_KEY_HASH // "true"/"false"; when true, emit s3.key_hash (first 8 bytes of sha256 of the object key, hex)
- SHARDSEAL_SEALED_ENABLED // "true"/"false" to store objects using sealed format (experimental)
- SHARDSEAL_SEALED_VERIFY_ON_READ // "true"/"false" to verify integrity on GET/HEAD
- SHARDSEAL_SCRUBBER_ENABLED // "true"/"false" to enable background scrubber
- SHARDSEAL_SCRUBBER_INTERVAL // e.g., "1h"
- SHARDSEAL_SCRUBBER_CONCURRENCY // e.g., "2"
- SHARDSEAL_SCRUBBER_VERIFY_PAYLOAD // "true"/"false" to force payload re-hash verification (overrides sealed.verifyOnRead inheritance)
- SHARDSEAL_GC_ENABLED // "true"/"false" to enable multipart GC
- SHARDSEAL_GC_INTERVAL // e.g., "15m"
- SHARDSEAL_GC_OLDER_THAN // e.g., "24h"
- SHARDSEAL_OIDC_ENABLED // "true"/"false" to protect Admin API with OIDC
- SHARDSEAL_OIDC_ISSUER // issuer URL for discovery (preferred)
- SHARDSEAL_OIDC_CLIENT_ID // expected client_id (audience)
- SHARDSEAL_OIDC_AUDIENCE // optional, overrides client_id
- SHARDSEAL_OIDC_JWKS_URL // direct JWKS URL alternative to issuer
- SHARDSEAL_OIDC_ALLOW_UNAUTH_HEALTH // "true"/"false" to allow unauthenticated /admin/health
- SHARDSEAL_OIDC_ALLOW_UNAUTH_VERSION // "true"/"false" to allow unauthenticated /admin/version
- SHARDSEAL_LIMIT_SINGLE_PUT_MAX_BYTES // e.g., 5368709120 (5 GiB)
- SHARDSEAL_LIMIT_MIN_MULTIPART_PART_SIZE // e.g., 5242880 (5 MiB)
- SHARDSEAL_REPAIR_ENABLED // "true"/"false" to enable repair queue without Admin API
- SHARDSEAL_REPAIR_WORKER_ENABLED // "true"/"false" to start repair worker
- SHARDSEAL_REPAIR_WORKER_CONCURRENCY // integer >= 1
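For example, a minimal sealed-mode dev profile might combine a few of these overrides (values are illustrative, not recommendations):

```shell
# Illustrative dev profile: sealed writes with read-time verification and an hourly scrub.
export SHARDSEAL_ADDR=:8080
export SHARDSEAL_DATA_DIRS=./data
export SHARDSEAL_SEALED_ENABLED=true
export SHARDSEAL_SEALED_VERIFY_ON_READ=true
export SHARDSEAL_SCRUBBER_ENABLED=true
export SHARDSEAL_SCRUBBER_INTERVAL=1h
# Then start the server, e.g.: SHARDSEAL_CONFIG=configs/local.yaml make run
```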
Summary
- When enabled, objects are stored as sealed shard files with a header | payload | footer encoding and a JSON manifest persisted alongside per-object metadata. The S3 API remains unchanged (ETag is still MD5 of the payload; SigV4 works the same).
- Range GETs are served by seeking past the header and reading a SectionReader over just the payload. See storage.localfs and storage.manifest.
On-disk layout
- Object directory: ./data/objects/{bucket}/{key}/
- Data file: data.ss1
- Header (little-endian): magic "ShardSealv1" | version:u16 | headerSize:u16 | payloadLen:u64 | headerCRC32C:u32
- Footer: contentHash[32] (sha256 of payload) | footerCRC32C:u32
- Format primitives implemented in erasure.rs with unit tests in erasure.rs_test.
- Manifest: object.meta (JSON, v1)
- Records bucket, key, size, ETag (MD5), lastModified, RS params, and a Shards[] slice with path, content hash algo/hex, payload length, header/footer CRCs.
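From the header fields listed above, the fixed header length can be tallied. This is a sketch assuming the magic is stored as its 11 raw ASCII bytes with no padding; the format code (see erasure.rs) is authoritative.

```shell
# Tally sealed header bytes: magic "ShardSealv1" (11) + version u16 (2)
# + headerSize u16 (2) + payloadLen u64 (8) + headerCRC32C u32 (4).
magic=11        # len("ShardSealv1"), assumed raw ASCII bytes
version=2       # u16
header_size=2   # u16
payload_len=8   # u64
crc=4           # u32
echo $((magic + version + header_size + payload_len + crc))   # prints 27
```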
Behavior
- GET/HEAD prefer sealed objects when a manifest exists; otherwise fall back to plain files (mixing sealed and plain is supported).
- Range GETs use io.SectionReader on the payload region (efficient partial reads).
- DELETE removes the sealed shard and the manifest; LIST derives keys from the parent dir of data.ss1 and reads metadata from the manifest. Implementation details in storage.localfs.
Integrity verification (optional)
- Set sealed.verifyOnRead: true to validate footer CRC and sha256(payload) against the manifest during GET/HEAD.
- Integrity failures are surfaced as 500 InternalError at the S3 layer and annotated in tracing. S3 mapping handled in api.s3.
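The payload check is a plain sha256 comparison; a minimal offline sketch (file and variable names are hypothetical, with the recorded digest standing in for the manifest's content hash):

```shell
# Simulate verify-on-read: compare sha256(payload) against a recorded digest.
printf 'sealed payload bytes' > payload.bin
recorded=$(sha256sum payload.bin | cut -d' ' -f1)   # stand-in for manifest contentHash
printf 'X' >> payload.bin                            # simulate on-disk corruption
actual=$(sha256sum payload.bin | cut -d' ' -f1)
[ "$recorded" = "$actual" ] && echo "integrity OK" || echo "integrity failure"
```

Prints "integrity failure" because the payload was mutated after its digest was recorded.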
Configuration
- YAML (see sample in configs/local.yaml):
- sealed.enabled: false (default)
- sealed.verifyOnRead: false (default)
- Environment:
- SHARDSEAL_SEALED_ENABLED=true|false
- SHARDSEAL_SEALED_VERIFY_ON_READ=true|false
- Sample config and env wiring in cmd.shardseal.main.
Observability
- Tracing: storage.sealed=true for sealed ops; storage.integrity_fail=true when verification fails.
- Prometheus (emitted by obs.metrics.storage):
- shardseal_storage_bytes_total{op}
- shardseal_storage_ops_total{op,result}
- shardseal_storage_op_duration_seconds_bucket/sum/count{op}
- shardseal_storage_sealed_ops_total{op,sealed,result,integrity_fail}
- shardseal_storage_sealed_op_duration_seconds_bucket/sum/count{op,sealed,integrity_fail}
- shardseal_storage_integrity_failures_total{op}
Migration and compatibility
- Enabling sealed mode affects only newly written objects. Existing plain files remain readable; GET/HEAD fall back to plain when no manifest is present.
- Disabling sealed mode does not delete existing sealed objects; they continue to be served via manifest. You can transition gradually and mix sealed/plain safely.
- ETag policy: MD5 of the full object payload is preserved for S3 compatibility in sealed mode for single-part PUTs. For CompleteMultipartUpload, the ETag is the S3-compatible multipart form (MD5 of the concatenated part MD5s with a "-N" part-count suffix), as in the multipart example earlier in this README. This may become configurable in a future release.
Scrubber behavior
- Performs sealed integrity verification: validates sealed headers/footers and footer content-hash against the manifest; optional payload re-hash when sealed.verifyOnRead is true.
- See the Admin endpoints section above for routes and RBAC.
- Disabled by default. Enable SigV4 and provide credentials either via config or environment:
export SHARDSEAL_AUTH_MODE=sigv4
export SHARDSEAL_ACCESS_KEYS='AKIAEXAMPLE:secret:local'
# Run server after setting env
SHARDSEAL_CONFIG=configs/local.yaml make run
When enabled, the server requires valid AWS Signature V4 on S3 requests (both Authorization header and presigned URLs are supported). Health endpoints (/livez, /readyz, /metrics) remain unauthenticated.
- Authentication: optional. AWS SigV4 supported (header and presigned; disabled by default via config/env).
- ETag is MD5 of the full object for single-part PUTs; for multipart completes, the ETag is the S3-compatible multipart form (MD5 of the concatenated part MD5s with a "-N" part-count suffix).
- Objects stored under ./data/objects/{bucket}/{key}
- Multipart temporary parts are stored in a separate staging bucket: .multipart////part.N (excluded from user listings and bucket-empty checks; cleaned up on complete/abort)
- Range requests require seekable storage (LocalFS supports this)
- Single PUT size cap: 5 GiB (configurable via limits.singlePutMaxBytes or env SHARDSEAL_LIMIT_SINGLE_PUT_MAX_BYTES). Larger uploads must use Multipart Upload (responds with S3 error code EntityTooLarge).
- Error detail: EntityTooLarge responses include MaxAllowedSize and a hint to use Multipart Upload.
- Multipart part size: 5 MiB minimum for all parts except the final part (configurable via limits.minMultipartPartSize or env SHARDSEAL_LIMIT_MIN_MULTIPART_PART_SIZE). Intended for S3 compatibility; very small multi-part aggregates used in tests may bypass this check.
- LocalFS writes are atomic via temp+rename on Put, reducing risk of partial files on error.
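A client can decide up front whether an upload fits under the single-PUT cap; a sketch using the documented default of 5368709120 bytes for limits.singlePutMaxBytes:

```shell
# Pick single PUT vs multipart based on the configured cap.
limit=5368709120                      # default single-PUT cap (5 GiB)
printf 'tiny demo object' > demo.bin  # stand-in for the file to upload
size=$(wc -c < demo.bin | tr -d ' ')
if [ "$size" -gt "$limit" ]; then
  echo "use multipart upload"
else
  echo "single PUT ok"
fi
```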
- Implemented AWS SigV4 authentication verification (headers and presigned) with unit tests
- Exposed Prometheus metrics at /metrics and added HTTP instrumentation middleware
- Added liveness (/livez) and readiness (/readyz) endpoints; readiness gated after initialization
- Fixed critical memory issues: streaming multipart completion; safe handling for non-seekable Range GET
- Hid internal multipart files from listings and bucket-empty checks; normalized temp part layout
- Tracing enrichment: error responses now set X-S3-Error-Code; tracing middleware records s3.error_code.
- Optional s3.key_hash attribute on spans (sha256(key) truncated to 8 bytes hex), configurable via tracing.keyHashEnabled or env SHARDSEAL_TRACING_KEY_HASH=true.
- README, sample config, and tests updated accordingly.
- ShardSeal v1 sealed mode (experimental, feature-flagged):
- LocalFS now writes sealed shard files (header | payload | footer) and persists a JSON manifest; Range GETs are served via a SectionReader. Delete/List are aware of sealed layout (see storage.localfs, storage.manifest).
- Optional verifyOnRead validates footer CRC and sha256(payload) on GET/HEAD; integrity failures are mapped to 500 InternalError at the S3 layer (see api.s3).
- Tests added for storage-level and S3-level sealed behavior including corruption detection (see storage.localfs_sealed_test, api.s3.server_sealed_test).
- Observability: tracing annotates storage.sealed and storage.integrity_fail; Prometheus sealed I/O metrics added (see obs.metrics.storage).
- Admin repair control surface:
- Queue endpoints: GET /admin/repair/stats, POST /admin/repair/enqueue
- Worker endpoints: GET /admin/repair/worker/stats, POST /admin/repair/worker/pause, POST /admin/repair/worker/resume
- RBAC roles: admin.repair.read, admin.repair.enqueue, admin.repair.control
- Observability:
- shardseal_repair_queue_depth metric with periodic polling
- Prometheus recording rules and alerts for repair queue depth
- Grafana panels for repair queue depth (stat and timeseries)
- Documentation: Admin endpoints, RBAC roles, and monitoring sections updated to reflect current state.
- Fixed the sealed shard header encoder so the emitted byte length always matches the advertised header size, preventing payload shifts after repairs or rewrites.
- Added regression testing around header-size invariants to catch future encoder regressions early.
- ShardSeal v1 storage format + erasure coding
- Background scrubber and self-healing
- Admin API hardening (OIDC/RBAC), monitoring assets (dashboards/alerts)
Use this single command to fully reset the stack, remove stale networks/containers from older runs, rebuild, and bring up monitoring:
bash -lc 'docker compose --profile monitoring down --remove-orphans; docker compose down --remove-orphans; docker ps -q --filter network=shardseal_net | xargs -r docker rm -f; docker network rm shardseal_net 2>/dev/null || true; docker network prune -f; docker compose up --build -d; docker compose --profile monitoring up -d'
Notes:
- This works even if an older Compose run created a legacy fixed-name network "shardseal_net" with stray containers still attached.
- For current versions of this repo, the Compose network is project-scoped (no fixed name) per docker-compose.yml; docker compose down --remove-orphans will remove it automatically unless unrelated containers attach to it.
- Prometheus scrapes the service via Docker DNS at shardseal:8080 per configs/monitoring/prometheus/prometheus.yml.
If you prefer step-by-step commands, see the "Troubleshooting infos" and "Validation" sections above.
Quick start:
# 1) Run shardseal (default :8080 exposes /metrics)
SHARDSEAL_CONFIG=configs/local.yaml make run
# 2) Start Prometheus (adjust path as needed)
prometheus --config.file=configs/monitoring/prometheus/prometheus.yml
# 3) Import Grafana dashboard JSON:
# configs/monitoring/grafana/shardseal_overview.json
# and set the Prometheus datasource accordingly.
AGPL-3.0-or-later
Early-stage experimental project; contributions welcome, especially in areas of:
- Erasure coding implementations
- Distributed systems algorithms
- Storage integrity verification techniques
- Performance optimizations
Please keep code documented and tested. Note that the project structure and APIs may change significantly as the design evolves.
