ShardSeal - Open S3-compatible, self-healing object store written in Go.

(Work in progress)

Project Status & Goals

Current State

This is an experimental project in early development, primarily designed for:

  • Understanding distributed storage system internals
  • Testing novel approaches to erasure coding and data placement algorithms
  • Learning S3 protocol implementation details
  • Experimenting with self-healing storage architectures

This is NOT production-ready software.


  • Implemented
    • S3 basics: ListBuckets (/), CreateBucket (PUT /{bucket}), DeleteBucket (DELETE /{bucket})
    • Objects: Put (PUT /{bucket}/{key}), Get (GET), Head (HEAD), Delete (DELETE)
    • Range GET support (single range, requires seekable storage)
    • ListObjectsV2 (bucket object listing with prefix, delimiter, common prefixes, pagination)
    • Multipart uploads (initiate/upload-part/complete/abort)
    • Multipart: streaming completion with S3-compatible ETag (MD5 of part ETags + -N)
    • Config (YAML + env), structured logging, CI
    • Prometheus metrics (/metrics) and HTTP instrumentation
    • Tracing: OpenTelemetry scaffold (optional; OTLP gRPC/HTTP); spans include s3.error_code; optional s3.key_hash via config
    • Authentication: AWS Signature V4 (optional; header and presigned URL) with clock-skew enforcement and X-Amz-Expires validation
    • Local filesystem storage backend (dev/MVP), in-memory metadata store
    • Admin API (optional, separate port) with optional OIDC + RBAC: /admin/health, /admin/version; multipart GC endpoint (/admin/gc/multipart)
    • Repair pipeline (experimental): sealed integrity failures during GET/HEAD and scrubber scans enqueue repair items to an in-memory queue; a background repair worker runs as a no-op with admin controls
    • Repair worker (single-shard rewrite): validates payload hashes, regenerates sealed headers/footers, updates manifests, and exports success/failure metrics
    • Repair queue/worker can be enabled via config even when the Admin API is disabled (set repair.enabled: true / SHARDSEAL_REPAIR_ENABLED=true); storage and scrubber enqueues keep working and metrics are exported
    • Unit tests for buckets/objects/multipart
    • Robustness fixes: streaming multipart completion, safe range handling, improved error logging, manifest fsync after atomic writes
  • Not yet implemented / in progress
    • Self-healing (erasure coding and background rewriter): verification-only scrubber implemented; integrity failures are enqueued for repair, but the worker is currently a no-op (no healing yet). Sealed I/O and integrity verification are available behind feature flags.
    • Distributed metadata/placement

Roadmap / TODO (Summary)

  • High priority
    • Extend the repair worker to multi-shard/RS layouts (streaming rewrite + backoff)
    • Add repair orchestration controls (reason-aware scheduling, rate limiting, queue histograms surfaced to admin/UI)
    • Expand SigV4 coverage for chunked uploads and odd canonicalization cases (e.g., duplicate headers, session tokens)
  • Short term
    • S3 op metrics for API (get/put/head/delete/list/multipart)
    • Admin: scrubber pause/resume endpoints
    • Sealed range tests for payload section reads
    • Docs: capture repair queue configuration + admin host-port override tips, and document dashboard/alert wiring for queue depth metrics
  • Medium term
    • Real RS codec and multi-shard layout; reconstruct on read
    • Placement ring across dataDirs; prep for multi-node
    • Repair worker: reconstruct + rewrite with retry/backoff
  • See project.md for the full, prioritized list.

Quick start

Prerequisites

  • Go 1.22+ installed

Build and run

make build
# Run with sample config (will ensure ./data exists)
SHARDSEAL_CONFIG=configs/local.yaml make run
# Or
# go run ./cmd/shardseal

Default address: :8080 (override with env SHARDSEAL_ADDR).
Data dirs: ./data (override with env SHARDSEAL_DATA_DIRS as comma-separated list).

Using with curl (auth disabled by default; SigV4 optional)

Bucket naming: 3-63 chars; lowercase letters, digits, dots, hyphens; must start/end with letter or digit.
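The naming rule above is easy to check locally before calling the API. A minimal sketch (the regex is an illustration of the stated rule, not necessarily the server's exact validator):

```shell
# Sketch: bucket-name check per the rule above (3-63 chars; lowercase
# letters, digits, dots, hyphens; must start and end with a letter or digit)
valid_bucket() {
  [[ "$1" =~ ^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$ ]]
}

valid_bucket my-bucket && echo "ok"
valid_bucket My-Bucket || echo "rejected (uppercase)"
```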

# List all buckets
curl -v http://localhost:8080/

# Create a bucket
curl -v -X PUT http://localhost:8080/my-bucket

# Put an object (from stdin)
printf 'Hello, ShardSeal!\n' | curl -v -X PUT http://localhost:8080/my-bucket/hello.txt --data-binary @-

# Get an object
curl -v http://localhost:8080/my-bucket/hello.txt

# Range GET (first 10 bytes)
curl -v -H 'Range: bytes=0-9' http://localhost:8080/my-bucket/hello.txt

# Head object
curl -I http://localhost:8080/my-bucket/hello.txt

# List objects in bucket
curl -s "http://localhost:8080/my-bucket?list-type=2"

# List with prefix filter
curl -s "http://localhost:8080/my-bucket?list-type=2&prefix=folder/"

# Delete object
curl -X DELETE http://localhost:8080/my-bucket/hello.txt

# Delete bucket (must be empty; internal .multipart files are excluded from the emptiness check)
curl -X DELETE http://localhost:8080/my-bucket

Multipart upload example (ETag behavior)

  • Prerequisites: the Admin API is not required; the bucket must already exist. The example uses two parts.
  • After completion, the ETag is the MD5 of the concatenated binary part MD5s, with a "-N" suffix (N = number of parts).
bucket=my-bucket
object=big.bin

# 1) Initiate multipart upload
uploadId=$(curl -s -X POST "http://localhost:8080/$bucket/$object?uploads" \
  | sed -n 's:.*<UploadId>\(.*\)</UploadId>.*:\1:p')
echo "UploadId=$uploadId"

# 2) Upload two parts; capture each returned ETag from response headers
part1ETag=$(printf 'A%.0s' {1..6000000} | \
  curl -s -i -X PUT "http://localhost:8080/$bucket/$object?partNumber=1&uploadId=$uploadId" \
       --data-binary @- | tr -d '\r' | awk -F': ' '/^ETag:/ {gsub(/\"/,"",$2); print $2}')

part2ETag=$(printf 'B%.0s' {1..6000000} | \
  curl -s -i -X PUT "http://localhost:8080/$bucket/$object?partNumber=2&uploadId=$uploadId" \
       --data-binary @- | tr -d '\r' | awk -F': ' '/^ETag:/ {gsub(/\"/,"",$2); print $2}')

echo "Part1 ETag=$part1ETag" ; echo "Part2 ETag=$part2ETag"

# 3) Complete using the part list; server streams parts and returns multipart ETag
cat > complete.xml <<XML
<CompleteMultipartUpload>
  <Part><PartNumber>1</PartNumber><ETag>"$part1ETag"</ETag></Part>
  <Part><PartNumber>2</PartNumber><ETag>"$part2ETag"</ETag></Part>
</CompleteMultipartUpload>
XML

curl -s -X POST "http://localhost:8080/$bucket/$object?uploadId=$uploadId" \
     -H 'Content-Type: application/xml' --data-binary @complete.xml
# Response ETag => md5(md5(part1) || md5(part2)) with "-2" suffix

# 4) Verify object is retrievable
curl -I "http://localhost:8080/$bucket/$object"
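The multipart ETag can also be double-checked offline. A minimal sketch, assuming GNU coreutils (md5sum, xxd) and illustrative part file names:

```shell
# Sketch: compute the expected multipart ETag from local part files.
# AWS convention: md5 over the concatenated *binary* part digests, plus "-N".
multipart_etag() {
  local n=$#
  for p in "$@"; do
    md5sum "$p" | cut -d' ' -f1        # hex digest of each part
  done | xxd -r -p | md5sum | cut -d' ' -f1 | sed "s/\$/-$n/"
}

printf 'A%.0s' {1..100} > part1
printf 'B%.0s' {1..100} > part2
multipart_etag part1 part2             # 32 hex chars followed by "-2"
```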

Testing

go test ./...
# Verbose tests for just the S3 API package
go test ./pkg/api/s3 -v

Docker (dev)

Two options are provided: local Docker build and docker-compose. The image exposes:

  • 8080: S3 data-plane (configurable via SHARDSEAL_ADDR)
  • 9090: Admin API (when adminAddress is configured; docker-compose publishes this on host port ${SHARDSEAL_ADMIN_HOST_PORT:-19090} to avoid clashes with local Prometheus instances)

Build and run (Dockerfile)

# Build the image locally
docker build -t shardseal:dev .

# Run with a mounted data directory and config
# Ensure your config mounts to /home/app/config/config.yaml or set SHARDSEAL_CONFIG accordingly.
docker run --rm -p 8080:8080 -p 9090:9090 \
  -v "$(pwd)/data:/home/app/data" \
  -v "$(pwd)/configs:/home/app/config:ro" \
  -e SHARDSEAL_CONFIG=/home/app/config/local.yaml \
  --name shardseal shardseal:dev

Compose (docker-compose.yml)

# Up/Down
docker compose up --build
docker compose down

# Override env from your shell or edit docker-compose.yml as needed.
# Data is mounted at ./data, config at ./configs (read-only) by default.

Notes:

  • The container user is a non-root user (app). Data and config are mounted under /home/app.
  • To enable Admin API, configure adminAddress in the config or set SHARDSEAL_ADMIN_ADDR (see configs/local.yaml and cmd.shardseal.main).
  • By default the compose file publishes the admin listener on host port ${SHARDSEAL_ADMIN_HOST_PORT:-19090} (container still listens on :9090). Export SHARDSEAL_ADMIN_HOST_PORT=9090 before docker compose up if the default 9090 is free on your machine.
  • Repair queue priorities: read-time integrity failures run at highest priority, scrub detections at normal priority, and admin-enqueued tasks at low priority. Metrics are tagged by reason/result for dashboards/alerts.
  • Sealed mode can be enabled via:
    • YAML: sealed.enabled: true
    • Env: SHARDSEAL_SEALED_ENABLED=true
  • Integrity scrubber (experimental verification-only) can be enabled via:
    • Env: SHARDSEAL_SCRUBBER_ENABLED=true
    • Optional overrides:
      • SHARDSEAL_SCRUBBER_INTERVAL=1h
      • SHARDSEAL_SCRUBBER_CONCURRENCY=2
      • SHARDSEAL_SCRUBBER_VERIFY_PAYLOAD=true # overrides sealed.verifyOnRead inheritance
  • Admin scrub endpoints (experimental, sealed integrity verification):
    • GET /admin/scrub/stats (RBAC: admin.read)
    • POST /admin/scrub/runonce (RBAC: admin.scrub)
    • The scrubber verifies sealed headers/footers and compares footer content-hash to the manifest. Payload re-hash verification is enabled when sealed.verifyOnRead is true (or forced via SHARDSEAL_SCRUBBER_VERIFY_PAYLOAD). Protect these with OIDC/RBAC as needed (see security.oidc.rbac and cmd.shardseal.main).
  • Repair pipeline (experimental): when Admin API is enabled, an in-memory repair queue is created. The storage layer enqueues items on sealed integrity failures during GET/HEAD, and the scrubber enqueues detected failures. A background repair worker starts (currently a no-op) and can be inspected/controlled via admin endpoints.
  • The provided docker-compose.yml includes commented environment toggles for sealed mode, scrubber, tracing, admin OIDC, and GC; uncomment to enable as needed.

Admin repair examples

Enable the Admin API (e.g., SHARDSEAL_ADMIN_ADDR=:9090). If OIDC is enabled, include a valid Bearer token; otherwise these endpoints are unauthenticated. With the provided docker-compose file, the admin listener is published on host port ${SHARDSEAL_ADMIN_HOST_PORT:-19090} (default 19090), so run host-side health checks against http://localhost:19090/admin/health (or whichever host port you exported).

# Queue length
curl -s http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/stats

# Enqueue a repair item (e.g., detected externally)
curl -s -X POST http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "bucket":"bkt",
    "key":"dir/obj.txt",
    "shardPath":"objects/bkt/dir/obj.txt/data.ss1",
    "reason":"admin"
  }'

# Scrubber controls
curl -s http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/scrub/stats
curl -s -X POST http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/scrub/runonce

# Repair worker controls
curl -s http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/worker/stats
curl -s -X POST http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/worker/pause
curl -s -X POST http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/worker/resume

Notes on authentication (OIDC)

  • Enable OIDC via config (oidc.*) or env (SHARDSEAL_OIDC_*). Set issuer (or jwksURL) and expected clientID/audience.
  • Obtain a JWT from your IdP (ID token or access token) whose aud matches the configured audience.
  • Pass the token in the Authorization header:
    • Example: curl -H "Authorization: Bearer $TOKEN" http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/repair/stats
  • Health/version exemptions: if configured, /admin/health and /admin/version can be accessed without a token.
  • RBAC: endpoints require roles like admin.read, admin.scrub, admin.repair.* (see pkg/security/oidc/rbac.go).
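Before pointing a token at the Admin API, it can help to confirm its aud claim matches the configured audience. A minimal sketch, assuming GNU coreutils base64 (inspection only, no signature verification):

```shell
# Sketch: print a JWT's payload (claims) segment for inspection
jwt_payload() {
  local seg
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # restore base64 padding stripped by base64url encoding
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="$seg="; done
  printf '%s' "$seg" | base64 -d
}

# e.g. jwt_payload "$TOKEN" | grep '"aud"'
```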

Note: The repair queue/worker can be enabled without the Admin API via config (repair.enabled: true). In that case, the queue and worker run in the background, and metrics are exported; admin endpoints are simply unavailable.

Metrics

  • Exposes Prometheus metrics at /metrics on the same HTTP server.
  • Default counters and histograms include:
    • shardseal_http_requests_total{method,code}
    • shardseal_http_request_duration_seconds_bucket/sum/count{method,code}
    • shardseal_http_inflight_requests
    • shardseal_storage_bytes_total{op}
    • shardseal_storage_ops_total{op,result}
    • shardseal_storage_op_duration_seconds_bucket/sum/count{op}
    • shardseal_storage_sealed_ops_total{op,sealed,result,integrity_fail}
    • shardseal_storage_sealed_op_duration_seconds_bucket/sum/count{op,sealed,integrity_fail}
    • shardseal_storage_integrity_failures_total{op}
    • shardseal_scrubber_scanned_total
    • shardseal_scrubber_errors_total
    • shardseal_scrubber_last_run_timestamp_seconds
    • shardseal_scrubber_uptime_seconds
    • shardseal_repair_queue_depth
    • shardseal_repair_enqueued_total{reason}
    • shardseal_repair_completed_total{result}
    • shardseal_repair_duration_seconds_bucket/sum/count{result}
  • Example:
curl -s http://localhost:8080/metrics | head -n 20

Health endpoints

  • /livez: liveness probe (always OK when process is running)
  • /readyz: readiness probe gated on initialization completion
  • /metrics: Prometheus metrics endpoint

Monitoring (Prometheus + Grafana)

  • Prometheus sample config: configs/monitoring/prometheus/prometheus.yml
  • Example alert rules: configs/monitoring/prometheus/rules.yml
  • Grafana dashboard (import JSON): configs/monitoring/grafana/shardseal_overview.json
  • Includes sealed I/O metrics, scrubber metrics (scanned/errors/last_run/uptime), and repair metrics (queue_depth). The server polls scrubber stats and repair queue length every 10s and exports to the main registry.

Compose profile (optional monitoring stack):

# 1. Bring up shardseal as usual (uses service 'shardseal')
docker compose up --build -d

# 2. Bring up monitoring stack (Prometheus + Grafana) using the 'monitoring' profile
docker compose --profile monitoring up -d

# Access:
# - ShardSeal (S3 plane): http://localhost:8080
# - ShardSeal Admin (if enabled): http://localhost:${SHARDSEAL_ADMIN_HOST_PORT:-19090}/admin/health
# - Prometheus: http://localhost:9091
# - Grafana: http://localhost:3000  (default admin/admin)
#   Add Prometheus data source at http://prometheus:9090 and import the dashboard:
#   configs/monitoring/grafana/shardseal_overview.json

Troubleshooting: to clean up stale compose state and networks and re-create the containers, run:

# Stop and remove services/anonymous resources from previous runs
# One-liner to remove both the monitoring and base profiles:
docker compose --profile monitoring down --remove-orphans && docker compose down --remove-orphans

# Remove the base profile only
docker compose down --remove-orphans

# Remove dangling user-defined networks that may reference old IDs
docker network prune -f

# (Optional) If Prometheus data retention is not required, remove its anonymous volume too
# docker volume prune -f

# Rebuild and start the base service
docker compose up --build -d

# Start the monitoring profile (creates the explicit shardseal_net if missing)
docker compose --profile monitoring up -d

Validation

Notes:

  • Explicit Docker network: docker-compose.yml defines a bridge network "shardseal_net" and attaches shardseal, prometheus, and grafana to it. This avoids stale/implicit network IDs across runs.
  • Prometheus scrape target: configs/monitoring/prometheus/prometheus.yml uses "shardseal:8080" (service DNS on the Docker network), not "localhost:8080".

Also verify:

  • The Prometheus target inside the container is "shardseal:8080" per configs/monitoring/prometheus/prometheus.yml.
  • The Grafana Prometheus datasource URL is "http://prometheus:9090" (both services share the "shardseal_net" network defined in docker-compose.yml).

Tracing and S3 error headers
  • Server spans include: http.method, http.target, http.route, http.status_code, user_agent.original, net.peer.ip, http.server_duration_ms.
  • S3 attributes (low cardinality): s3.op, s3.bucket_present, s3.admin, s3.error; s3.error_code is recorded on failures, and s3.key_hash is optionally added when enabled.
  • Enable s3.key_hash via config (tracing.keyHashEnabled: true) or env (SHARDSEAL_TRACING_KEY_HASH=true). The key hash is sha256(key) truncated to 8 bytes (16 hex chars).
  • Error responses include the header X-S3-Error-Code mirroring the S3 error code for observability. This header is only set on error responses.
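The key-hash value described above is easy to reproduce from the shell; a minimal sketch (assumes sha256sum from GNU coreutils):

```shell
# Sketch: s3.key_hash = sha256(object key), truncated to 8 bytes (16 hex chars)
key_hash() {
  printf '%s' "$1" | sha256sum | cut -c1-16
}

key_hash 'my-bucket/hello.txt'
```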

Admin endpoints (optional; available when the admin server is enabled). If OIDC is enabled, these endpoints require a valid Bearer token. RBAC defaults are enforced:

  • admin.read for GET endpoints

  • admin.gc for POST /admin/gc/multipart

  • admin.scrub for POST /admin/scrub/runonce

  • admin.repair.read for GET /admin/repair/stats

  • admin.repair.enqueue for POST /admin/repair/enqueue

  • admin.repair.control for POST /admin/repair/worker/pause and /admin/repair/worker/resume

  • /admin/health: JSON status with ready/version/addresses

  • /admin/version: JSON version info

  • POST /admin/gc/multipart: run a single multipart GC pass (requires RBAC admin.gc; OIDC-protected if enabled)

  • /admin/scrub/stats: get current scrubber stats (requires RBAC admin.read)

  • POST /admin/scrub/runonce: trigger a single scrub pass (requires RBAC admin.scrub)

  • /admin/repair/stats: current repair queue length (requires RBAC admin.repair.read)

  • POST /admin/repair/enqueue: enqueue a repair item (requires RBAC admin.repair.enqueue). Body JSON accepts RepairItem fields {bucket, key, shardPath, reason, priority}; discovered timestamp is auto-populated when omitted. The queue is in-memory in this release.

  • /admin/repair/worker/stats: repair worker status and counters (requires RBAC admin.repair.read)

  • POST /admin/repair/worker/pause: pause the repair worker (requires RBAC admin.repair.control)

  • POST /admin/repair/worker/resume: resume the repair worker (requires RBAC admin.repair.control)

Configuration

Example at configs/local.yaml:

address: ":8080"
# Optional admin/control plane on a separate port (read-only endpoints)
# adminAddress: ":9090"

dataDirs:
  - "./data"

# Authentication (optional)
# authMode: "none"        # "none" or "sigv4"
# accessKeys:
#   - accessKey: "AKIAEXAMPLE"
#     secretKey: "secret"
#     user: "local"

# Tracing (optional - OpenTelemetry OTLP)
# tracing:
#   enabled: false
#   endpoint: "localhost:4317"  # grpc default; or "localhost:4318" for http
#   protocol: "grpc"            # "grpc" or "http"
#   sampleRatio: 0.0            # 0.0-1.0
#   serviceName: "shardseal"
#   keyHashEnabled: false      # emit s3.key_hash; or set SHARDSEAL_TRACING_KEY_HASH=true
#
# Sealed mode (experimental)
# sealed:
#   enabled: false
#   verifyOnRead: false
#
# Integrity Scrubber (experimental - verification only)
# Verifies sealed header/footer CRCs and compares footer content-hash with manifest.
# Payload re-hash verification follows sealed.verifyOnRead (enabled when true).
# scrubber:
#   enabled: false
#   interval: "1h"
#   concurrency: 1

# Repair pipeline (optional; can run without Admin API)
# repair:
#   enabled: false            # when true, create repair queue and wire storage/scrubber
#   workerEnabled: true       # start background repair worker (no-op in current milestone)
#   workerConcurrency: 1

Additional optional request size limits:

# Request size limits (optional)
limits:
  singlePutMaxBytes: 5368709120    # 5 GiB cap for single PUT
  minMultipartPartSize: 5242880    # 5 MiB minimum for non-final multipart parts
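The two byte values above decode as plain powers-of-two arithmetic:

```shell
# 5 GiB and 5 MiB in bytes, matching the limits above
echo $(( 5 * 1024 * 1024 * 1024 ))   # 5368709120
echo $(( 5 * 1024 * 1024 ))          # 5242880
```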
Environment overrides:
- SHARDSEAL_CONFIG                 // path to YAML config
- SHARDSEAL_ADDR                   // data-plane listen address (e.g., 0.0.0.0:8080)
- SHARDSEAL_ADMIN_ADDR             // admin-plane listen address (e.g., 0.0.0.0:9090) to enable admin endpoints
- SHARDSEAL_DATA_DIRS              // comma-separated data directories
- SHARDSEAL_AUTH_MODE              // "none" (default) or "sigv4"
- SHARDSEAL_ACCESS_KEYS            // comma-separated ACCESS_KEY:SECRET_KEY[:USER]
- SHARDSEAL_TRACING_ENABLED        // "true"/"false"
- SHARDSEAL_TRACING_ENDPOINT       // e.g., localhost:4317 (grpc) or localhost:4318 (http)
- SHARDSEAL_TRACING_PROTOCOL       // "grpc" or "http"
- SHARDSEAL_TRACING_SAMPLE         // 0.0 - 1.0
- SHARDSEAL_TRACING_SERVICE        // service.name override
- SHARDSEAL_TRACING_KEY_HASH       // "true"/"false"; when true, emit s3.key_hash (sha256 first 8 bytes hex of object key)
- SHARDSEAL_SEALED_ENABLED         // "true"/"false" to store objects using sealed format (experimental)
- SHARDSEAL_SEALED_VERIFY_ON_READ  // "true"/"false" to verify integrity on GET/HEAD
- SHARDSEAL_SCRUBBER_ENABLED       // "true"/"false" to enable background scrubber
- SHARDSEAL_SCRUBBER_INTERVAL      // e.g., "1h"
- SHARDSEAL_SCRUBBER_CONCURRENCY   // e.g., "2"
- SHARDSEAL_SCRUBBER_VERIFY_PAYLOAD // "true"/"false" to force payload re-hash verification (overrides sealed.verifyOnRead inheritance)
- SHARDSEAL_GC_ENABLED             // "true"/"false" to enable multipart GC
- SHARDSEAL_GC_INTERVAL            // e.g., "15m"
- SHARDSEAL_GC_OLDER_THAN          // e.g., "24h"
- SHARDSEAL_OIDC_ENABLED           // "true"/"false" to protect Admin API with OIDC
- SHARDSEAL_OIDC_ISSUER            // issuer URL for discovery (preferred)
- SHARDSEAL_OIDC_CLIENT_ID         // expected client_id (audience)
- SHARDSEAL_OIDC_AUDIENCE          // optional, overrides client_id
- SHARDSEAL_OIDC_JWKS_URL          // direct JWKS URL alternative to issuer
- SHARDSEAL_OIDC_ALLOW_UNAUTH_HEALTH   // "true"/"false" to allow unauthenticated /admin/health
- SHARDSEAL_OIDC_ALLOW_UNAUTH_VERSION  // "true"/"false" to allow unauthenticated /admin/version
- SHARDSEAL_LIMIT_SINGLE_PUT_MAX_BYTES     // e.g., 5368709120 (5 GiB)
- SHARDSEAL_LIMIT_MIN_MULTIPART_PART_SIZE  // e.g., 5242880 (5 MiB)
- SHARDSEAL_REPAIR_ENABLED                 // "true"/"false" to enable repair queue without Admin API
- SHARDSEAL_REPAIR_WORKER_ENABLED          // "true"/"false" to start repair worker
- SHARDSEAL_REPAIR_WORKER_CONCURRENCY      // integer >= 1

Sealed mode (experimental)

Summary

  • When enabled, objects are stored as sealed shard files with a header | payload | footer encoding and a JSON manifest persisted alongside per-object metadata. The S3 API remains unchanged (ETag is still MD5 of the payload; SigV4 works the same).
  • Range GETs are served by seeking past the header and reading a SectionReader over just the payload. See storage.localfs and storage.manifest.

On-disk layout

  • Object directory: ./data/objects/{bucket}/{key}/
  • Data file: data.ss1
    • Header (little-endian): magic "ShardSealv1" | version:u16 | headerSize:u16 | payloadLen:u64 | headerCRC32C:u32
    • Footer: contentHash[32] (sha256 of payload) | footerCRC32C:u32
    • Format primitives implemented in erasure.rs with unit tests in erasure.rs_test.
  • Manifest: object.meta (JSON, v1)
    • Records bucket, key, size, ETag (MD5), lastModified, RS params, and a Shards[] slice with path, content hash algo/hex, payload length, header/footer CRCs.
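For quick spot checks, the fixed-size header fields can be dumped with od. A rough sketch: the byte offsets below are inferred from the field order above (an 11-byte magic puts version at offset 11) and are an assumption, not a documented ABI:

```shell
# Sketch: dump sealed-header fields of a shard file. Offsets assume
# magic[11] | version:u16 | headerSize:u16 | payloadLen:u64, little-endian
# — inferred from the layout description, not a documented ABI.
peek_header() {
  local f="$1"
  head -c 11 "$f"; echo                  # magic: ShardSealv1
  od -An -tu2 -j 11 -N 2 "$f"            # version
  od -An -tu2 -j 13 -N 2 "$f"            # headerSize
  od -An -tu8 -j 15 -N 8 "$f"            # payloadLen
}

# e.g. peek_header ./data/objects/my-bucket/hello.txt/data.ss1
```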

Behavior

  • GET/HEAD prefer sealed objects when a manifest exists; otherwise fall back to plain files (mixing sealed and plain is supported).
  • Range GETs use io.SectionReader on the payload region (efficient partial reads).
  • DELETE removes the sealed shard and the manifest; LIST derives keys from the parent dir of data.ss1 and reads metadata from the manifest. Implementation details in storage.localfs.

Integrity verification (optional)

  • Set sealed.verifyOnRead: true to validate footer CRC and sha256(payload) against the manifest during GET/HEAD.
  • Integrity failures are surfaced as 500 InternalError at the S3 layer and annotated in tracing. S3 mapping handled in api.s3.

Configuration

  • YAML (see sample in configs/local.yaml):
    • sealed.enabled: false (default)
    • sealed.verifyOnRead: false (default)
  • Environment:
    • SHARDSEAL_SEALED_ENABLED=true|false
    • SHARDSEAL_SEALED_VERIFY_ON_READ=true|false
  • Sample config and env wiring in cmd.shardseal.main.

Observability

  • Tracing: storage.sealed=true for sealed ops; storage.integrity_fail=true when verification fails.
  • Prometheus (emitted by obs.metrics.storage):
    • shardseal_storage_bytes_total{op}
    • shardseal_storage_ops_total{op,result}
    • shardseal_storage_op_duration_seconds_bucket/sum/count{op}
    • shardseal_storage_sealed_ops_total{op,sealed,result,integrity_fail}
    • shardseal_storage_sealed_op_duration_seconds_bucket/sum/count{op,sealed,integrity_fail}
    • shardseal_storage_integrity_failures_total{op}

Migration and compatibility

  • Enabling sealed mode affects only newly written objects. Existing plain files remain readable; GET/HEAD fall back to plain when no manifest is present.
  • Disabling sealed mode does not delete existing sealed objects; they continue to be served via manifest. You can transition gradually and mix sealed/plain safely.
  • ETag policy: MD5 of the full object payload is preserved for S3 compatibility (even in sealed mode). For CompleteMultipartUpload, the ETag follows the AWS multipart convention: MD5 of the concatenated binary part MD5s with a "-N" suffix. This may become configurable in a future release.

Scrubber behavior

  • Performs sealed integrity verification: validates sealed headers/footers and footer content-hash against the manifest; optional payload re-hash when sealed.verifyOnRead is true.
  • See the Admin endpoints section above for routes and RBAC.

Authentication (optional SigV4)

  • Disabled by default. Enable verification and provide credentials either via config or environment:
export SHARDSEAL_AUTH_MODE=sigv4
export SHARDSEAL_ACCESS_KEYS='AKIAEXAMPLE:secret:local'
# Run server after setting env
SHARDSEAL_CONFIG=configs/local.yaml make run

When enabled, the server requires valid AWS Signature V4 on S3 requests (both Authorization header and presigned URLs are supported). Health endpoints (/livez, /readyz, /metrics) remain unauthenticated.

Notes & limitations (current MVP)

  • Authentication: optional. AWS SigV4 supported (header and presigned; disabled by default via config/env).
  • ETag is the MD5 of the full object for single-part PUTs; for multipart completes, the ETag follows the AWS multipart convention (MD5 of the concatenated binary part MD5s with a "-N" suffix).
  • Objects stored under ./data/objects/{bucket}/{key}
  • Multipart temporary parts stored in a separate staging bucket: .multipart/{bucket}/{key}/{uploadId}/part.N (excluded from user listings and bucket-empty checks; cleaned up on complete/abort)
  • Range requests require seekable storage (LocalFS supports this)
  • Single PUT size cap: 5 GiB (configurable via limits.singlePutMaxBytes or env SHARDSEAL_LIMIT_SINGLE_PUT_MAX_BYTES). Larger uploads must use Multipart Upload (responds with S3 error code EntityTooLarge).
  • Error detail: EntityTooLarge responses include MaxAllowedSize and a hint to use Multipart Upload.
  • Multipart part size: 5 MiB minimum for all parts except the final part (configurable via limits.minMultipartPartSize or env SHARDSEAL_LIMIT_MIN_MULTIPART_PART_SIZE). Intended for S3 compatibility; very small multi-part aggregates used in tests may bypass this check.
  • LocalFS writes are atomic via temp+rename on Put, reducing risk of partial files on error.

Recent Improvements (2025-10-29)

  • Implemented AWS SigV4 authentication verification (headers and presigned) with unit tests
  • Exposed Prometheus metrics at /metrics and added HTTP instrumentation middleware
  • Added liveness (/livez) and readiness (/readyz) endpoints; readiness gated after initialization
  • Fixed critical memory issues: streaming multipart completion; safe handling for non-seekable Range GET
  • Hid internal multipart files from listings and bucket-empty checks; normalized temp part layout

Recent Improvements (2025-10-30)

  • Tracing enrichment: error responses now set X-S3-Error-Code; tracing middleware records s3.error_code.
  • Optional s3.key_hash attribute on spans (sha256(key) truncated to 8 bytes hex), configurable via tracing.keyHashEnabled or env SHARDSEAL_TRACING_KEY_HASH=true.
  • README, sample config, and tests updated accordingly.

Recent Improvements (2025-10-31)

  • ShardSeal v1 sealed mode (experimental, feature-flagged):
    • LocalFS now writes sealed shard files (header | payload | footer) and persists a JSON manifest; Range GETs are served via a SectionReader. Delete/List are aware of sealed layout (see storage.localfs, storage.manifest).
    • Optional verifyOnRead validates footer CRC and sha256(payload) on GET/HEAD; integrity failures are mapped to 500 InternalError at the S3 layer (see api.s3).
    • Tests added for storage-level and S3-level sealed behavior including corruption detection (see storage.localfs_sealed_test, api.s3.server_sealed_test).
    • Observability: tracing annotates storage.sealed and storage.integrity_fail; Prometheus sealed I/O metrics added (see obs.metrics.storage).

Recent Improvements (2025-11-01)

  • Admin repair control surface:
    • Queue endpoints: GET /admin/repair/stats, POST /admin/repair/enqueue
    • Worker endpoints: GET /admin/repair/worker/stats, POST /admin/repair/worker/pause, POST /admin/repair/worker/resume
    • RBAC roles: admin.repair.read, admin.repair.enqueue, admin.repair.control
  • Observability:
    • shardseal_repair_queue_depth metric with periodic polling
    • Prometheus recording rules and alerts for repair queue depth
    • Grafana panels for repair queue depth (stat and timeseries)
  • Documentation: Admin endpoints, RBAC roles, and monitoring sections updated to reflect current state.

Recent Improvements (2025-11-02)

  • Fixed the sealed shard header encoder so the emitted byte length always matches the advertised header size, preventing payload shifts after repairs or rewrites.
  • Added regression testing around header-size invariants to catch future encoder regressions early.

Roadmap (short)

  1. ShardSeal v1 storage format + erasure coding
  2. Background scrubber and self-healing
  3. Admin API hardening (OIDC/RBAC), monitoring assets (dashboards/alerts)

One-liner reset (cleanup + rebuild + monitoring)

Use this single command to fully reset the stack, remove stale networks/containers from older runs, rebuild, and bring up monitoring:

bash -lc 'docker compose --profile monitoring down --remove-orphans; docker compose down --remove-orphans; docker ps -q --filter network=shardseal_net | xargs -r docker rm -f; docker network rm shardseal_net 2>/dev/null || true; docker network prune -f; docker compose up --build -d; docker compose --profile monitoring up -d'

Notes:

  • This works even if an older Compose run created a legacy fixed-name network "shardseal_net" with stray containers still attached.
  • For current versions of this repo, the Compose network is project-scoped (no fixed name) per docker-compose.yml; docker compose down --remove-orphans will remove it automatically unless unrelated containers attach to it.
  • Prometheus scrapes the service via Docker DNS at shardseal:8080 per configs/monitoring/prometheus/prometheus.yml.

If you prefer step-by-step commands, see the "Troubleshooting infos" and "Validation" sections above.

Running locally (without Docker) also works, but the Prometheus and Grafana paths need adjusting (Docker is easier πŸ˜‰):

Quick start:

# 1) Run shardseal (default :8080 exposes /metrics)
SHARDSEAL_CONFIG=configs/local.yaml make run

# 2) Start Prometheus (adjust path as needed)
prometheus --config.file=configs/monitoring/prometheus/prometheus.yml

# 3) Import Grafana dashboard JSON:
#    configs/monitoring/grafana/shardseal_overview.json
#    and set the Prometheus datasource accordingly.

License

AGPL-3.0-or-later

Contributing

Early-stage experimental project β€” contributions welcome, especially in areas of:

  • Erasure coding implementations
  • Distributed systems algorithms
  • Storage integrity verification techniques
  • Performance optimizations

Please keep code documented and tested. Note that the project structure and APIs may change significantly as the design evolves.
