Description
To ensure resilient, cost-effective, and high-performance distribution of datasets, we need to formalize an algorithm that accounts for the specific technical, geographic, and legal constraints of our current data providers. This issue focuses on mapping out an optimal distribution algorithm that routes datasets based on file size, file count, regional redundancy, and legal requirements.
📊 Provider Landscape & Constraints
| Provider | Constraints / Limits | Strategic Notes |
|---|---|---|
| Zenodo | 100 files / 50GB per record (up to 200GB) | Ideal for archival, DOI provider; strict file count limits. |
| Internet Archive | 500 files / 500GB (5k files/day cap) | High capacity, but ingestion is often throttled. |
| Cloudflare R2 | Unlimited bandwidth; subject to R2 and Workers platform limits | Need to limit range requests to control Worker costs; requires regional replication. |
| Source Coop | Most permissive | Located in us-west-2 (seismic risk); requires secondary geographic redundancy. |
| P2P (libp2p/IPFS) | Infrastructure dependent | Leverages cheap d4c-infra-distribution infrastructure for decentralized dissemination. [See #9] |
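For reference, the limits in the table above could be encoded as a plain data structure that the routing algorithm consumes. This is a sketch: the dictionary keys and field names are hypothetical, and the values simply mirror the table until the "Source of Truth" document is finalized.

```python
# Sketch: provider limits from the table above, encoded as plain data.
# Field names are hypothetical; values mirror the table and should be
# replaced by the finalized "Source of Truth" document.
PROVIDERS = {
    "zenodo": {
        "max_files_per_record": 100,
        "max_size_gb": 50,          # up to 200 GB with a quota increase
        "notes": "archival, DOI provider",
    },
    "internet_archive": {
        "max_files_per_item": 500,
        "max_size_gb": 500,
        "daily_file_cap": 5000,     # ingestion often throttled
    },
    "cloudflare_r2": {
        "bandwidth": "unlimited",
        "notes": "limit range requests to control Worker costs",
    },
    "source_coop": {
        "region": "us-west-2",      # seismic risk: needs secondary redundancy
        "notes": "most permissive",
    },
}
```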
🗺️ Logic Flowchart (Draft)
This diagram outlines a sample decision-making process for dataset routing. Pricing will definitely be a factor, as well as how "nice" we want to be to our hosts (Source Cooperative, Zenodo, Internet Archive). For high-value datasets, I am willing to put them up on multiple providers.
```mermaid
graph TD
    A[New Dataset] --> B{Total Size?}
    B -- "< 50GB & < 100 Files" --> C[Zenodo Archival]
    B -- ">= 50GB or >= 100 Files" --> D[Source Cooperative]
    D --> E{Redundancy Needed?}
    E -- "Yes (Geographic)" --> F[Replicate to Cloudflare R2]
    E -- "Yes (P2P)" --> G[Disseminate via libp2p/IPFS]
    F --> H{Access Pattern?}
    H -- "High Volume" --> I[Implement Range Request Limits]
    H -- "Global" --> J[Regional Peering Benchmarking]
```
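The draft decision process above can be sketched as a routing function. This is illustrative only: the function and target names are hypothetical, and the thresholds are taken from the Zenodo row of the provider table (100 files / 50 GB per record).

```python
# Sketch of the draft routing logic above. Names are hypothetical;
# thresholds come from the provider table (Zenodo: 100 files / 50 GB).
def route_dataset(size_gb: float, file_count: int,
                  geo_redundancy: bool = False, p2p: bool = False) -> list[str]:
    """Return an ordered list of distribution targets for a new dataset."""
    targets = []
    if size_gb < 50 and file_count < 100:
        targets.append("zenodo")             # fits a single archival record
    else:
        targets.append("source_coop")        # primary home for large datasets
        if geo_redundancy:
            targets.append("cloudflare_r2")  # replicate out of us-west-2
        if p2p:
            targets.append("libp2p_ipfs")    # decentralized dissemination
    return targets
```

For example, `route_dataset(120, 300, geo_redundancy=True)` returns `["source_coop", "cloudflare_r2"]`. High-value datasets could extend this to fan out to multiple providers at once.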
🎯 Objectives & TODOs
1. Benchmarking
- Infrastructure Speed Test: Measure upload/egress speeds from our current infrastructure to each provider.
- Network Stability: Document latency and peering issues encountered with Cloudflare R2.
- P2P Performance: Test libp2p dissemination speed using `d4c-infra-distribution` smart nodes.
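One way to structure the infrastructure speed test is a small timing harness that wraps each provider's upload client. A minimal sketch, assuming the `upload` callable is a placeholder for whatever client each provider actually exposes (not a real API):

```python
import time

# Sketch: time a single upload to a provider and report throughput.
# `upload` is a placeholder for the provider-specific client call.
def benchmark_upload(upload, payload: bytes) -> float:
    """Return observed throughput in MB/s for one upload call."""
    start = time.perf_counter()
    upload(payload)                 # provider-specific client call goes here
    elapsed = time.perf_counter() - start
    return (len(payload) / 1_000_000) / elapsed
```

Running the same payload size against each provider from the same box would give directly comparable numbers for the benchmarking table.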
2. Documentation & Optimization
- Finalize modelling MVP: experiment with moving data between various storage systems (e.g. Source Coop, Internet Archive, Cloudflare R2). See #11.
- Cost Analysis: Model Cloudflare Worker costs vs. egress savings for large-scale range requests.
- Risk Mitigation: Define the "Earthquake Strategy" for data hosted in `us-west-2` (Source Coop).
- Provider Comparison: Finalize a "Source of Truth" document for all provider limits and API capabilities.
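The cost analysis could start from a simple model comparing Worker invocation cost against the egress another provider would have charged. A sketch only: the rates below are assumed input parameters, not Cloudflare's published pricing.

```python
# Sketch: Worker cost vs. egress savings for range-heavy access.
# All rates are assumed inputs, not published pricing.
def monthly_worker_cost(requests: int, cost_per_million: float) -> float:
    """Cost of serving range requests through Workers."""
    return requests / 1_000_000 * cost_per_million

def egress_savings(gb_served: float, alt_egress_per_gb: float) -> float:
    """R2 bandwidth is unlimited; savings = what an alternative would charge."""
    return gb_served * alt_egress_per_gb

# Example with assumed rates: 50M range requests, 10 TB served,
# $0.30 per million requests, $0.09/GB alternative egress.
cost = monthly_worker_cost(50_000_000, 0.30)   # 15.0
saved = egress_savings(10_000, 0.09)           # 900.0
```

Even with rough inputs, this frames the decision: range-request limits only matter once Worker invocation cost approaches the egress savings.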