
Architecting Optimal Distribution Algorithm & Adaptive Provider Strategy #25

@diegoripley

Description

To ensure resilient, cost-effective, and high-performance distribution of datasets, we need a formalized algorithm that accounts for the specific technical, geographic, and legal constraints of our current data providers. This issue focuses on mapping out an optimal distribution algorithm that dictates routing based on file size, file count, regional redundancy, and applicable laws.


📊 Provider Landscape & Constraints

| Provider | Constraints / Limits | Strategic Notes |
| --- | --- | --- |
| Zenodo | 100 files / 50GB per record (up to 200GB) | Ideal for archival, DOI provider; strict file count limits. |
| Internet Archive | 500 files / 500GB (5k files/day cap) | High capacity, but ingestion is often throttled. |
| Cloudflare R2 | Unlimited bandwidth | Need to limit range requests (Worker costs). R2 limits, Workers limits. Requires regional replication. |
| Source Coop | Most permissive | Located in us-west-2 (seismic risk); requires secondary geographic redundancy. |
| P2P (libp2p/IPFS) | Infrastructure dependent | Leverages cheap d4c-infra-distribution infrastructure for decentralized dissemination. [See #9] |
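
As a starting point for discussion, the limits in the table could be encoded as data so that eligibility checks are mechanical. This is only a sketch: the names `ProviderLimit`, `LIMITS`, `fits`, and `eligible_providers` are hypothetical, GB is assumed to mean 10^9 bytes, and providers with no numeric cap listed (Cloudflare R2, Source Coop) are treated as unlimited. Zenodo's 200GB extension, the Internet Archive daily cap, and P2P are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional

GB = 10**9  # assumption: decimal gigabytes; the table does not specify GB vs GiB


@dataclass(frozen=True)
class ProviderLimit:
    """Per-dataset limits for one provider (hypothetical structure)."""
    name: str
    max_files: Optional[int]  # None means no file-count cap is listed in the table
    max_bytes: Optional[int]  # None means no size cap is listed in the table


# Values copied from the table above; None entries are assumptions.
LIMITS = [
    ProviderLimit("Zenodo", 100, 50 * GB),
    ProviderLimit("Internet Archive", 500, 500 * GB),
    ProviderLimit("Cloudflare R2", None, None),
    ProviderLimit("Source Coop", None, None),
]


def fits(limit: ProviderLimit, file_count: int, total_bytes: int) -> bool:
    """Return True if the dataset stays within the provider's listed caps."""
    if limit.max_files is not None and file_count > limit.max_files:
        return False
    if limit.max_bytes is not None and total_bytes > limit.max_bytes:
        return False
    return True


def eligible_providers(file_count: int, total_bytes: int) -> List[str]:
    """List the providers whose listed caps the dataset does not exceed."""
    return [p.name for p in LIMITS if fits(p, file_count, total_bytes)]
```

For example, `eligible_providers(300, 100 * GB)` excludes Zenodo on both file count and size, while leaving the other three providers as candidates.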

🗺️ Logic Flowchart (Draft)

This diagram outlines a sample decision-making process for dataset routing. Pricing will definitely be a factor, as well as how "nice" we want to be to our hosts (Source Cooperative, Zenodo, Internet Archive). For high-value datasets, I am willing to put them up on multiple providers.

graph TD
    A[New Dataset] --> B{Total Size?}
    B -- "< 50GB & < 100 Files" --> C[Zenodo Archival]
    B -- "> 50GB or > 500 Files" --> D[Source Cooperative]
    D --> E{Redundancy Needed?}
    E -- "Yes (Geographic)" --> F[Replicate to Cloudflare R2]
    E -- "Yes (P2P)" --> G[Disseminate via libp2p/IPFS]
    F --> H{Access Pattern?}
    H -- "High Volume" --> I[Implement Range Request Limits]
    H -- "Global" --> J[Regional Peering Benchmarking]
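
The draft flowchart can be mirrored in a small function to make its branches testable. This is a sketch, not a settled design: the function name `route` and its boolean parameters are hypothetical, and GB is assumed to be 10^9 bytes. Writing it out also surfaces a gap in the draft: datasets under 50GB with 100 to 500 files (or exactly 50GB) match neither branch, so the sketch returns an empty list for them rather than guessing.

```python
from typing import List

GB = 10**9  # assumption: decimal gigabytes


def route(
    file_count: int,
    total_bytes: int,
    geo_redundancy: bool = False,
    p2p: bool = False,
    high_volume: bool = False,
    global_access: bool = False,
) -> List[str]:
    """Return the flowchart nodes a dataset would pass through, in order."""
    # Branch B -> C: small datasets go to Zenodo.
    if total_bytes < 50 * GB and file_count < 100:
        return ["Zenodo Archival"]

    # Branch B -> D: large datasets go to Source Cooperative.
    if total_bytes > 50 * GB or file_count > 500:
        steps = ["Source Cooperative"]
        if geo_redundancy:
            steps.append("Replicate to Cloudflare R2")
            # Access-pattern decisions only apply once data is on R2.
            if high_volume:
                steps.append("Implement Range Request Limits")
            if global_access:
                steps.append("Regional Peering Benchmarking")
        if p2p:
            steps.append("Disseminate via libp2p/IPFS")
        return steps

    # Not covered by the draft flowchart (e.g. 200 files totalling 1GB).
    return []
```

Usage: `route(1000, 100 * GB, geo_redundancy=True, high_volume=True)` walks the Source Cooperative, R2 replication, and range-request-limit nodes, while `route(200, 1 * GB)` returns `[]` and flags the uncovered case.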

🎯 Objectives & TODOs

1. Benchmarking

  • Infrastructure Speed Test: Measure upload/egress speeds from our current infrastructure to each provider.
  • Network Stability: Document latency and peering issues encountered with Cloudflare R2.
  • P2P Performance: Test libp2p dissemination speed using d4c-infra-distribution smart nodes.
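
One possible shape for the speed tests above is a provider-agnostic timing helper that wraps whatever upload or download call each provider needs. This is a minimal sketch: `measure_throughput` and the `transfer` callable are hypothetical, no real provider API is called, and reporting the best of several runs is just one reasonable convention.

```python
import time
from typing import Callable


def measure_throughput(
    transfer: Callable[[bytes], None],
    payload: bytes,
    repeats: int = 3,
) -> float:
    """Return the best observed throughput in MB/s (10^6 bytes per second).

    `transfer` is any callable that sends `payload` to a provider and
    blocks until the transfer completes.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        transfer(payload)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(payload) / elapsed / 1e6)
    return best


def fake_upload(data: bytes) -> None:
    """Stand-in for a real upload; sleeps to simulate network time."""
    time.sleep(0.01)
```

Calling `measure_throughput(fake_upload, b"x" * 1_000_000)` exercises the helper without touching the network; a real benchmark would pass a function that performs the actual upload to Zenodo, Internet Archive, R2, or Source Coop.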

2. Documentation & Optimization
