
Global Bandwidth Infrastructure for Data for Canada (Target: 1 TB/s) #33

@diegoripley

Description


I spent about 20 minutes with ChatGPT to create this issue. My plans are actually bigger than this 🤓😎✌️

Summary

To support the long-term goals of Data for Canada, we should explore provisioning a globally distributed network of VPS instances, seedboxes, and low-cost hosting providers, with an initial aggregate bandwidth target of ~1 TB/s, growing to 50 TB/s of dedicated bandwidth within the first year, in addition to CDN bandwidth.

The purpose of this infrastructure would be to improve data distribution, redundancy, and availability for large public datasets hosted or distributed by the project. It would also feed the ingestor pipeline, allowing us to use the Cloudflare CDN to deliver data to its final destination efficiently.

Motivation

Data for Canada aims to make Canadian public data easily accessible, reproducible, and distributable. As datasets grow (especially census, geospatial, and statistical datasets, orthoimagery, and most importantly field imagery and health-related datasets), bandwidth and geographic distribution become critical. We would use this same infrastructure for the Data for the Universe project.

A distributed infrastructure could provide:

  • Faster downloads for users in different regions
  • Redundancy in case of outages
  • Better distribution of large datasets
  • Improved resilience against hosting limits, throttling, or unexpected events
  • Support for future services such as mirrors, torrents, or dataset seeding

Proposed Approach

Provision infrastructure across multiple regions using a combination of:

  • VPS providers
  • Seedbox providers
  • Budget hosting services
  • Community-contributed nodes
  • CDNs (e.g. Cloudflare, Akamai, EdgeNet, Fastly, Bunny.net, Cloud CDN, etc.)

The goal would be to aggregate global bandwidth capacity, rather than relying on a single hosting provider.

Potential uses for nodes:

  • Dataset mirrors
  • Torrent seeders
  • Static file distribution
  • CDN-like edge distribution
  • Backup nodes for dataset packages
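One way to make these per-node roles concrete is a small declaration file that provisioning scripts could read. This is a hypothetical sketch only; the file path, variable names, and values are all placeholders, not an agreed format.

```shell
# Hypothetical per-node role declaration (e.g. /etc/dfc/node.env).
# Every name and value here is a placeholder for illustration.
NODE_ID="yul-mirror-01"         # region code + role + index
NODE_ROLE="mirror"              # one of: mirror | seeder | archive
NODE_REGION="na-east"
SYNC_UPSTREAM="origin:datasets" # hypothetical rclone remote to pull from
BANDWIDTH_CAP_MBPS=2500         # leave headroom on a 10 Gbps uplink
```

A seeder or archive node would carry the same fields with a different NODE_ROLE, which keeps automation uniform across the fleet.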

Rough Bandwidth Target

Initial goal: ~1 TB/s aggregate bandwidth capacity within the first week, 10 TB/s within the first month. I can do it super cheap 😆.

This does not need to come from a single provider and can be achieved through a distributed network of many smaller hosts.
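As a rough sanity check on what "many smaller hosts" means: 1 TB/s is 8,000 Gbps, so at an assumed 10 Gbps uplink per host the fleet would need on the order of 800 hosts running at full line rate (more in practice, since sustained utilization is never 100%).

```shell
# Back-of-envelope sizing for the 1 TB/s aggregate target.
# 1 TB/s = 8 Tb/s = 8000 Gb/s; 10 Gbps per host is an assumption.
target_gbps=8000
per_host_gbps=10
hosts=$(( target_gbps / per_host_gbps ))
echo "$hosts hosts at full line rate"   # prints: 800 hosts at full line rate
```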

Implementation Ideas

Possible architecture:

  • Regional mirrors (All continents minus Antarctica 🐧🥳)
  • Torrent seeding for large datasets
  • Object storage replication
  • Geo-distributed download endpoints
  • Automated selective synchronization between nodes
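The "automated selective synchronization" idea can be sketched with rclone, which is already on the tools list below. The remote names, dataset prefixes, and filter file here are hypothetical; the rclone flags used (--filter-from, --transfers, --checkers, --bwlimit) are real.

```shell
# Hypothetical remotes: "origin" (source of truth) and "mirror-eu" (a regional node).
# The filter file decides which dataset prefixes this particular node mirrors.
cat > /tmp/dataset-filters.txt <<'EOF'
+ /census/**
+ /geospatial/**
- *
EOF

# Only runs where rclone is installed; the flags cap concurrency and bandwidth
# so a sync window does not saturate the node's uplink.
if command -v rclone >/dev/null; then
  rclone sync origin:datasets mirror-eu:datasets \
    --filter-from /tmp/dataset-filters.txt \
    --transfers 16 --checkers 32 \
    --bwlimit 1G
fi
```

Having each node mirror only a subset of prefixes is what makes the sync "selective": the filter file, not the node, decides what lands where.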

Tools that will be used

  • rclone
  • BitTorrent
  • libp2p
  • IPFS
  • CDN frontends
  • GitHub Releases + mirrors
  • CloudFlare Containers
  • Hardened container images
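For the BitTorrent entry, seeding a dataset starts with generating a .torrent file. A minimal sketch using transmission-create (from the Transmission project) follows; the dataset directory and tracker URL are placeholders.

```shell
# Build a tiny placeholder dataset so the example is self-contained.
mkdir -p census-sample
echo "demo payload" > census-sample/README.txt

# Only runs where transmission-create is installed.
# -o names the output torrent; -t adds a tracker announce URL (placeholder).
if command -v transmission-create >/dev/null; then
  transmission-create \
    -o census-sample.torrent \
    -t udp://tracker.example.org:6969/announce \
    census-sample
fi
```

Seeder nodes would then run a BitTorrent client (e.g. transmission-daemon) pointed at a directory of these torrents.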

Tasks

  • Research affordable VPS providers with high bandwidth limits
  • Identify seedbox providers suitable for dataset seeding
  • Build hardened container images
  • Define node roles (mirror, seeder, archive)
  • Create automated dataset sync system
  • Design mirror directory structure
  • Evaluate cost vs bandwidth efficiency
  • Recruit volunteer mirror operators
  • Document infrastructure architecture
