Spent about 20 minutes with ChatGPT to create this issue. My plans are actually bigger than this 🤓😎✌️
## Summary
To support the long-term goals of Data for Canada, we should explore provisioning a globally distributed network of VPS instances, seedboxes, and low-cost hosting providers, with an initial aggregate bandwidth target of ~1 TB/s, growing to 50 TB/s of dedicated bandwidth within the first year, in addition to CDN bandwidth.
The purpose of this infrastructure is to improve data distribution, redundancy, and availability for large public datasets hosted or distributed by the project. It would also feed the ingestor pipeline, letting us use the CloudFlare CDN to deliver data to its final destination efficiently.
## Motivation
Data for Canada aims to make Canadian public data easily accessible, reproducible, and distributable. As datasets grow (especially census, geospatial, and statistical datasets, orthoimagery, and, most importantly, field imagery and health-related datasets), bandwidth and geographic distribution become critical. The same infrastructure will also serve the Data for the Universe project.
A distributed infrastructure could provide:
- Faster downloads for users in different regions
- Redundancy in case of outages
- Better distribution of large datasets
- Improved resilience against hosting limits, throttling, or unexpected events
- Support for future services such as mirrors, torrents, or dataset seeding
## Proposed Approach
Provision infrastructure across multiple regions using a combination of:
- VPS providers
- Seedbox providers
- Budget hosting services
- Community-contributed nodes
- CDNs (e.g. CloudFlare, Akamai, EdgeNet, Fastly, Bunny.net, Cloud CDN, etc.)
The goal would be to aggregate global bandwidth capacity, rather than relying on a single hosting provider.
Potential uses for nodes:
- Dataset mirrors
- Torrent seeders
- Static file distribution
- CDN-like edge distribution
- Backup nodes for dataset packages
## Rough Bandwidth Target
Initial goal: ~1 TB/s aggregate bandwidth capacity within the first week, 10 TB/s within the first month. I can do it super cheap 😆.
This does not need to come from a single provider and can be achieved through a distributed network of many smaller hosts.
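As a rough back-of-the-envelope check (assuming "1 TB/s" means terabytes per second, i.e. 8,000 Gbit/s in aggregate), the node count needed for a given per-node uplink can be sketched as:

```python
import math

def nodes_needed(target_tb_per_s: float, uplink_gbit_per_s: float) -> int:
    """Number of nodes whose combined sustained uplink meets an aggregate target.

    target_tb_per_s: aggregate goal in terabytes/second (1 TB/s = 8,000 Gbit/s).
    uplink_gbit_per_s: sustained uplink per node in gigabits/second.
    """
    target_gbit = target_tb_per_s * 8_000  # TB/s -> Gbit/s
    return math.ceil(target_gbit / uplink_gbit_per_s)

# ~1 TB/s from 10 Gbit/s nodes:
print(nodes_needed(1, 10))   # 800 nodes
# 10 TB/s month-one goal from the same node class:
print(nodes_needed(10, 10))  # 8000 nodes
```

In practice nodes rarely sustain their full port speed, so real counts would be higher; the point is that the target is reachable only through aggregation across many hosts.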
## Implementation Ideas
Possible architecture:
- Regional mirrors (all continents minus Antarctica 🐧🥳)
- Torrent seeding for large datasets
- Object storage replication
- Geo-distributed download endpoints
- Automated selective synchronization between nodes
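The geo-distributed download endpoints could be sketched as a simple nearest-mirror selector: rank mirror nodes by great-circle distance to the client and return the closest ones. All node names and coordinates below are hypothetical placeholders, not actual deployments:

```python
import math

# Hypothetical mirror nodes: name -> (latitude, longitude)
MIRRORS = {
    "ca-toronto": (43.65, -79.38),
    "de-frankfurt": (50.11, 8.68),
    "sg-singapore": (1.35, 103.82),
    "us-dallas": (32.78, -96.80),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def closest_mirrors(client, n=2):
    """Return the n mirror names nearest to the client's (lat, lon)."""
    return sorted(MIRRORS, key=lambda m: haversine_km(client, MIRRORS[m]))[:n]

# A client in Montreal should be steered to the Toronto mirror first:
print(closest_mirrors((45.50, -73.57)))
```

A production version would use GeoDNS or the CDN's own routing instead of client-side selection, but the ranking logic is the same.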
## Tools that will be used
- rclone
- BitTorrent
- libp2p
- IPFS
- CDN frontends
- GitHub Releases + mirrors
- CloudFlare Containers
- Hardened container images
## Tasks
- Research affordable VPS providers with high bandwidth limits
- Identify seedbox providers suitable for dataset seeding
- Build hardened container images
- Define node roles (mirror, seeder, archive)
- Create automated dataset sync system
- Design mirror directory structure
- Evaluate cost vs bandwidth efficiency
- Recruit volunteer mirror operators
- Document infrastructure architecture
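For the cost-vs-bandwidth-efficiency task, one simple comparison metric is dollars per terabyte actually deliverable per month. The sketch below assumes a 30-day month and a utilisation factor for unmetered ports; the prices in the usage lines are made-up placeholders, not real quotes:

```python
def cost_per_tb(monthly_usd, uplink_gbit, cap_tb=None, util=0.5):
    """Dollars per TB delivered per month for one node.

    monthly_usd: node price per month.
    uplink_gbit: port speed in Gbit/s.
    cap_tb:      monthly transfer cap in TB, or None for unmetered.
    util:        assumed average utilisation of an unmetered port (0..1).
    """
    seconds = 30 * 24 * 3600
    unmetered_tb = uplink_gbit / 8 / 1000 * seconds * util  # Gbit/s -> TB/month
    deliverable = unmetered_tb if cap_tb is None else min(cap_tb, unmetered_tb)
    return monthly_usd / deliverable

# Hypothetical offers:
print(cost_per_tb(5, 1, cap_tb=20))        # $5 node, 1 Gbit/s, 20 TB cap
print(round(cost_per_tb(40, 10), 4))       # $40 unmetered 10 Gbit/s at 50% util
```

Ranking candidate providers by this number (rather than by headline port speed) would make the "evaluate cost vs bandwidth efficiency" task concrete.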