Spent about 20 minutes with ChatGPT to create this issue. My plans are actually bigger than this 🤓😎✌️
## Summary
To support the long-term goals of Data for Canada, we should explore provisioning a globally distributed network of VPS instances, seedboxes, and low-cost hosting providers, with an initial aggregate bandwidth target of ~1 TB/s, growing to 50 TB/s of dedicated bandwidth within the first year, in addition to CDN bandwidth.
The purpose of this infrastructure is to improve data distribution, redundancy, and availability for large public datasets hosted or distributed by the project. It would also feed the ingestor pipeline, letting us use the CloudFlare CDN to deliver data to its final destination efficiently.
## Motivation
Data for Canada aims to make Canadian public data easily accessible, reproducible, and distributable. As datasets grow (especially census, geospatial, and statistical datasets, orthoimagery, and, most importantly, field imagery and health-related datasets), bandwidth and geographic distribution become critical. The same infrastructure will also serve the Data for the Universe project.
A distributed infrastructure could provide:
- Faster downloads for users in different regions
- Redundancy in case of outages
- Better distribution of large datasets
- Improved resilience against hosting limits, throttling, or unexpected events
- Support for future services such as mirrors, torrents, or dataset seeding
## Proposed Approach
Provision infrastructure across multiple regions using a combination of:
- VPS providers
- Seedbox providers
- Budget hosting services
- Community-contributed nodes
- CDNs (e.g. CloudFlare, Akamai, EdgeNet, Fastly, Bunny.net, Cloud CDN, etc.)
The goal would be to aggregate global bandwidth capacity, rather than relying on a single hosting provider.
Potential uses for nodes:
- Dataset mirrors
- Torrent seeders
- Static file distribution
- CDN-like edge distribution
- Backup nodes for dataset packages
## Rough Bandwidth Target
Initial goal: ~1 TB/s aggregate bandwidth capacity within the first week, 10 TB/s within the first month. I can do it super cheap 😆.
This does not need to come from a single provider and can be achieved through a distributed network of many smaller hosts.
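As a rough back-of-the-envelope check (assuming "1 TB/s" means terabytes per second, i.e. 8,000 Gbit/s in aggregate), the node count needed for a given per-node uplink can be sketched as:

```python
import math

def nodes_needed(target_tb_per_s: float, uplink_gbit_per_s: float) -> int:
    """Number of nodes whose combined sustained uplink meets an aggregate target.

    target_tb_per_s: aggregate goal in terabytes/second (1 TB/s = 8,000 Gbit/s).
    uplink_gbit_per_s: sustained uplink per node in gigabits/second.
    """
    target_gbit = target_tb_per_s * 8_000  # TB/s -> Gbit/s
    return math.ceil(target_gbit / uplink_gbit_per_s)

# ~1 TB/s from 10 Gbit/s nodes:
print(nodes_needed(1, 10))   # 800 nodes
# 10 TB/s month-one goal from the same node class:
print(nodes_needed(10, 10))  # 8000 nodes
```

In practice nodes rarely sustain their full port speed, so real counts would be higher; the point is that the target is reachable only through aggregation across many hosts.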
## Implementation Ideas
Possible architecture:
- Regional mirrors (all continents minus Antarctica 🐧🥳)
- Torrent seeding for large datasets
- Object storage replication
- Geo-distributed download endpoints
- Automated selective synchronization between nodes
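The geo-distributed download endpoints could be sketched as a simple nearest-mirror selector: rank mirror nodes by great-circle distance to the client and return the closest ones. All node names and coordinates below are hypothetical placeholders, not actual deployments:

```python
import math

# Hypothetical mirror nodes: name -> (latitude, longitude)
MIRRORS = {
    "ca-toronto": (43.65, -79.38),
    "de-frankfurt": (50.11, 8.68),
    "sg-singapore": (1.35, 103.82),
    "us-dallas": (32.78, -96.80),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def closest_mirrors(client, n=2):
    """Return the n mirror names nearest to the client's (lat, lon)."""
    return sorted(MIRRORS, key=lambda m: haversine_km(client, MIRRORS[m]))[:n]

# A client in Montreal should be steered to the Toronto mirror first:
print(closest_mirrors((45.50, -73.57)))
```

A production version would use GeoDNS or the CDN's own routing instead of client-side selection, but the ranking logic is the same.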
## Tools that will be used
- rclone
- BitTorrent
- libp2p
- IPFS
- CDN frontends
- GitHub Releases + mirrors
- CloudFlare Containers
- Hardened container images
## Tasks
- Research affordable VPS providers with high bandwidth limits
- Identify seedbox providers suitable for dataset seeding
- Build hardened container images
- Define node roles (mirror, seeder, archive)
- Create automated dataset sync system
- Design mirror directory structure
- Evaluate cost vs bandwidth efficiency
- Recruit volunteer mirror operators
- Document infrastructure architecture
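For the cost-vs-bandwidth-efficiency task, one simple comparison metric is dollars per terabyte actually deliverable per month. The sketch below assumes a 30-day month and a utilisation factor for unmetered ports; the prices in the usage lines are made-up placeholders, not real quotes:

```python
def cost_per_tb(monthly_usd, uplink_gbit, cap_tb=None, util=0.5):
    """Dollars per TB delivered per month for one node.

    monthly_usd: node price per month.
    uplink_gbit: port speed in Gbit/s.
    cap_tb:      monthly transfer cap in TB, or None for unmetered.
    util:        assumed average utilisation of an unmetered port (0..1).
    """
    seconds = 30 * 24 * 3600
    unmetered_tb = uplink_gbit / 8 / 1000 * seconds * util  # Gbit/s -> TB/month
    deliverable = unmetered_tb if cap_tb is None else min(cap_tb, unmetered_tb)
    return monthly_usd / deliverable

# Hypothetical offers:
print(cost_per_tb(5, 1, cap_tb=20))        # $5 node, 1 Gbit/s, 20 TB cap
print(round(cost_per_tb(40, 10), 4))       # $40 unmetered 10 Gbit/s at 50% util
```

Ranking candidate providers by this number (rather than by headline port speed) would make the "evaluate cost vs bandwidth efficiency" task concrete.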