Description
User Story
As a tooling developer, I want data to be collected consistently, without failures caused by rate limits applied by any source code repository platform.
Detailed Requirement
GitHub (obviously) applies rate limits to API calls, which we rely on heavily to collect data. As we expand the number of topics we collect, we need to be cognisant of those limits and amend our approach to spread the collection period over multiple hours.
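For reference, a minimal sketch of checking the remaining quota before a collection run, using GitHub's `/rate_limit` endpoint. The `GITHUB_TOKEN` environment variable, the `requests` dependency and the safety margin of 100 calls are all assumptions for illustration, not decisions recorded here:

```python
# Check how many core API calls remain before kicking off a collection run.
# Assumes a GITHUB_TOKEN environment variable and the `requests` library.
import os
import time

import requests

API = "https://api.github.com"


def remaining_core_quota(token: str) -> tuple[int, int]:
    """Return (remaining requests, epoch seconds when the window resets)."""
    resp = requests.get(
        f"{API}/rate_limit",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return core["remaining"], core["reset"]


if __name__ == "__main__":
    remaining, reset = remaining_core_quota(os.environ["GITHUB_TOKEN"])
    if remaining < 100:  # arbitrary safety margin for the example
        wait = max(0, reset - int(time.time()))
        print(f"Only {remaining} calls left; window resets in {wait}s")
```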
There are a few approaches:
- A simple manual slicing of the workload based on the alphabet (low sophistication, much manual tweaking).
- Splitting the build into multiple steps to seed files for later processing (medium sophistication, limited manual tweaking).
- Splitting the build as per option 2 and using a dependency mechanism to allow a build to trigger others (high sophistication, largely automated).
Option 3 seems feasible. The most sensible sequence would be:
- Run a "collection" mechanism to get the superset of repositories we will query for their metadata.
- Based on the collected data, shard the data set into multiple groups, each bound to a given schedule.
- Write workflow files based on the known rate limits of the given repository platform, the target data set and the schedule (both steps are sketched after this list).
- Allow the builds to run of their own volition.
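As referenced above, here is an illustrative sketch of the sharding and workflow-writing steps. The seed file name (`repos.json`), the shard count, the cron spacing and the `collect.py` script it invokes are assumptions made for the example, not decisions recorded in this issue:

```python
# Shard a seed list of repositories and write one scheduled workflow per shard.
import json
from pathlib import Path

SHARDS = 4  # assumed; in practice derived from rate limits and data set size

WORKFLOW_TEMPLATE = """\
name: collect-shard-{index}
on:
  schedule:
    - cron: "0 {hour} * * *"
jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python collect.py --shard shards/shard-{index}.json
        env:
          GITHUB_TOKEN: ${{{{ secrets.GITHUB_TOKEN }}}}
"""


def main() -> None:
    repos = json.loads(Path("repos.json").read_text())
    Path("shards").mkdir(exist_ok=True)
    Path(".github/workflows").mkdir(parents=True, exist_ok=True)

    for index in range(SHARDS):
        # Round-robin assignment keeps shard sizes within one repo of each other.
        shard = repos[index::SHARDS]
        Path(f"shards/shard-{index}.json").write_text(json.dumps(shard, indent=2))
        # Space the schedules a few hours apart so each shard starts
        # with a fresh rate-limit window.
        workflow = WORKFLOW_TEMPLATE.format(index=index, hour=(index * 3) % 24)
        Path(f".github/workflows/collect-shard-{index}.yml").write_text(workflow)


if __name__ == "__main__":
    main()
```

Generating the workflow files from a template like this keeps the scheduling logic in one place, so the shard count and spacing can be retuned as the data set grows.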
This approach should scale as we collect more data. The main thing to be aware of is the overall build time limit, although that should be "OK" as we have a fair amount of headroom for the time being.