
Implement a time-based sharding approach to data collection #19

Open
@SensibleWood

Description


User Story

As a tooling developer, I want data to be collected consistently, without failures caused by the rate limits that source code repository platforms apply.

Detailed Requirement

GitHub (obviously) applies rate limits to API calls, which we rely on heavily to collect data. As we expand the number of topics we collect, we need to be cognisant of those limits and amend our approach to spread the collection period over multiple hours.
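For context, the remaining REST quota can be inspected at the start of a run via GitHub's `/rate_limit` endpoint, which would let the tooling size its shards against what is actually available. A minimal sketch, assuming an authenticated token in a `GITHUB_TOKEN` environment variable (not part of the current tooling):

```python
# Hedged sketch: check remaining GitHub REST quota before a collection run.
import os
import requests

def remaining_core_quota(token: str) -> tuple[int, int]:
    """Return (remaining calls, epoch seconds when the quota resets)."""
    resp = requests.get(
        "https://api.github.com/rate_limit",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return core["remaining"], core["reset"]

if __name__ == "__main__":
    remaining, reset = remaining_core_quota(os.environ["GITHUB_TOKEN"])
    print(f"{remaining} calls left; quota resets at epoch {reset}")
```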

There are a few possible approaches:

  1. A simple manual slicing of the workload based on the alphabet (low sophistication, much manual tweaking).
  2. Splitting the build into multiple steps that seed files for later processing (medium sophistication, limited manual tweaking).
  3. Splitting the build as per option 2 and using a dependency mechanism that allows one build to trigger others (high sophistication, largely automated; see the dispatch sketch after this list).
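To illustrate the dependency mechanism in option 3: a finished shard could trigger the next one through the REST `workflow_dispatch` endpoint. This is a hedged sketch; the repository slug, workflow file name, and `shard` input are illustrative assumptions, not decisions made in this issue:

```python
# Hypothetical sketch of option 3's dependency mechanism: a completed shard
# dispatches the next shard's workflow via the GitHub REST API.
import os
import requests

def trigger_next_shard(shard: int, token: str) -> None:
    # OWNER/REPO and collect-shard.yml are placeholders for illustration.
    resp = requests.post(
        "https://api.github.com/repos/OWNER/REPO/actions/workflows/"
        "collect-shard.yml/dispatches",
        headers={"Authorization": f"Bearer {token}"},
        json={"ref": "main", "inputs": {"shard": str(shard)}},
        timeout=10,
    )
    resp.raise_for_status()  # GitHub returns 204 No Content on success

if __name__ == "__main__":
    trigger_next_shard(2, os.environ["GITHUB_TOKEN"])
```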

Option 3 seems feasible. The most sensible flow would be as follows (a sketch of the shard-and-generate step appears after the list):

  • Run a "collection" mechanism to get the superset of repositories we will query for their metadata.
  • Based on the collected data, shard the data set into multiple groups, each bound to a given schedule.
  • Write workflow files based on the known rate limits at a given repository platform, target data set and schedule.
  • Allow the builds to run of their own volition.
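A minimal sketch of the shard-and-generate step, assuming a round-robin split of the collected repository list and one staggered cron workflow per shard. The shard count, file paths, and collection script name are illustrative assumptions:

```python
# Hypothetical sketch: split collected repositories into evenly sized groups
# and emit one scheduled workflow file per group.
from pathlib import Path

WORKFLOW_TEMPLATE = """\
name: collect-shard-{shard}
on:
  schedule:
    - cron: '0 {hour} * * *'   # stagger shards one hour apart
jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python collect.py --shard-file shards/shard-{shard}.txt
"""

def write_shards(repos: list[str], shard_count: int) -> None:
    Path("shards").mkdir(exist_ok=True)
    Path(".github/workflows").mkdir(parents=True, exist_ok=True)
    for shard in range(shard_count):
        # Round-robin assignment keeps the groups evenly sized.
        members = repos[shard::shard_count]
        Path(f"shards/shard-{shard}.txt").write_text("\n".join(members))
        # Using the shard index as the cron hour assumes shard_count <= 24.
        Path(f".github/workflows/collect-shard-{shard}.yml").write_text(
            WORKFLOW_TEMPLATE.format(shard=shard, hour=shard)
        )

if __name__ == "__main__":
    write_shards([f"org/repo-{i}" for i in range(10)], shard_count=4)
```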

This approach should scale as we collect more data. The main thing to be aware of is the overall build time limit, although that should be OK as we have a fair amount of headroom for the time being.

Labels: data (Issue relates to the tooling data collected from data sources), enhancement (New feature or request)
