
Implement a time-based sharding approach to data collection #19

Open
@SensibleWood

Description


User Story

As a tooling developer, I want data to be collected consistently, without failures caused by the rate limits that source code repository platforms apply.

Detailed Requirement

GitHub (obviously) applies rate limits to API calls, which we rely on heavily to collect data. As we expand the number of topics we collect, we need to be cognisant of those limits and amend our approach to spread the collection period over multiple hours.
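For context, the remaining REST quota can be inspected at the start of a run via GitHub's `/rate_limit` endpoint, which would let the tooling size its shards against what is actually available. A minimal sketch, assuming an authenticated token in a `GITHUB_TOKEN` environment variable (not part of the current tooling):

```python
# Hedged sketch: check remaining GitHub REST quota before a collection run.
import os
import requests

def remaining_core_quota(token: str) -> tuple[int, int]:
    """Return (remaining calls, epoch seconds when the quota resets)."""
    resp = requests.get(
        "https://api.github.com/rate_limit",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    return core["remaining"], core["reset"]

if __name__ == "__main__":
    remaining, reset = remaining_core_quota(os.environ["GITHUB_TOKEN"])
    print(f"{remaining} calls left; quota resets at epoch {reset}")
```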

There are a few possible approaches:

  1. A simple manual slicing of the workload based on the alphabet (low sophistication, much manual tweaking).
  2. Splitting the build into multiple steps that seed files for later processing (medium sophistication, limited manual tweaking).
  3. Splitting the build as per option 2 and using a dependency mechanism that allows one build to trigger others (high sophistication, largely automated; see the dispatch sketch after this list).
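To illustrate the dependency mechanism in option 3: a finished shard could trigger the next one through the REST `workflow_dispatch` endpoint. This is a hedged sketch; the repository slug, workflow file name, and `shard` input are illustrative assumptions, not decisions made in this issue:

```python
# Hypothetical sketch of option 3's dependency mechanism: a completed shard
# dispatches the next shard's workflow via the GitHub REST API.
import os
import requests

def trigger_next_shard(shard: int, token: str) -> None:
    # OWNER/REPO and collect-shard.yml are placeholders for illustration.
    resp = requests.post(
        "https://api.github.com/repos/OWNER/REPO/actions/workflows/"
        "collect-shard.yml/dispatches",
        headers={"Authorization": f"Bearer {token}"},
        json={"ref": "main", "inputs": {"shard": str(shard)}},
        timeout=10,
    )
    resp.raise_for_status()  # GitHub returns 204 No Content on success

if __name__ == "__main__":
    trigger_next_shard(2, os.environ["GITHUB_TOKEN"])
```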

Option 3 seems feasible. The most sensible flow would be as follows (a sketch of the shard-and-generate step appears after the list):

  • Run a "collection" mechanism to get the superset of repositories we will query for their metadata.
  • Based on the collected data, shard the data set into multiple groups, each bound to a given schedule.
  • Write workflow files based on the known rate limits at a given repository platform, target data set and schedule.
  • Allow the builds to run of their own volition.
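A minimal sketch of the shard-and-generate step, assuming a round-robin split of the collected repository list and one staggered cron workflow per shard. The shard count, file paths, and collection script name are illustrative assumptions:

```python
# Hypothetical sketch: split collected repositories into evenly sized groups
# and emit one scheduled workflow file per group.
from pathlib import Path

WORKFLOW_TEMPLATE = """\
name: collect-shard-{shard}
on:
  schedule:
    - cron: '0 {hour} * * *'   # stagger shards one hour apart
jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python collect.py --shard-file shards/shard-{shard}.txt
"""

def write_shards(repos: list[str], shard_count: int) -> None:
    Path("shards").mkdir(exist_ok=True)
    Path(".github/workflows").mkdir(parents=True, exist_ok=True)
    for shard in range(shard_count):
        # Round-robin assignment keeps the groups evenly sized.
        members = repos[shard::shard_count]
        Path(f"shards/shard-{shard}.txt").write_text("\n".join(members))
        # Using the shard index as the cron hour assumes shard_count <= 24.
        Path(f".github/workflows/collect-shard-{shard}.yml").write_text(
            WORKFLOW_TEMPLATE.format(shard=shard, hour=shard)
        )

if __name__ == "__main__":
    write_shards([f"org/repo-{i}" for i in range(10)], shard_count=4)
```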

This approach should scale as we collect more data. The main thing to be aware of is the overall build time limit, although that should be OK as we have a fair amount of headroom for the time being.

Labels: data (Issue relates to the tooling data collected from data sources), enhancement (New feature or request)
