The GitHub scraper is subject to limitations of GitHub's REST and GraphQL APIs. The following limitations are known:
- The original creation date of a branch is not available via either API. Git itself records a creation time for each ref, but GitHub does not expose it. As such, we are forced to calculate a branch's age from its commits: age reflects the time between now and the first commit made on the branch. It also means we have no age for branches that were created from trunk but have not had any changes made to them.
- It's possible that some queries may run against a branch that has been deleted. This is unlikely given the speed of the requests, but possible.
- Both APIs have primary and secondary rate limits applied to them. The default rate limit for the GraphQL API is 5,000 points per hour when authenticated with a GitHub Personal Access Token (PAT). If using the GitHub App Auth extension, the rate limit increases to 10,000. The receiver costs on average 4 points per repository (this can fluctuate heavily), allowing it to scrape up to 1,250 repositories per hour under normal conditions. You may use the following equation to roughly calculate your ideal collection interval.
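A rough estimate consistent with the figures above (an assumed average cost of 4 points per repository against the hourly point budget; substitute 10,000 for the budget when using GitHub App Auth):

```math
\text{collection\_interval (seconds)} \;\geq\; \frac{\text{num\_repositories} \times 4 \text{ points}}{\text{rate\_limit (points/hour)}} \times 3600
```

For example, 1,250 repositories against a 5,000-point budget yields a minimum interval of 3,600 seconds (one hour).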
In addition to these primary rate limits, GitHub enforces secondary rate limits to prevent abuse and maintain API availability. The following secondary limit is particularly relevant:
- Concurrent Requests Limit: The API allows no more than 100 concurrent requests, and this limit is shared across the REST and GraphQL APIs. Since the scraper creates a goroutine per repository, having more than 100 repositories returned by the `search_query` will result in exceeding this limit. It is recommended to use the `search_query` config option to limit the number of repositories that are scraped. We recommend one instance of the receiver per team (note: `team` is not a valid qualifier when searching repositories; `topic` is). Remember that each instance of the receiver should have its own corresponding token for authentication, as this is what rate limits are tied to.
In summary, we recommend the following:
- One instance of the receiver per team
- Each instance of the receiver should have its own token
- Leverage the `search_query` config option to limit repositories returned to 100 or fewer per instance, or use `concurrency_limit` to control concurrent requests
- `collection_interval` should be long enough to avoid rate limiting (see the formula above). A sensible default is `300s`.
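Putting the recommendations together, a per-team instance might be configured along these lines. The `topic:team-a` query value is illustrative; the exact query semantics depend on your organization's repository topics:

```yaml
github:
  github_org: my-org
  # Scope the scrape to this team's repositories. `topic` is a valid
  # search qualifier; `team` is not.
  search_query: "org:my-org topic:team-a"
  collection_interval: 300s
  concurrency_limit: 50
```

Each such instance should authenticate with its own token, since rate limits are tracked per token.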
The scraper automatically retries requests that fail with transient HTTP errors using exponential backoff with jitter.
The following responses are retried:
- 502 Bad Gateway -- GitHub's proxy failed to reach the backend
- 503 Service Unavailable -- GitHub is temporarily down for maintenance
- 504 Gateway Timeout -- GitHub's backend took too long to respond
- 429 Too Many Requests -- primary rate limit exceeded
- 403 Forbidden with a `Retry-After` header -- secondary rate limit
Plain 403 responses (permission errors) are not retried. Retries are
bounded by `max_retries` (default 10) and the scrape context, stopping when
the next collection interval begins.
Retry behaviour is configurable under `retry_on_failure`:

```yaml
github:
  github_org: my-org
  retry_on_failure:
    enabled: true              # default
    max_retries: 10            # default; 0 = unlimited (bounded by context)
    initial_interval: 1s       # default
    max_interval: 30s          # default
    multiplier: 1.5            # default
    randomization_factor: 0.5  # default
```

Important: This does not guarantee that the secondary rate limit will not be hit; it simply reduces the likelihood. In large repositories with lots of history to iterate through, the chance of hitting the secondary rate limit increases. If this value is too high, 504/502/403 errors will show up.
The scraper supports limiting the number of concurrent repository processing goroutines to reduce the likelihood of hitting GitHub's 100 concurrent secondary request limit:
```yaml
scrapers:
  scraper:
    github_org: myorg
    concurrency_limit: 50  # Default: 50; set to 0 for unlimited (not recommended)
```

- Default: 50 concurrent goroutines
- Recommendation: keep the default (50) to reduce the likelihood of hitting GitHub's secondary limit of 100 concurrent requests
- For large organizations (>100 repos): consider increasing `collection_interval` in addition to reducing the concurrency limit.
Additional Resources:
Due to the limitations of the GitHub GraphQL and REST APIs, some data retrieved may not be as expected. Notably, there are places in the code, linking back to this section, that make decisions based on these limitations.
Queries are constructed to maximize performance without being overly complex.
Note that there are sections in the code where `BehindBy` is used in
place of `AheadBy` and vice versa. This is a byproduct of the `getBranchData`
query, which returns all the refs (branches) from a given repository along
with a comparison to the default branch (trunk). Comparing here removes the
need to make one query that gets all the ref (branch) names and then a
further query against each branch.
Another byproduct of this method is the skipping of metric creation when the branch is the default branch (trunk) or when no changes have been made to the ref (branch). This is done for three main reasons.
- The default branch will always be a long-lived branch and may end up with more commits than can possibly be queried at a given time.
- The default branch is the trunk into which all changes should flow. The intent of these metrics is to provide useful signals that help identify cognitive overhead and bottlenecks.
- GitHub does not provide any means to determine when a branch was actually created. Git itself records a creation time for each ref off the trunk, but GitHub does not expose this data via its APIs, so we have to calculate age based on the commits added to the branch.
We also have to calculate the number of pages before fetching the commit data. This is because you have to know the exact number of commits added to the branch; otherwise you would get all commits ever made on both trunk and the branch. From there we can evaluate the commits on each branch. To calculate a branch's age, we need the commits that have been added to it, because GitHub does not provide a branch's actual creation date through either of its APIs.
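The page calculation described above amounts to a ceiling division of the branch's commit count by the page size. A minimal sketch, assuming a page size of 100 (the GraphQL maximum per page):

```go
package main

import "fmt"

// pages returns the number of pages needed to fetch commitCount commits
// at perPage commits per page (ceiling division).
func pages(commitCount, perPage int) int {
	if commitCount <= 0 {
		return 0
	}
	return (commitCount + perPage - 1) / perPage
}

func main() {
	fmt.Println(pages(250, 100)) // 3
	fmt.Println(pages(0, 100))   // 0: no commits added, so no age metric
}
```

Knowing the count up front is what lets the scraper page through only the commits unique to the branch, rather than walking trunk's entire history.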