Scrape github home for repo description, stars, and tags #511

rkent · 2025-04-08T22:24:14Z

Here we scrape the homepage of github repositories, extracting then displaying various items.

This PR started as a followup to a (perhaps future) proposal to search github for ROS packages, and the stars were needed to rate repos. Later repo descriptions and tags were added since, why not? But it turns out that the repo descriptions in the repo list are the most useful of the scraped items. (You can see the current result of this PR, as well as a few others such as including download counts and discovered github packages, in https://dev-rosindex.rosdabbler.com/).

I expect a couple of controversies with this PR:

Scraping the web page.

Scraping the web page like this is sensitive to future, unannounced changes to the web page layout by Github. The same information is also available in the Github API, but that is rate limited to just 60 requests per hour unauthenticated. It might be possible to authenticate the request, which allows 5000 requests per hour, but that could require additional effort by ros-infrastructure to setup and manage an account. I am suggesting in this PR that we do the scraping, but beware in the future that might have to change if the stability is not sufficient.

Use of Github stars to rate packages.

At least @nuclearsandwich is not enthusiastic about using github stars. Obviously I think it is useful (though imperfect like any metric), and worth doing to improve the ability of rosindex to identify significant packages.

In a future PR I may propose (and dev-rosindex shows) using download counts as an additional metric. That is also useful, but there is surprisingly little correlation between that and stars. Download counts are better for locating common utility repos that are useful but not that exciting, while github stars shows repos that specifically impressed a number of different users. So something like slam_toolbox has one of the highest Github stars ratings (1900) while only in the 10th percentile for download counts.

Signed-off-by: R Kent James <[email protected]>

rkent · 2025-04-08T22:31:49Z

Sorry, noticed a small problem after I opened this PR.

tfoote

With this being github specific it seems that using the API might be better.

For example: curl -L -H "Accept: application/vnd.github+json" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/repos/ros2/rclcpp

Especially if we're doing this for a bunch of packages we'll potentially want/need to use a token. I'd love to get this into the prefetch with caching instead of in the Jekyll too. In the same way that we precache the pip descriptions.

rkent · 2025-04-10T01:05:49Z

How does a token work with running on the buildfarm? I use a token myself in my github scapes, so I an familiar with it, but the limits are per account. I don't know if the "official" ROS github accounts has other competing uses, or if you can use a on-off account of some sort for this.

tfoote · 2025-04-11T22:24:52Z

So in this case, I suspect that our usage will be below the need for a token. There's a volume/rate limit for anonymous access.

But if we need a token I can provision it into the job configuration on the buildfarm. We have a number of bot accounts that we could leverage. Especially for this we'd want a bot with no permissions so that there's virtually no security risk.

We can have the script pick up the token from the environment if it exists.

rkent · 2025-04-11T22:29:43Z

As I said, the API limit for unauthenticated access is only 60 per hour. We need one access per repo, so there is no way that could be done with unauthenticated access.

My intentions are to write the script to use the token if it exists, otherwise revert to the scrape.

Re "I'd love to get this into the prefetch with caching instead of in the Jekyll too" that is more complex here because you need to know the list of repos to do that, which is an earlier step in the Ruby megafile. I'm investigating the options though.

tfoote · 2025-04-15T06:34:27Z

Re "I'd love to get this into the prefetch with caching instead of in the Jekyll too" that is more complex here because you need to know the list of repos to do that, which is an earlier step in the Ruby megafile. I'm investigating the options though.

Ahh for that, I would suggest that we just do things in the rosdistro. You can iterate the rosdistro pretty quickly using the python-rosdistro library.

import rosdistro

index = rosdistro.get_index(rosdistro.get_index_url())
for dist in index.distributions:
    distro_file = rosdistro.get_distribution_file(index,dist)
    repositories = distro_file.repositories

    print(f"Repositories in ROS distribution '{dist}':")
    for repo_name in sorted(repositories.keys()):
        repo_data = repositories[repo_name].get_data()
        print(f"- {repo_name}, {repo_data['source'] if 'source' in repo_data  else '<no source repo>'}")

With that it just needs to filter for github urls into a set, iterate them and write the files to a cache. And in Jekyll it can just query said cache instead of walking at generate time. If the cache misses the data goes to a default.

The cache could have a last updated timestamp and a TTL to prevent re-crawling too quickly.

The rosdistro DistributionCache also has all the packages by name as well as their full package.xml https://github.com/ros-infrastructure/rosdistro/blob/master/src/rosdistro/distribution_cache.py which is what we can get most if not all the metadata we need from for #444

Scrape github home for repo description, stars, and tags

ebc6acd

Signed-off-by: R Kent James <[email protected]>

rkent force-pushed the scrape-repo-home branch from 8dcf777 to ebc6acd Compare April 8, 2025 22:30

tfoote reviewed Apr 10, 2025

View reviewed changes

rkent mentioned this pull request Jun 4, 2025

Move discovery to a separate process #538

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Scrape github home for repo description, stars, and tags #511

Scrape github home for repo description, stars, and tags #511

Uh oh!

rkent commented Apr 8, 2025

Uh oh!

rkent commented Apr 8, 2025

Uh oh!

tfoote left a comment

Uh oh!

rkent commented Apr 10, 2025

Uh oh!

tfoote commented Apr 11, 2025 •

edited

Loading

Uh oh!

rkent commented Apr 11, 2025

Uh oh!

tfoote commented Apr 15, 2025

Uh oh!

Uh oh!

Scrape github home for repo description, stars, and tags #511

Are you sure you want to change the base?

Scrape github home for repo description, stars, and tags #511

Uh oh!

Conversation

rkent commented Apr 8, 2025

Uh oh!

rkent commented Apr 8, 2025

Uh oh!

tfoote left a comment

Choose a reason for hiding this comment

Uh oh!

rkent commented Apr 10, 2025

Uh oh!

tfoote commented Apr 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rkent commented Apr 11, 2025

Uh oh!

tfoote commented Apr 15, 2025

Uh oh!

Uh oh!

tfoote commented Apr 11, 2025 •

edited

Loading