Scrape github home for repo description, stars, and tags #511
base: ros2
Conversation
Signed-off-by: R Kent James <[email protected]>
Sorry, noticed a small problem after I opened this PR.
With this being github specific it seems that using the API might be better.
For example: curl -L -H "Accept: application/vnd.github+json" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/repos/ros2/rclcpp
Especially if we're doing this for a bunch of packages, we'll potentially want/need to use a token. I'd love to get this into the prefetch with caching instead of in the Jekyll too, in the same way that we precache the pip descriptions.
How does a token work with running on the buildfarm? I use a token myself in my github scrapes, so I am familiar with it, but the limits are per account. I don't know if the "official" ROS github accounts have other competing uses, or if you can use a one-off account of some sort for this.
So in this case, I suspect that our usage will be below the need for a token. There's a volume/rate limit for anonymous access. But if we need a token I can provision it into the job configuration on the buildfarm. We have a number of bot accounts that we could leverage. Especially for this we'd want a bot with no permissions so that there's virtually no security risk. We can have the script pick up the token from the environment if it exists.
As I said, the API limit for unauthenticated access is only 60 requests per hour. We need one access per repo, so there is no way that could be done with unauthenticated access. My intention is to write the script to use the token if it exists, and otherwise revert to the scrape. Re "I'd love to get this into the prefetch with caching instead of in the Jekyll too": that is more complex here because you need to know the list of repos to do that, which is an earlier step in the Ruby megafile. I'm investigating the options though.
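Roughly what I have in mind, as a sketch only: the GITHUB_TOKEN variable name and the scrape_repo_page helper standing in for the existing scraping path are my placeholders, not settled names.

```python
import os

import requests

API_HEADERS = {
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28",
}


def fetch_repo_metadata(owner, repo):
    """Use the GitHub API when a token is available, otherwise fall back to scraping."""
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers = dict(API_HEADERS, Authorization="Bearer " + token)
        resp = requests.get(
            "https://api.github.com/repos/%s/%s" % (owner, repo),
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        return {
            "description": data.get("description"),
            "stars": data.get("stargazers_count"),
            "tags": data.get("topics", []),
        }
    # No token: stay under the 60 requests/hour anonymous API limit
    # by scraping the repo home page instead.
    return scrape_repo_page(owner, repo)  # placeholder for the scraping path
```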
Ahh for that, I would suggest that we just do things in the rosdistro. You can iterate the rosdistro pretty quickly using the python-rosdistro library.
With that it just needs to filter the github urls into a set, iterate them, and write the files to a cache. And in Jekyll it can just query said cache instead of walking at generate time. If the cache misses, the data falls back to a default. The cache could have a last-updated timestamp and a TTL to prevent re-crawling too quickly; see the sketch below. The rosdistro DistributionCache also has all the packages by name as well as their full package.xml https://github.com/ros-infrastructure/rosdistro/blob/master/src/rosdistro/distribution_cache.py which is where we can get most, if not all, of the metadata we need for #444
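Something like this, as a rough sketch: it assumes python-rosdistro's get_index/get_distribution entry points, and the cache location, TTL value, and fetch_repo_metadata_from step are placeholders.

```python
import json
import os
import time

from rosdistro import get_distribution, get_index, get_index_url

CACHE_DIR = "cache/github"  # placeholder cache location
TTL = 24 * 60 * 60          # placeholder TTL: one day


def github_repo_urls(dist_name):
    """Collect the set of GitHub repository URLs referenced by a rosdistro."""
    index = get_index(get_index_url())
    dist = get_distribution(index, dist_name)
    urls = set()
    for repo in dist.repositories.values():
        src = repo.source_repository
        if src and src.url and "github.com" in src.url:
            urls.add(src.url)
    return urls


def refresh_cache(urls):
    """Re-crawl only the entries whose cached timestamp is older than the TTL."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    now = time.time()
    for url in urls:
        # Naive owner/repo extraction; good enough for a sketch.
        slug = url.rstrip("/").split("github.com/")[-1].replace("/", "__")
        path = os.path.join(CACHE_DIR, slug + ".json")
        if os.path.exists(path):
            with open(path) as f:
                if now - json.load(f).get("updated", 0) < TTL:
                    continue  # cache entry is still fresh; skip re-crawling
        data = fetch_repo_metadata_from(url)  # placeholder for the API/scrape step
        data["updated"] = now
        with open(path, "w") as f:
            json.dump(data, f)
```

At generate time Jekyll would then only read these JSON files, with anything missing falling back to the default.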
Here we scrape the homepage of github repositories, extracting and then displaying various items.
This PR started as a followup to a (perhaps future) proposal to search github for ROS packages, and the stars were needed to rate repos. Later, repo descriptions and tags were added as well (why not?). But it turns out that the repo descriptions in the repo list are the most useful of the scraped items. (You can see the current result of this PR, as well as a few others such as including download counts and discovered github packages, at https://dev-rosindex.rosdabbler.com/).
I expect a couple of controversies with this PR:
Scraping the web page like this is sensitive to future, unannounced changes to the web page layout by Github. The same information is also available through the Github API, but that is rate limited to just 60 requests per hour unauthenticated. It might be possible to authenticate the requests, which allows 5000 requests per hour, but that could require additional effort by ros-infrastructure to set up and manage an account. I am suggesting in this PR that we do the scraping, but be aware that this might have to change in the future if the stability is not sufficient.
@nuclearsandwich, at least, is not enthusiastic about using github stars. Obviously I think it is useful (though imperfect like any metric), and worth doing to improve the ability of rosindex to identify significant packages.
In a future PR I may propose (and dev-rosindex shows) using download counts as an additional metric. That is also useful, but there is surprisingly little correlation between downloads and stars. Download counts are better for locating common utility repos that are useful but not that exciting, while github stars highlight repos that specifically impressed a number of different users. So something like slam_toolbox has one of the highest Github star ratings (1900) while ranking only in the 10th percentile for download counts.