Conversation

@jcgraybill
Contributor

I've been auditing Tigerbrew formulas, and it looks like a number of the URLs are out of date. I suspect that updating URLs will get some packages that currently don't build working again. I'm trying this in phases.

This first commit updates homepage urls, as these are informational rather than functional. The changes here are:

  1. I updated all http urls to https if an https url was available - this was done with automation.
  2. I followed all 301 (permanent) redirects - this was done with automation
  3. I manually reviewed 302 (temporary) redirects, 4xx errors, other errors, and otherwise nonfunctional websites, and manually fixed the ones that could be fixed. I either followed links on the existing websites, or used other URLs in the formula (such as the head url) to find the existing project. I didn't update any 302 redirects that looked truly temporary - for example, a lot of sourceforge projects have what seems to be a canonical url that redirects to the project's specific page, and I left those canonical urls in place.
  4. While doing this, I found some packages that look like they're truly nonfunctional - I'm going to confirm this, and may follow up with PRs to remove them. They are tracked in jcgraybill/tigerbrew#3 ("Formulas that reference projects which have disappeared").
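
For anyone wanting to reproduce step 1, here is a minimal sketch of the http-to-https upgrade check. It is not the script used for this PR; the names `https_candidate`, `upgrade_homepage`, and the `probe` callback are made up for illustration, and `probe` is assumed to return an HTTP status code for a URL (for example by wrapping curl).

```python
from urllib.parse import urlsplit, urlunsplit


def https_candidate(url):
    """Return the https:// form of an http:// URL, or None if it is not http."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return None
    return urlunsplit(parts._replace(scheme="https"))


def upgrade_homepage(url, probe):
    """Prefer the https URL, but only when probe() reports HTTP 200 for it.

    probe is a caller-supplied (hypothetical) function: URL in, int status out.
    """
    candidate = https_candidate(url)
    if candidate is not None and probe(candidate) == 200:
        return candidate
    return url
```

Injecting `probe` keeps the rewrite rule separate from the network call, so the rule can be checked offline against a table of known responses.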

I realize this is a massive commit - 1,793 files - which makes it beyond unwieldy to manually review. I'm happy to share some of the scripts I've been using to look for homepage urls with issues, which can be used to check the work here. I can also break the commit up into smaller pieces, or do whatever else might help.

@jcgraybill
Contributor Author

jcgraybill commented May 16, 2025

If this helps, here is a script that can do QA on the PR: if you pipe the contents of the patch file into it (e.g. cat 1365.patch | python3 this-script.py), it will show the before and after HTTP status codes for every changed homepage.

import re
import subprocess
import sys


def http_status(url):
    """Return the status code of a HEAD request via curl (0 if unreachable)."""
    result = subprocess.run(
        ["curl", "-sI", "-o", "/dev/null", "-w", "%{http_code}",
         "--connect-timeout", "2", url],
        stdout=subprocess.PIPE,
    )
    return int(result.stdout.decode("utf-8").strip() or 0)


for line in sys.stdin:
    # A "+++ b/Library/Formula/<name>.rb" header starts each formula's diff.
    package = re.findall(r"\+ b/Library/Formula/([^/]*)\.rb$", line)
    if package:
        print("\n" + package[0], end=": ")
    # Removed line: the old homepage.
    old = re.findall(r'-  homepage\s+"([^"]*)"', line)
    if old:
        print(http_status(old[0]), end="->")
    # Added line: the new homepage.
    new = re.findall(r'\+  homepage\s+"([^"]*)"', line)
    if new:
        print(http_status(new[0]), end="")

@jcgraybill
Contributor Author

Interesting stats: after this PR, Tigerbrew still contains:

  • 56 packages whose homepage is a redirect (these are redirects I left in place, from what looked like a canonical url)
  • 370 packages with homepages that do not have an https version.
  • 409 packages whose homepages are 404, server error, or server completely gone.

@mistydemeo
Owner

Thank you for this! I'll try to review this soon.

@jcgraybill
Contributor Author

While doing some spot checks this weekend, I found a few of the new URLs that I don't love. Let me merge one more commit today before you spend any time looking at this. Thank you!

@mistydemeo
Owner

Sounds good. Thank you again for all the work on this!

@jcgraybill
Contributor Author

jcgraybill commented May 18, 2025

Okay, thanks for your patience. A handful of Sourceforge homepages were outdated but not returning 404s or redirects, and a couple of homepages were redirecting to spam sites on urls that had expired. I fixed these, and also standardized all sourceforge URLs to https://project-name.sourceforge.net/ wherever possible. Along the way I found a few more packages where the source project may have gone missing, and added those to my list to investigate later.

I think this is ready to be looked at now. Let me know if there's anything I can do to make it easier to review, of course.

@jcgraybill
Contributor Author

One idea I had: I could split this into two PRs. Most of these updates are simply http to https in the URL with no other changes: it should be easy to just confirm those in bulk. That would leave a much smaller number of updates to look at manually.

If that would help, say the word, and I'll go figure out some git stuff.

@sevan
Collaborator

sevan commented May 20, 2025

@mistydemeo there are now 3 pull requests which touch more than a thousand files. Shall we split it letter by letter between us to review?

@jcgraybill
Contributor Author

If you want, I'd be happy to divide these into single-letter PRs.

@sevan
Collaborator

sevan commented May 22, 2025

> If you want, I'd be happy to divide these into single-letter PRs.

Single PR for a change is fine, was just concerned about having to review one large diff which touches a thousand plus files.
Personal preference would've been commits by letter in a single PR so it is easier on the reviewer(s), but as I said, just a preference. Not a rule.

@jcgraybill
Contributor Author

That makes total sense. Here - I split this PR up into single-letter commits, and I'll go do the same to #1374. Thanks for the suggestion!

@sevan
Collaborator

sevan commented May 22, 2025

Thanks for the change. Will help to review these over the next couple of days.

@sevan
Collaborator

sevan commented May 24, 2025

A, B, C, D, E, S done.

@jcgraybill
Contributor Author

jcgraybill commented Jun 9, 2025

What do you think is the best practice for SourceForge homepage URLs? SourceForge looks to have at least four types of URLs for a project:

  1. A "web space" at a subdomain of sourceforge.io - this is roughly equivalent to Github pages at github.io. E.g., https://tkdiff.sourceforge.io/
  2. One where the "url name" for the project is a subdomain of sourceforge.net, with no path. E.g., https://qstat.sourceforge.net/ (edit: it looks like this isn't very consistent - I'm now finding some projectname.sourceforge.net urls that are used as the webspace/home page.)
  3. A url with a path to the project. E.g., https://sourceforge.net/projects/tkdiff/
  4. A rarely used "short" url that returns a 301 redirect to 3. E.g., https://sourceforge.net/p/tkdiff/
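
To make the four shapes concrete, here is a rough sketch of a classifier for them. The labels and the function name `classify_sourceforge` are made up for illustration, and the regexes only cover the forms listed above, so treat it as a starting point rather than a complete description of SourceForge's URL scheme. Note that it also accepts a path after the subdomain forms, since some projects host their pages under one.

```python
import re

# One pattern per URL shape in the list above. Labels are illustrative only.
_PATTERNS = [
    ("web-space", re.compile(r"^https?://[^./]+\.sourceforge\.io(/|$)")),
    ("subdomain", re.compile(r"^https?://[^./]+\.sourceforge\.net(/|$)")),
    ("project-path", re.compile(r"^https?://sourceforge\.net/projects/[^/]+/?$")),
    ("short-path", re.compile(r"^https?://sourceforge\.net/p/[^/]+/?$")),
]


def classify_sourceforge(url):
    """Return a label for a SourceForge homepage URL, or None if none matches."""
    for label, pattern in _PATTERNS:
        if pattern.match(url):
            return label
    return None
```

Running this over every `homepage` line in Library/Formula would give a per-type count that can be compared between branches.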

Based on whether a project uses its web space, the first two URLs will do 301 redirects between each other, or 302 redirects to the full path to the project (type 3). If there's a web space, it's a pretty sure thing that it's the project's homepage, so that's straightforward.

If there isn't a webspace, I'm thinking of an approach where we follow 301 ("Moved Permanently") redirects, and don't follow 302 ("Found", formerly "Moved Temporarily") redirects - akin to what a search indexer is supposed to do. So if SourceForge redirects https://qstat.sourceforge.net/ to https://sourceforge.net/projects/qstat/ with a 302 redirect, the homepage url we'd use is https://qstat.sourceforge.net/, on the premise that SourceForge is telling us that they reserve the right to change that redirect in the future. OTOH we could use the longer URL on the premise that it's less opaque, saves users a redirect, and is so widely used that it seems really unlikely to change in practice.
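
A sketch of that rule, to make the proposal concrete: follow 301s, stop at anything else, including a 302. The function name `canonical_homepage` and the `fetch` callback are hypothetical; `fetch` is assumed to return a `(status, location)` pair for a URL without following redirects itself.

```python
from urllib.parse import urljoin


def canonical_homepage(url, fetch, max_hops=5):
    """Follow 301 redirects from url and stop at any other status.

    fetch is a caller-supplied (hypothetical) function that returns
    (status_code, location_or_None) and does not follow redirects.
    """
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(url)
        if status != 301 or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        target = urljoin(url, location)
        if target in seen:
            break  # redirect loop
        url = target
        seen.add(url)
    return url
```

With this rule, https://qstat.sourceforge.net/ stays as the homepage when it answers with a 302, while an http URL that 301s to its https equivalent gets rewritten.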

Taking this slightly further, if the homepage for a formula is already in the form of https://sourceforge.net/projects/qstat/, but a url like https://qstat.sourceforge.net/ exists and returns a 302 redirect to it, should we assume that the latter is a more canonical url for the project, and change the formula's homepage to be the shorter url?

Some stats about what's in the codebase today. If there's a preference for one approach or another, I could amend this PR to be more standardized than I've been.

URL pattern / branch        master   update-homepages
sourceforge.io                   6                 67
sourceforge.net                263                314
sourceforge.net/p/               6                  5
sourceforge.net/projects/       41                 14
@sevan
Collaborator

sevan commented Jun 10, 2025

> What do you think is the best practice for SourceForge homepage URLs? SourceForge looks to have at least four types of URLs for a project:

Some of your replacements are valid, hence I have left those alone.

jcgraybill added a commit to jcgraybill/tigerbrew that referenced this pull request Jun 10, 2025
@jcgraybill
Contributor Author

I'm going through and doing letter-by-letter manual review of these, aiming to proactively fix the kinds of issues you've spotted in earlier letters. I'm about six letters ahead of you now :). Apologies that this makes the patch messy. If you're using the patch to review this and it's annoying to look in two places, let me know and I can try to do some git shenanigans to clean that up.

@jcgraybill
Contributor Author

jcgraybill commented Jun 19, 2025

> I'm going through and doing letter-by-letter manual review of these, aiming to proactively fix the kinds of issues you've spotted in earlier letters. I'm about six letters ahead of you now :). Apologies that this makes the patch messy. If you're using the patch to review this and it's annoying to look in two places, let me know and I can try to do some git shenanigans to clean that up.

...and, done! This should be much easier to review now - I've confirmed every homepage here is an actual homepage for the software in question, or the most current archive.org snapshot of a page if I wasn't able to find an extant one.

This leaves around 350 homepage entries in formulas that are invalid in some way, but where there aren't redirects in place to follow. Using the Wayback Machine API I have a process where I can pretty quickly find the latest valid Wayback Machine snapshot for each of these, so I'll go ahead and do that next. I'll commit these letter-by-letter as I fix them.
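
For reference, a snapshot lookup along these lines can be sketched as follows. This is not the process used for this PR; it assumes the Wayback Machine availability endpoint at https://archive.org/wayback/available, which, as far as I understand it, returns the most recent capture when no timestamp is given. The function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the availability API request URL for a homepage."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(payload):
    """Pull the snapshot URL out of a decoded API response, or return None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available") and closest.get("status") == "200":
        return closest["url"]
    return None


def latest_snapshot(url, timeout=10):
    """Query the API over the network and return a snapshot URL or None."""
    with urlopen(availability_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Only `latest_snapshot` touches the network; the other two functions can be checked against a saved response.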

@jcgraybill
Contributor Author

jcgraybill commented Jun 21, 2025

Okay - with those updates in place, I've now confirmed that every tigerbrew homepage URL that didn't return an http 200 now goes to a valid homepage for the project, and every http url that could be upgraded to https has been. Whew!

@sevan
Collaborator

sevan commented Jun 21, 2025

Well done.
One last thing before I wade in to review, can you trim the commit history down to 1 commit per letter? you're up to 82 commits so far at the moment. :)

@jcgraybill jcgraybill reopened this Jun 22, 2025
@jcgraybill
Contributor Author

Reorganized!


3 participants