Description
Here is a copy of some thoughts on the topic:
"""
original.boost.org has an algorithm for forwarding you to the same page in the most recent version, just as you are showing. But it first checks whether that page exists. If the newer page doesn't exist, it falls back to the main page for the library, and then to the main page for the Boost version: a series of verifications and fallbacks. I believe website-v2 should implement this the same way as original.boost.org, since that algorithm is not overly complicated or difficult to understand and use.

There could be an enhancement later, a step 2, which is to have AI review all the results, propose improvements, and put those into a separate db table. The system would need to revisit that table after every new Boost release, since the results change every release. That is a bit expensive, a large query for AI, and in my opinion overkill for a feature that is basic: "Here is a link to the new doc version. It's going to be right 90% of the time. Check first if it's valid." No need for AI. If original.boost.org could do that, website-v2 can too. Record the success or failure of the check in a db column.
"""
From earlier:
"""
What's the simplest solution that could possibly work...
Maybe it's shockingly simple. The thing is just to make the smallest change needed so that redirects stop occasionally landing on a 404. That's enough for now. Follow the idea from original.boost.org, and that just means a quick verification step.
After the renderedcontent table is cleared, the first visit to a page would be approximately twice as expensive as a normal page load, not 5 or 10 times. It should be possible for Django to handle it inline, without a hand-off to Celery. Retrieve the requested page from S3, and then confirm whether the 'latest' page exists. That does not require a full download of the page. "To check if an object exists in a bucket using Boto3, call the head_object method on the S3 client, passing in the bucket and key."
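A minimal sketch of that existence check, assuming a Boto3 S3 client is passed in. The function name `s3_object_exists` and the bucket and key in the usage comment are hypothetical. The error is inspected through the `response` attribute that botocore's `ClientError` carries, so the sketch itself does not need to import botocore.

```python
def s3_object_exists(client, bucket, key):
    """Return True if the object exists, False if S3 reports it missing."""
    try:
        # HEAD fetches metadata only; the object body is not transferred.
        client.head_object(Bucket=bucket, Key=key)
    except Exception as exc:
        # botocore.exceptions.ClientError exposes the parsed error as a dict
        # in `exc.response`; a missing key on HEAD is reported as code "404".
        response = getattr(exc, "response", None) or {}
        code = str(response.get("Error", {}).get("Code", ""))
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # permission errors, throttling, etc. do not mean "missing"
    return True


# Usage (hypothetical bucket and key):
#   import boto3
#   s3_object_exists(boto3.client("s3"), "example-bucket",
#                    "latest/libs/foo/index.html")
```

Anything other than a "not found" answer is re-raised, so a transient S3 error would not be recorded as a missing page.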
Store the value as an integer in the renderedcontent table: 0/empty -> not yet known. 1 -> the file was present. 2 -> the file was missing.
In the future, results from querying the renderedcontent table will include this number.
0 -> Still unknown. Query for the existence of the 'latest' version of the page, and update the table with a 1 or 2.
1 -> Form the canonical link by swapping 'latest' into the string (as is done currently).
2 -> Form the canonical link from the known index.html of the latest version of this Boost library. As an enhancement, this URL could also be stored in a column of the renderedcontent table, so that future retrieval takes one lookup and doesn't trigger a second db call.
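The three states above could be handled roughly as follows. This is a sketch, not website-v2 code: the constants, `canonical_url`, the `latest_exists` callable (for example, a wrapper around an S3 HEAD request), and the example path layout are all hypothetical.

```python
UNKNOWN, PRESENT, MISSING = 0, 1, 2  # values proposed for the status column


def canonical_url(path, version, status, library_index, latest_exists):
    """Return (canonical_path, status), resolving an unknown status once.

    path          -- versioned doc path, e.g. "/doc/libs/1_80_0/libs/foo/a.html"
    version       -- the version segment to swap out, e.g. "1_80_0"
    status        -- stored value: 0/None unknown, 1 present, 2 missing
    library_index -- known index.html of the library's latest version
    latest_exists -- callable(path) -> bool (assumed helper)
    """
    latest_path = path.replace(version, "latest", 1)
    if not status:  # 0 or empty: check once, then remember the answer
        status = PRESENT if latest_exists(latest_path) else MISSING
    if status == PRESENT:
        return latest_path, status
    return library_index, status
```

The caller would write the returned status back to the table, so later requests for the same page skip the existence check.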
When a library is relocated, such as to /doc/antora/ , the pages are often completely redesigned anyway. Trying to map them is too much. It's overkill.
Already, Google will trend towards showing the 'latest' version in search results, more and more. As that happens, it becomes less important to spend time improving the old version pages. This is merely a small detail on an old version page.
"""
The exact boost library is not known...
"""
In that case, make a best-effort attempt.
The code from original.boost.org is here: https://github.com/boostorg/website/blob/master/common/code/boost_documentation.php
It shows one of two messages: "Click here to view this page for the latest version." or, when it can't discover the answer, "Click here for the latest Boost documentation."
Just as original.boost.org tries a few fallbacks before giving up, so could website-v2. Test each path with an AWS S3 head request.
1. The equivalent 'latest' page is present. Success. Record a "1". 90% are "1".
If that check fails, work through the following list, record a "2", and also store the discovered result in a new field.
2. If the current path is /libs/library_name/.* try /libs/library_name/index.html
3. If the current path is /libs/library_name/.* try /doc/antora/library_name/index.html. NOT NEEDED. Should match 2 every time. SKIP THIS.
4. If in /doc/html/boost_(library_name)/.* try /libs/library_name/index.html
5. If in /doc/html/(library_name)/.* try /libs/library_name/index.html
6. If /doc/html/(library_name).html try /libs/library_name/index.html
7. If /doc/html/boost_(library_name).html try /libs/library_name/index.html
8. Final fallback is https://www.boost.org/libraries/latest .
Since the above are regexes, not all of them would be attempted: a given path wouldn't match all of those patterns, maybe only 1-2 of them. Thus the number of checks for any given page is not 8. Perhaps 1-2 attempts for any one doc page, and then it hits the final default.
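The fallback list above can be sketched as a small rule table. This is an illustration under assumptions, not the original.boost.org logic: `RULES`, `candidate_paths`, `resolve_latest`, and the `exists` callable (for example, an S3 HEAD request against the latest version) are hypothetical names, and paths are assumed to be relative to the version root.

```python
import re

FINAL_FALLBACK = "https://www.boost.org/libraries/latest"

# Rules 2 and 4-7 from the list (rule 3 is skipped). Each pattern captures
# the library name, which is then mapped to /libs/<name>/index.html.
RULES = [
    re.compile(r"^/libs/([^/]+)/.*$"),                # rule 2
    re.compile(r"^/doc/html/boost_([^/]+)/.*$"),      # rule 4
    re.compile(r"^/doc/html/([^/]+)/.*$"),            # rule 5
    re.compile(r"^/doc/html/([^/.]+)\.html$"),        # rule 6
    re.compile(r"^/doc/html/boost_([^/.]+)\.html$"),  # rule 7
]


def candidate_paths(path):
    """Yield the fallback paths worth testing for `path`, in rule order."""
    seen = set()
    for rule in RULES:
        match = rule.match(path)
        if match:
            candidate = f"/libs/{match.group(1)}/index.html"
            if candidate not in seen:
                seen.add(candidate)
                yield candidate


def resolve_latest(path, exists):
    """Return (status, target): 1 if the same page exists in 'latest', else 2."""
    if exists(path):  # step 1: the equivalent 'latest' page
        return 1, path
    for candidate in candidate_paths(path):
        if exists(candidate):
            return 2, candidate
    return 2, FINAL_FALLBACK  # rule 8
```

For a path such as `/doc/html/boost_foo/x.html` only rules 4 and 5 match, so at most two extra existence checks run before the final fallback.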
"""