Dynamically create sitemap.xml #4060

Merged 26 commits on Dec 17, 2024

@veganstraightedge veganstraightedge commented Nov 14, 2024

Issues

closes #3780
closes #1192

An investigation and discoveries

So, it turns out that we may never have had a live sitemap.xml or .xml.gz in production, this whole time. 🤦🏻

It seemed to work in development, and it seemed to succeed during the deploy/release in production, but the file was never actually available in production because of… Heroku's ephemeral filesystem.

Ephemeral Disk
Heroku has an “ephemeral” hard drive, this means that you can write files to disk, but those files will not persist after the application is restarted.

So, in development, running bundle exec rails sitemap:create or sitemap:refresh, etc., would create sitemap.xml.gz in our local /public folder, where it would stay. Seems good.

In production, during the release stage of a deploy (as defined in the Procfile), sitemap:refresh would "succeed", but the /public folder the file was created in wouldn't necessarily be on the actual dyno(s) serving any real requests. AFAICT.


My preferred requirements

When working on this, I went round and round trying to make it work with all of these conditions:

  • sitemap_generator gem (great gem, love it, would use in non-Heroku deployed apps)
  • Heroku's ephemeral filesystem
  • and keeping its location on the same domain at the root, where it can be auto-discovered without first reading robots.txt

The gem suggests, and has functionality for, storing the generated file somewhere else (say, S3), but I'd like to keep it in its well-known location.


Conclusion

In the end, I decided to create a sitemaps controller and dynamically create the file to:

  • keep it at /sitemap.xml
  • not have to deal with Heroku's ephemeral filesystem
  • keep its discovery well known and not dependent on robots.txt (a separate issue/PR to do!)
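
To illustrate the "build it on request" idea, here is a minimal plain-Ruby sketch of assembling the XML in memory instead of writing a file to /public. It is not the controller from this PR: `Entry`, `HOST`, `url_node`, and `sitemap_xml` are hypothetical names, and the sample paths are made up.

```ruby
require "time"

# Hypothetical stand-in for a database record (not the app's real model).
Entry = Struct.new(:path, :updated_at)

# Assumed host, for illustration only.
HOST = "https://example.com"

# Build one <url> element; escape the URL so "&" etc. stay valid XML.
def url_node(entry)
  loc = "#{HOST}#{entry.path}".encode(xml: :text)
  lastmod = entry.updated_at.utc.iso8601
  "  <url>\n    <loc>#{loc}</loc>\n    <lastmod>#{lastmod}</lastmod>\n  </url>"
end

# Build the whole document in memory, so nothing is written to disk.
def sitemap_xml(entries)
  [
    %(<?xml version="1.0" encoding="UTF-8"?>),
    %(<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">),
    *entries.map { |entry| url_node(entry) },
    "</urlset>"
  ].join("\n")
end

entries = [
  Entry.new("/articles/1?a=1&b=2", Time.utc(2024, 12, 17)),
  Entry.new("/zines/2", Time.utc(2024, 11, 14))
]
puts sitemap_xml(entries)
```

In a Rails app the same string would be rendered as the response body for /sitemap.xml, so every dyno can serve it without depending on a file created during release.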

The challenge and risk, of course, is performance, namely around articles and some of the tools (zines, etc.), which have the biggest tables to scan. That's a lot of repeated work, since most items in the sitemap never change.

But that almost-never-changing-ness is what allowed me to use Rails' fragment caching around each <url> item in the long <urlset> list and reduce the page load time from ~1s to ~200ms (depending on warm cache, etc.). Even at 1s, it's not the end of the world, since I suspect this file doesn't get read a ton.
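
The per-item caching idea can be sketched in plain Ruby. This is a stand-in for illustration, not Rails' actual cache store or view helper: `FragmentCache`, `Item`, `render_url`, and `render_urlset` are hypothetical names, and the key of `[id, updated_at]` is an assumption about what makes a fragment safe to reuse.

```ruby
# Hypothetical record; updated_at is part of the cache key, so an edit
# produces a new key and the stale fragment is never read again.
Item = Struct.new(:id, :path, :updated_at)

class FragmentCache
  attr_reader :misses

  def initialize
    @store = {}
    @misses = 0
  end

  # Return the cached fragment for key, rendering it only on a miss.
  def fetch(key)
    @store.fetch(key) do
      @misses += 1
      @store[key] = yield
    end
  end
end

# Render a single <url> fragment (the expensive part in a real app).
def render_url(item)
  "<url><loc>https://example.com#{item.path}</loc></url>"
end

# Render every item, going through the cache one fragment at a time.
def render_urlset(items, cache)
  fragments = items.map do |item|
    cache.fetch([item.id, item.updated_at]) { render_url(item) }
  end
  fragments.join("\n")
end

cache = FragmentCache.new
items = [Item.new(1, "/articles/1", 100), Item.new(2, "/zines/2", 100)]

render_urlset(items, cache) # cold cache: both fragments are rendered
render_urlset(items, cache) # warm cache: nothing is rendered again
items[0].updated_at = 200   # editing a record changes its cache key
render_urlset(items, cache) # only the edited item is re-rendered
puts cache.misses
```

Because most records never change, almost every request after the first hits only warm fragments, which is where the drop in load time comes from.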

TODO follow up

  • remove sitemap:refresh from Procfile
  • remove config/sitemap.rb
  • remove sitemap_generator gem
  • keep an eye on this in production
  • add any files/paths that i missed (or skipped for now because i ran out of steam)

@veganstraightedge veganstraightedge marked this pull request as ready for review December 16, 2024 02:32
@veganstraightedge veganstraightedge changed the title Start to fix sitemap.xml Dynamically create sitemap.xml Dec 16, 2024
@veganstraightedge veganstraightedge merged commit aae5061 into main Dec 17, 2024
9 checks passed
@veganstraightedge veganstraightedge deleted the exempt_sitemapxmlgz_path branch December 17, 2024 00:01
veganstraightedge added a commit that referenced this pull request Dec 17, 2024
Same data as `/sitemap.xml`, but as a flat file list of URLs

- #4060

The purpose is to give an archivist a simple way to make a backup of the
whole site using cURL/wget/similar means.

# TODO

- add URLs of CSS/JS files
- add URLs of images (!!!)
- add URLs of PDFs (downloads of zines, posters, etc)
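
A minimal sketch of the flat-file format, assuming it is one absolute URL per line. `sitemap_txt`, the host, and the sample paths are hypothetical and only for illustration.

```ruby
# Turn a list of paths into a plain-text sitemap: one absolute URL per line.
# The default host is an assumption for this example.
def sitemap_txt(paths, host: "https://example.com")
  paths.map { |path| "#{host}#{path}\n" }.join
end

puts sitemap_txt(["/articles/1", "/zines/2"])
```

A list in this shape can be fed straight to a downloader that reads URLs line by line.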
veganstraightedge added a commit that referenced this pull request Dec 17, 2024
Cleanup following:

- #4155
- #4060

# Summary

- remove `sitemap_generator` config initializer 
- remove `sitemap_generator` in `Procfile` and a test
- remove `sitemap_generator` gem
- make xml/txt formats explicit in the routes (`curl .../sitemap.xml`
was getting the `.txt` version mistakenly)
- remove duplicate `/tce` URL in both sitemaps
Successfully merging this pull request may close these issues.

Sitemap ping failing
Figure out process for updating/uploading the sitemap.xml.gz