Fix sitemap generation skipped on subsequent dump runs#11625
Remove conditional check that prevented sitemap regeneration on subsequent runs. The script now always removes and recreates the sitemaps directory, ensuring sitemaps are generated for each dump run instead of only on the first run. Fixes issue where sitemaps were only generated when --overwrite was specified or on the very first run, causing 404 errors on sitemap URLs. Co-authored-by: mekarpeles <978325+mekarpeles@users.noreply.github.com>
@copilot I don't think we want to generate sitemaps if the directory already exists (and it might exist). Do we still need a guard but basically only do sitemaps if the directory doesn't exist? Was that the problem?
You're right to question this. The original bug was that the directory check prevented regeneration when running for NEW dump dates. The issue: sitemaps directory isn't date-specific, so once it exists from 2025-11-01, it blocks generation for 2025-12-01. I see three options:
Which approach do you prefer? The original skip logic would work fine if we check for date-specific sitemap files or use a date-specific directory.
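To make the date-specific option concrete, here is a minimal sketch of a guard keyed on the dump date rather than on the directory alone. It is not taken from `oldump.sh`: the function names `generate_sitemaps` and `sitemaps_if_stale`, the `.generated_<date>` marker file, and the placeholder output file are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for the real sitemap generator; it only writes a
# placeholder file so the guard logic can be exercised.
generate_sitemaps() {
  local out_dir="$1" dump_date="$2"
  printf 'dump %s\n' "$dump_date" > "$out_dir/sitemap_placeholder.txt"
}

# Skip only when a marker for THIS dump date exists; a sitemaps directory
# left over from an earlier date no longer blocks generation.
sitemaps_if_stale() {
  local tmp_dir="$1" dump_date="$2"
  local marker="$tmp_dir/sitemaps/.generated_$dump_date"  # assumed marker name
  if [ -f "$marker" ]; then
    echo "Skipping sitemaps for $dump_date"
    return 0
  fi
  rm -rf "$tmp_dir/sitemaps"
  mkdir -p "$tmp_dir/sitemaps"
  generate_sitemaps "$tmp_dir/sitemaps" "$dump_date"
  touch "$marker"
  echo "Generated sitemaps for $dump_date"
}
```

With this shape, a second run for 2025-11-01 is skipped, while a first run for 2025-12-01 regenerates even though the directory already exists.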
Fixes a bug where the absence of the sitemaps directory prevents sitemaps from being generated after `--overwrite` deletes the `sitemaps/` directory. Closes internetarchive#7581, correctly implements internetarchive#11625
The sitemap generation step in `oldump.sh` was incorrectly checking for directory existence rather than whether sitemaps needed regeneration. Result: sitemaps only generated on first run or with `--overwrite`, causing persistent 404s.
Technical
Before:
After:
Directory persistence ≠ sitemap freshness. Now unconditionally regenerates per dump.
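As a rough illustration of "unconditionally regenerates per dump" (a sketch, not the actual diff from `oldump.sh`), the step could look like the following. The function names `generate_sitemaps` and `regenerate_sitemaps` and the placeholder output file are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for the real sitemap generator; it writes one
# placeholder file named after the dump date.
generate_sitemaps() {
  local out_dir="$1" dump_date="$2"
  printf 'sitemaps for %s\n' "$dump_date" > "$out_dir/sitemap_index_$dump_date.txt"
}

# No existence check: always remove and recreate the directory, then
# regenerate, so output from a previous dump cannot suppress a new run.
regenerate_sitemaps() {
  local tmp_dir="$1" dump_date="$2"
  rm -rf "$tmp_dir/sitemaps"
  mkdir -p "$tmp_dir/sitemaps"
  generate_sitemaps "$tmp_dir/sitemaps" "$dump_date"
}
```

The trade-off is that every run pays the full generation cost, even when the sitemaps for that dump already exist.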
Testing
Run `oldump.sh` twice with same date, different data. Verify sitemaps regenerate both times (previously skipped second run).
Screenshot
N/A - shell script fix
Stakeholders
@mekarpeles
Original prompt
This section details the original issue you should resolve
<issue_title>Sitemaps not generating</issue_title>
<issue_description>Related to #7580
Eg https://openlibrary.org/static/sitemaps/sitemap_authors_0000.xml.gz . Seems like they've been erroring since June...
How to Test
Then check `/1/var/tmp/sitemaps`
Evidence / Screenshot (if possible)
Relevant URL?
https://openlibrary.org/static/sitemaps/sitemap_authors_0000.xml.gz
Steps to Reproduce
Details
Proposal & Constraints
Related files
Stakeholders
</issue_description>
<agent_instructions>There is a bug in the sitemap generation logic of `scripts/oldump.sh` that can cause sitemaps to only ever be generated on the first run (or when `--overwrite` is specified), and then skipped on all subsequent runs, even when new dumps are created.
Analysis:
The script checks whether the `$TMPDIR/sitemaps` directory exists. If not, it creates it and generates sitemaps. If it already exists (from previous runs), it skips generating sitemaps!
Sitemaps are therefore skipped unless `--overwrite` is passed (which wipes the directory), even when new dumps are produced.
Suggested Fix:
Instead of checking only for the `$TMPDIR/sitemaps` directory, check for specific output files or always re-generate sitemaps when a new dump is created.
Repro:
Run the script again without `--overwrite`. The `sitemaps` directory exists already, so the script logs "Skipping sitemaps" and nothing is produced.
Relevant code:
openlibrary/scripts/oldump.sh
Lines 223 to 234 in f89d394
Testing:
We ran the `oldump.sh` script without `--archive` or `--overwrite` using the existing data for `2025-11-30`.
Evidence data exists:
`ol-home0:/1/var/tmp` has `data.txt.gz`, `ol_cdump_2025-11-30/` and `ol_dump_2025-11-30/`
Command:
Immediately completed, said sitemap skipped.
We manually created the sitemaps directory and re-ran with success.</agent_instructions>
Comments on the Issue (you are @copilot in this section)
@mekarpeles There is a bug in the sitemap generation logic of `scripts/oldump.sh` that can cause sitemaps to only ever be generated on the first run (or when `--overwrite` is specified), and then skipped on all subsequent runs, even when new dumps are created.
Analysis:
The script checks whether the `$TMPDIR/sitemaps` directory exists. If not, it creates it and generates sitemaps. If it already exists (from previous runs), it skips generating sitemaps!
Sitemaps are therefore skipped unless `--overwrite` is passed (which wipes the directory), even when new dumps are produced.
Suggested Fix:
Instead of checking only for the `$TMPDIR/sitemaps` directory, check for specific output files or always re-generate sitemaps when a new dump is created.
Repro:
Run the script again without `--overwrite`. The `sitemaps` directory exists already, so the script logs "Skipping sitemaps" and nothing is produced.
Relevant code:
https://github.com/internetarchive/openlibrary/blob/f89d394b2d7e193657b0a02b3893f824791bd23f/s...