
Fix sitemap generation skipped on subsequent dump runs #11625

Closed
Copilot wants to merge 2 commits into master from copilot/fix-sitemaps-not-generating

Conversation

Contributor

Copilot AI commented Dec 24, 2025

The sitemap generation step in oldump.sh checked whether the sitemaps directory existed rather than whether the sitemaps needed regeneration. As a result, sitemaps were generated only on the first run or with --overwrite, causing persistent 404s.

Technical

Before:

if [[ ! -d $TMPDIR/sitemaps ]]
then
    log "generating sitemaps"
    mkdir -p $TMPDIR/sitemaps
    # ... generate ...
else
    log "Skipping sitemaps"  # BUG: skips even for new dumps
fi

After:

log "generating sitemaps"
rm -rf "${TMPDIR:?}/sitemaps"   # Always start from a clean slate; :? aborts if TMPDIR is unset or empty
mkdir -p "$TMPDIR/sitemaps"
# ... generate ...

The directory's existence says nothing about whether the sitemaps are fresh, so the script now regenerates them unconditionally on every dump run.

Testing

Run oldump.sh twice with the same date but different data. Verify that the sitemaps regenerate both times (previously the second run was skipped).
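The difference between the two versions can be reproduced in isolation with a small sketch like the one below. It is not the real oldump.sh: `WORK` is a scratch directory standing in for `$TMPDIR`, `generate_sitemaps` is a hypothetical stub standing in for the `sitemap.py` call, and the dump names are made up.

```shell
#!/usr/bin/env bash
# Sketch only: generate_sitemaps is a stand-in for the real sitemap.py call.
set -euo pipefail

WORK=$(mktemp -d)          # scratch area standing in for $TMPDIR

generate_sitemaps() {      # hypothetical stub: records which dump it was built from
    echo "$1" > "$WORK/sitemaps/built_from.txt"
}

old_step() {               # guard as shown under "Before"
    if [[ ! -d "$WORK/sitemaps" ]]; then
        mkdir -p "$WORK/sitemaps"
        generate_sitemaps "$1"
    else
        echo "Skipping sitemaps"
    fi
}

new_step() {               # unconditional version as shown under "After"
    rm -rf "$WORK/sitemaps"
    mkdir -p "$WORK/sitemaps"
    generate_sitemaps "$1"
}

# Two runs with the old guard: the second run is skipped.
old_step ol_dump_A >/dev/null
old_step ol_dump_B >/dev/null
old_result=$(cat "$WORK/sitemaps/built_from.txt")

# Two runs with the new step: the second run replaces the first.
new_step ol_dump_A
new_step ol_dump_B
new_result=$(cat "$WORK/sitemaps/built_from.txt")

echo "old guard: $old_result"
echo "new step:  $new_result"
rm -rf "$WORK"
```

With the old guard the output still reflects the first dump, while the unconditional step reflects the second one.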

Screenshot

N/A - shell script fix

Stakeholders

@mekarpeles

Original prompt

This section details the original issue you should resolve

<issue_title>Sitemaps not generating</issue_title>
<issue_description>Related to #7580

Eg https://openlibrary.org/static/sitemaps/sitemap_authors_0000.xml.gz . Seems like they've been erroring since June...

How to Test

ssh -A ol-home0
cd /opt/openlibrary
docker exec -it -uopenlibrary openlibrary_cron-jobs_1 bash
cd /1/var/tmp/sitemaps  # make sure it is empty
python /openlibrary/scripts/sitemaps/sitemap.py /1/var/tmp/dumps/ol_dump_2023-02-28/ol_dump_2023-02-28.txt.gz

Then check /1/var/tmp/sitemaps

Evidence / Screenshot (if possible)

Relevant URL?

https://openlibrary.org/static/sitemaps/sitemap_authors_0000.xml.gz

Steps to Reproduce

  1. Go to https://openlibrary.org/static/sitemaps/sitemap_authors_0000.xml.gz
  • Actual: 404
  • Expected: sitemaps

Details

  • Logged in (Y/N)? Y
  • Browser type/version? FF
  • Operating system? Win10
  • Environment (prod/dev/local)? prod

Proposal & Constraints

Related files

Stakeholders

</issue_description>

<agent_instructions>There is a bug in the sitemap generation logic of scripts/oldump.sh that can cause sitemaps to only ever be generated on the first run (or when --overwrite is specified), and then skipped on all subsequent runs—even when new dumps are created.

Analysis:

  • On each run, the script checks if the $TMPDIR/sitemaps directory exists. If not, it creates it and generates sitemaps. If it already exists (from previous runs), it skips generating sitemaps!
  • However, the mere presence of the directory does not mean sitemaps are up-to-date or were successfully generated.
  • As a result, sitemaps are never re-generated unless --overwrite is passed (which wipes the directory), even when new dumps are produced.

Suggested Fix:

  • Instead of checking for the existence of the $TMPDIR/sitemaps directory, check for specific output files or always re-generate sitemaps when a new dump is created.
  • Or remove the directory at the start of each sitemap generation step to ensure it regenerates each time.

Repro:

  1. Run the script for the first time for a given dump date. Sitemaps are generated.
  2. Run the script again for another date without --overwrite. The sitemaps directory exists already, so the script logs "Skipping sitemaps" and nothing is produced.

Relevant code:

if [[ ! -d $TMPDIR/sitemaps ]]
then
    log "generating sitemaps"
    mkdir -p $TMPDIR/sitemaps
    rm -fr $TMPDIR/sitemaps/*
    cd $TMPDIR/sitemaps
    time python $SCRIPTS/sitemaps/sitemap.py $TMPDIR/dumps/$dump/$dump.txt.gz > sitemaps.log
    # rm -fr $TMPDIR/sitemaps
    ls -lh
else
    log "Skipping sitemaps"
fi

Testing:

We ran the oldump.sh script without --archive or --overwrite using the existing data for 2025-11-30

Evidence data exists:
ol-home0:/1/var/tmp has data.txt.gz, ol_cdump_2025-11-30/ and ol_dump_2025-11-30/

Command:

PSQL_PARAMS=<REDACTED> TMPDIR='/1/var/tmp' OL_CONFIG='/olsystem/etc/openlibrary.yml' /openlibrary/scripts/oldump.sh 2025-11-30

Immediately completed, said sitemap skipped.

We manually created the sitemaps directory and re-ran with success.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@mekarpeles There is a bug in the sitemap generation logic of `scripts/oldump.sh` that can cause sitemaps to only ever be generated on the first run (or when `--overwrite` is specified), and then skipped on all subsequent runs—even when new dumps are created.

Analysis:

  • On each run, the script checks if the $TMPDIR/sitemaps directory exists. If not, it creates it and generates sitemaps. If it already exists (from previous runs), it skips generating sitemaps!
  • However, the mere presence of the directory does not mean sitemaps are up-to-date or were successfully generated.
  • As a result, sitemaps are never re-generated unless --overwrite is passed (which wipes the directory), even when new dumps are produced.

Suggested Fix:

  • Instead of checking for the existence of the $TMPDIR/sitemaps directory, check for specific output files or always re-generate sitemaps when a new dump is created.
  • Or remove the directory at the start of each sitemap generation step to ensure it regenerates each time.

Repro:

  1. Run the script for the first time for a given dump date. Sitemaps are generated.
  2. Run the script again for another date without --overwrite. The sitemaps directory exists already, so the script logs "Skipping sitemaps" and nothing is produced.

Relevant code:
https://github.com/internetarchive/openlibrary/blob/f89d394b2d7e193657b0a02b3893f824791bd23f/s...



Remove conditional check that prevented sitemap regeneration on subsequent runs.
The script now always removes and recreates the sitemaps directory, ensuring
sitemaps are generated for each dump run instead of only on the first run.

Fixes issue where sitemaps were only generated when --overwrite was specified
or on the very first run, causing 404 errors on sitemap URLs.

Co-authored-by: mekarpeles <978325+mekarpeles@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix sitemaps generation issue in OpenLibrary" to "Fix sitemap generation skipped on subsequent dump runs" Dec 24, 2025
Copilot AI requested a review from mekarpeles December 24, 2025 18:43
@mekarpeles
Member

@copilot I don't think we want to generate sitemaps if the directory already exists (and it might exist). Do we still need a guard but basically only do sitemaps if the directory doesn't exist? Was that the problem?

Contributor Author

Copilot AI commented Dec 24, 2025

@copilot I don't think we want to generate sitemaps if the directory already exists (and it might exist). Do we still need a guard but basically only do sitemaps if the directory doesn't exist? Was ...

You're right to question this. The original bug was that the directory check prevented regeneration when running for NEW dump dates.

The issue: sitemaps directory isn't date-specific, so once it exists from 2025-11-01, it blocks generation for 2025-12-01.

I see three options:

  1. Current fix: Always regenerate (simple but potentially wasteful)
  2. Smarter guard: Check if sitemaps exist for THIS specific dump
  3. Date-specific dir: Use $TMPDIR/sitemaps_$yyyymmdd

Which approach do you prefer? The original skip logic would work fine if we check for date-specific sitemap files or use a date-specific directory.
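Option 2 from the list above could look roughly like the sketch below. It is an illustration under stated assumptions, not the fix that was eventually committed: `WORK` stands in for `$TMPDIR`, `generate_sitemaps` is a hypothetical stub for the real generator, and the marker file name `.built_from_<dump>` is invented for this example.

```shell
#!/usr/bin/env bash
# Sketch only: the marker file name and the stub generator are assumptions.
set -euo pipefail

WORK=$(mktemp -d)          # scratch area standing in for $TMPDIR

generate_sitemaps() {      # hypothetical stub for the real generator
    echo "$1" > "$2/built_from.txt"
}

# Skip only when sitemaps were already built for THIS dump.
sitemaps_step() {
    local dump=$1
    local dir="$WORK/sitemaps"
    local marker="$dir/.built_from_$dump"   # hypothetical marker name

    if [[ -f "$marker" ]]; then
        echo "skipped"
        return 0
    fi

    rm -rf "$dir"
    mkdir -p "$dir"
    generate_sitemaps "$dump" "$dir" || return 1
    touch "$marker"        # written last, so a failed run leaves no marker
    echo "generated"
}

first=$(sitemaps_step ol_dump_2025-11-01)
again=$(sitemaps_step ol_dump_2025-11-01)
next=$(sitemaps_step ol_dump_2025-12-01)

echo "$first $again $next"
rm -rf "$WORK"
```

Because the marker is tied to the dump name, a rerun for the same dump is skipped while a new dump date triggers regeneration. Option 3 would get the same effect by putting the date in the directory name instead of in a marker file.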

github-actions bot added the "Needs: Response" label (Issues which require feedback from lead) Dec 25, 2025
@mekarpeles mekarpeles closed this Dec 25, 2025
mekarpeles added a commit that referenced this pull request Dec 25, 2025
Fixes a bug where absence of sitemap directory is preventing sitemaps from being run after `--overwrite` deletes `sitemaps/` directory.

Closes #7581, correctly implements #11625
mystic-06 pushed a commit to mystic-06/openlibrary that referenced this pull request Jan 11, 2026
Fixes a bug where absence of sitemap directory is preventing sitemaps from being run after `--overwrite` deletes `sitemaps/` directory.

Closes internetarchive#7581, correctly implements internetarchive#11625

Labels

Needs: Response (Issues which require feedback from lead)

Development

Successfully merging this pull request may close these issues.

Sitemaps not generating

2 participants