Background
The DOAB OAI harvester (django-admin load_doab) on unglue.it prod currently has no automated schedule. It runs only via manual SSH invocation. The captured bash history on prod shows a 2-week-window cadence maintained by hand (Oct 2025 → Feb 2026), after which it stopped logging entirely.
As of 2026-04-23, the last run logged in ~/load_doab.txt on prod ended 2026-02-17, leaving roughly 9 weeks of DOAB updates unreflected before today's catch-up run. (A smaller, unlogged harvest appears to have run mid-March, but neither the invocation nor the result ended up in load_doab.txt.)
Eric's direction (per meeting with Raymond): unglue.it DOAB harvesting should run on the same cadence as doab-check.
What's needed
- Confirm doab-check's current cadence (daily? weekly? — doab-check runs on DigitalOcean, not AWS; Eric to confirm)
- Add a scheduled task to unglue.it matching that cadence
- Log results in a way that doesn't silently fail (see related doab-check#11 "Nightly harvest cron failing silently")
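Whichever scheduler is chosen, one scheduler-independent way to catch a silent failure is a staleness check on a file that only a successful harvest touches. A minimal sketch, assuming such a file exists (the function name and the idea of checking mtime are illustrative, not something regluit or doab-check does today):

```python
import os
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if `path` is missing or was last modified too long ago.

    `path` is assumed to be a file that only a successful harvest updates
    (for example a log or marker file); that convention is hypothetical.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file counts as stale, so a harvest that
        # never ran is reported rather than ignored.
        return True
    return now - mtime > max_age_seconds
```

A separate check (cron job, monitoring probe, or a later Tier 3 hook) could call this with a threshold somewhat larger than the agreed cadence and alert when it returns True.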
Options for the schedule
Option A — Celery Beat periodic task
Register in core/tasks.py:
from celery.schedules import crontab
# periodic_task lives in celery.task, which exists in Celery 4 and earlier
# (removed in Celery 5); check the installed version before using this form.
from celery.task import periodic_task

@periodic_task(run_every=crontab(...))  # match doab-check
def load_doab_scheduled():
    """Harvest DOAB updates on doab-check cadence. See #XXXX."""
    from django.core.management import call_command
    from datetime import date, timedelta
    # ... compute from_date based on last-run marker file or fixed window
    call_command('load_doab', from_date, max=20000)
Pros: auditable via Celery logs; Celery handles concurrency; no extra cron.
Cons: Celery periodic tasks have a silent-failure history on this codebase. Need explicit error notification (email ADMINS on exception — but rate-limited per #1128 when merged).
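The explicit error notification mentioned in the cons could look roughly like the wrapper below. This is a sketch, not existing regluit code: `run_with_notification`, the `doab_harvest` logger name, and the injected `notify` callable are all hypothetical. In a Django context `notify` could be `django.core.mail.mail_admins`, which accepts `(subject, message)`.

```python
import logging
import traceback

logger = logging.getLogger("doab_harvest")  # hypothetical logger name


def run_with_notification(harvest, notify):
    """Run harvest(); on any exception, log it, notify, and re-raise.

    `harvest` is a zero-argument callable wrapping the real work.
    `notify` is a callable taking (subject, body).
    """
    try:
        return harvest()
    except Exception as exc:
        body = traceback.format_exc()
        logger.error("DOAB harvest failed: %s", exc)
        notify("DOAB harvest failed: %s" % exc, body)
        # Re-raise so Celery still records the task as failed.
        raise
```

Inside the periodic task, `harvest` would be a small lambda around the `call_command('load_doab', ...)` call. Re-raising keeps the failure visible in Celery's own logs in addition to the notification.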
Option B — Plain cron entry (Ansible-managed)
# /etc/cron.d/regluit-doab (managed by regluit-provisioning)
0 4 * * <cadence> ubuntu cd / && /opt/regluit/venv/bin/django-admin load_doab --max=20000 --settings=regluit.settings.prod >> /var/log/regluit/doab-harvest.log 2>&1
Pros: simpler; visible in crontab; easy to reason about.
Cons: silent-failure pattern common (need CronFailureCheck or similar). Output mixed with everything else.
My lean: Option B (cron) with explicit logging to its own log file. Start simple, add alerting only when Sentry/Tier 3 monitoring lands (see EbookFoundation/regluit-provisioning#28 for the broader monitoring direction).
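If Option B is chosen, a thin wrapper can make each run leave an unambiguous trace in its own log file, so a later check only has to look for the last END line. A minimal sketch, assuming a hypothetical wrapper function `run_and_log` and a START/END line format invented here for illustration:

```python
import datetime
import subprocess


def run_and_log(cmd, log_path):
    """Run `cmd`, appending its output and START/END lines to `log_path`.

    Returns the command's exit status. The START/END line format is an
    illustrative convention, not an existing regluit one.
    """
    utc = datetime.timezone.utc
    with open(log_path, "a") as log:
        start = datetime.datetime.now(utc).isoformat()
        log.write("START %s %s\n" % (start, " ".join(cmd)))
        # Flush before handing the file to the child so lines stay ordered.
        log.flush()
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
        end = datetime.datetime.now(utc).isoformat()
        status = "OK" if result.returncode == 0 else "FAILED"
        log.write("END %s %s exit=%d\n" % (end, status, result.returncode))
    return result.returncode
```

The cron entry would then invoke a small script that calls `run_and_log` with the `django-admin load_doab ...` argument list and exits with the returned status. This only records failures; alerting on them is still left to whatever monitoring lands later.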
Additional considerations
- The load_doab command's default from_date behavior: if omitted, what does it do? This needs verification before committing to a cron form that relies on default behavior (the Option B entry above omits from_date). Currently Raymond's manual pattern passes an explicit date.
- State tracking: a marker file (e.g., ~/.doab_last_harvest) recording the last successful until date would let the scheduled task resume correctly without hard-coding windows.
- Idempotency: the harvester appears idempotent. Log output shows loaded N records (M new) patterns where M < N consistently, so it should be safe to re-harvest overlapping windows.
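The marker-file idea can be sketched as follows. Everything here is an assumption for illustration: the helper names, the ISO date stored in the file, the one-day overlap (leaning on the apparent idempotency), and the 14-day fallback window that mirrors the manual 2-week cadence.

```python
import datetime
import os


def read_from_date(marker_path, today, overlap_days=1, default_window_days=14):
    """Return the date the next harvest should start from.

    If the marker file is missing, fall back to a fixed window before
    `today`. A marker with unparseable contents raises ValueError so the
    problem is loud rather than silently skipped.
    """
    try:
        with open(marker_path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        return today - datetime.timedelta(days=default_window_days)
    last_until = datetime.date.fromisoformat(text)
    # Re-harvest a small overlap; assumed safe if the harvester is idempotent.
    return last_until - datetime.timedelta(days=overlap_days)


def write_marker(marker_path, until):
    """Record `until` as the last successful harvest date, atomically."""
    tmp_path = marker_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(until.isoformat() + "\n")
    os.replace(tmp_path, marker_path)
```

The scheduled task would call `read_from_date` before the harvest and `write_marker` only after load_doab returns successfully. How the resulting date is formatted for load_doab's from_date argument is not specified here and would need to match what the command expects.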
Related
- Today's catch-up run (kicked off 2026-04-23 17:50 UTC): started from 2026-02-28, still in progress. See ~/load_doab.txt on prod for the result.
- Reliability PRs already merged: #1096, #1099, #1101, #1106, #1107, #1108 — the harvester is now robust enough to run unsupervised.
- Provisioning monitoring issue: EbookFoundation/regluit-provisioning#28 — covers the "silent failure" detection story more broadly.
Blocked on
- Eric's confirmation of doab-check's current cadence
- Eric's preference between Option A (Celery) and Option B (cron)