
Automate DOAB harvesting on same cadence as doab-check #1129

@rdhyee

Background

The DOAB OAI harvester (django-admin load_doab) on unglue.it prod currently has no automated schedule. It runs only via manual SSH invocation. The captured bash history on prod shows a 2-week-window cadence maintained by hand (Oct 2025 → Feb 2026), after which it stopped logging entirely.

As of 2026-04-23, the last run logged in ~/load_doab.txt on prod ended 2026-02-17 — roughly 10 weeks of DOAB updates not reflected before today's catch-up run. (A smaller, unlogged harvest appears to have run mid-March, but neither the invocation nor the result ended up in load_doab.txt.)

Eric's direction (per meeting with Raymond): unglue.it DOAB harvesting should run on the same cadence as doab-check.

What's needed

  1. Confirm doab-check's current cadence (daily? weekly? — doab-check runs on DigitalOcean, not AWS; Eric to confirm)
  2. Add a scheduled task to unglue.it matching that cadence
  3. Log results in a way that doesn't silently fail (see related doab-check#11 "Nightly harvest cron failing silently")

Options for the schedule

Option A — Celery Beat periodic task

Register in core/tasks.py:

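from celery.schedules import crontab
from celery.task import periodic_task  # legacy decorator API; removed in Celery 5

@periodic_task(run_every=crontab(...))  # match doab-check
def load_doab_scheduled():
    """Harvest DOAB updates on doab-check cadence. See #XXXX."""
    from django.core.management import call_command
    from datetime import date, timedelta
    # ... compute from_date based on last-run marker file or fixed window
    # Placeholder: fixed 2-week window; the date format load_doab expects is unverified.
    from_date = (date.today() - timedelta(days=14)).isoformat()
    call_command('load_doab', from_date, max=20000)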

Pros: auditable via Celery logs; Celery handles concurrency; no extra cron.
Cons: Celery periodic tasks have a silent-failure history on this codebase. Need explicit error notification (email ADMINS on exception — but rate-limited per #1128 when merged).
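
One way to address the silent-failure concern, whichever scheduler ends up running the task, is to wrap the task body so any exception is reported before it propagates. The sketch below is a hypothetical helper, not code from regluit: `notify_on_failure` and the `notify` callable are assumed names. On unglue.it the notifier could plausibly be Django's `mail_admins`, subject to the rate limiting in #1128.

```python
import functools
import traceback


def notify_on_failure(notify):
    """Decorator factory: report any exception via notify(subject, body), then re-raise.

    `notify` is a hypothetical callable; in a Django project it could be
    django.core.mail.mail_admins, which takes (subject, message).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                notify(f"{func.__name__} failed", traceback.format_exc())
                raise  # keep the failure visible to the scheduler as well
        return wrapper
    return decorator


if __name__ == "__main__":
    sent = []

    @notify_on_failure(lambda subject, body: sent.append((subject, body)))
    def harvest():
        raise RuntimeError("OAI endpoint unreachable")

    try:
        harvest()
    except RuntimeError:
        pass
    print(sent[0][0])  # prints "harvest failed"
```

Re-raising after the notification means the scheduler's own logs still record the failure, so the alert is an addition to the audit trail and not a replacement for it.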

Option B — Plain cron entry (Ansible-managed)

# /etc/cron.d/regluit-doab (managed by regluit-provisioning)
0 4 * * <cadence> ubuntu cd / && /opt/regluit/venv/bin/django-admin load_doab --max=20000 --settings=regluit.settings.prod >> /var/log/regluit/doab-harvest.log 2>&1

Pros: simpler; visible in crontab; easy to reason about.
Cons: silent failures are common with cron (needs CronFailureCheck or similar). Unless redirected to its own file as above, output is mixed with everything else.

My lean: Option B (cron) with explicit logging to its own log file. Start simple, add alerting only when Sentry/Tier 3 monitoring lands (see EbookFoundation/regluit-provisioning#28 for the broader monitoring direction).
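
If Option B is chosen, a thin wrapper can make each run leave a start line, the command output, and an explicit exit status in its own log, so a failed or missing run is detectable by inspection. This is a minimal sketch, assuming a hypothetical `run_and_log` helper; the real command and log path would be the ones in the cron entry above.

```python
import subprocess
import sys
from datetime import datetime, timezone


def run_and_log(cmd, log_path):
    """Run cmd, appending its combined output and exit status to log_path.

    Returns the command's exit code so cron (or a monitor) can act on it.
    """
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"=== start {started} cmd={cmd!r}\n")
        log.flush()  # the header must land before the child's output
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
        finished = datetime.now(timezone.utc).isoformat(timespec="seconds")
        status = "OK" if result.returncode == 0 else "FAILED"
        log.write(f"=== end {finished} exit={result.returncode} {status}\n")
    return result.returncode
```

A start line with no matching end line indicates a run that was killed or hung, which a plain `>> logfile 2>&1` redirect cannot distinguish from a quiet success.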

Additional considerations

  • The load_doab command's default from_date behavior is unverified: what does it do if the argument is omitted? This needs checking before committing to a cron form that relies on the default (the Option B line above passes no date). Currently Raymond's manual pattern passes an explicit date.
  • State tracking: a marker file (e.g., ~/.doab_last_harvest) recording the last successful until date would let the scheduled task resume correctly without hard-coding windows.
  • Idempotency: the harvester appears idempotent — log output shows loaded N records (M new) patterns where M < N consistently. Safe to re-harvest overlapping windows.
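
The marker-file idea can be sketched as follows. This is a sketch under assumptions, not regluit code: the function names, the ISO date format stored in the marker, the 14-day fallback window, and the one-day overlap are all hypothetical choices. The overlap relies on the observation that re-harvesting overlapping windows appears safe.

```python
import os
import tempfile
from datetime import date, timedelta
from pathlib import Path


def read_from_date(marker, today, fallback_days=14, overlap_days=1):
    """Return the date to harvest from.

    Uses the last successful 'until' date in the marker file, minus a small
    overlap; falls back to a fixed window if the marker is missing or unreadable.
    """
    try:
        last_until = date.fromisoformat(Path(marker).read_text().strip())
    except (OSError, ValueError):
        return today - timedelta(days=fallback_days)
    return last_until - timedelta(days=overlap_days)


def write_marker(marker, until):
    """Atomically record the last successful 'until' date (call only on success)."""
    marker = Path(marker)
    fd, tmp = tempfile.mkstemp(dir=str(marker.parent), prefix=marker.name + ".")
    with os.fdopen(fd, "w") as handle:
        handle.write(until.isoformat() + "\n")
    os.replace(tmp, marker)  # atomic rename, so readers never see a partial file
```

Writing the marker only after a successful harvest means a failed run is retried from the same point on the next schedule tick, at the cost of re-reading some records.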

Related

  • Today's catch-up run (kicked off 2026-04-23 17:50 UTC): started from 2026-02-28 and still in progress. See ~/load_doab.txt on prod for the result.
  • Reliability PRs already merged: #1096, #1099, #1101, #1106, #1107, #1108 — the harvester is now robust enough to run unsupervised.
  • Provisioning monitoring issue: EbookFoundation/regluit-provisioning#28 — covers the "silent failure" detection story more broadly.

Blocked on

  • Eric's confirmation of doab-check's current cadence
  • Eric's preference between Option A (Celery) and Option B (cron)
