
Reindexing queueing for touched DOIs #82

Merged
codycooperross merged 4 commits into main from doi-reindex
Mar 10, 2026
Conversation


@codycooperross codycooperross commented Feb 27, 2026

Purpose

This PR adds a rake task that retrieves events for a day or range of days that are relevant to lupo indexed documents (like views, downloads, citations, etc.). It then sends messages to lupo to queue reindexing for touched DOIs. To avoid performance impacts, events are pulled by updated_at in day increments, only relevant events are selected, and DOI IDs are de-duplicated before being sent to lupo. Companion lupo PR here: datacite/lupo#1492

Related to: datacite/datacite#2498

Approach

Open Questions and Pre-Merge TODOs

Learning

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

  • New feature (non-breaking change which adds functionality)

  • Breaking change (fix or feature that would cause existing functionality to change)

Reviewer, please remember our guidelines:

  • Be humble in the language and feedback you give; ask, don't tell.
  • Consider using positive language rather than neutral language when offering feedback, since neutral language can come across as negative.
  • Offer suggestions on how to improve code, e.g. simplification or expanded clarity.
  • Give reasons for the changes you are proposing.

Summary by CodeRabbit

  • New Features

    • Command-line reindexing task to identify and process unique DOIs touched within a specified date range, with date validation and sensible defaults.
    • Expanded relation tracking: additional relation types now affect counting and target tracking.
  • Chores

    • Prepared messaging helper for DOI reindexing to support future queued indexing (no runtime queueing yet).


coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

Adds two relation-type constants, a class method that collects unique DOIs touched across an inclusive date range, a Rake task to invoke that method with date validation, and a new SQS helper method for sending DOI reindex messages.

Changes

  • Relation type constants (app/constants/relation_types.rb): Added SOURCE_RELATION_TYPES = ["references", "parts", "versions"].freeze and TARGET_RELATION_TYPES = ["views", "downloads", "citations", "part_of", "version_of"].freeze.
  • DOI collection logic (app/models/event.rb): Added Event.reindex_touched_dois(start_date:, end_date:, threads: 10), which iterates each date in the inclusive range, queries source_doi and target_doi rows filtered by the new relation-type constants and updated_at on that date, accumulates unique DOIs, and returns the total count; contains a commented parallel/enqueue block.
  • Rake task & SQS helper (lib/tasks/event.rake, app/utilities/sqs_utilities.rb): Added an event:reindex_touched_dois Rake task that parses START_DATE (required) and END_DATE (optional, defaulting to START_DATE), validates date order, calls Event.reindex_touched_dois, and prints the count; added SqsUtilities.send_events_doi_index_message(data) to send DOI reindex messages to the lupo_background queue.
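The task's START_DATE/END_DATE handling described above can be sketched in plain Ruby; the helper name and error messages below are illustrative, not the repo's exact code:

```ruby
require "date"

# Illustrative sketch of the rake task's ENV parsing and validation:
# START_DATE is required, END_DATE defaults to START_DATE, and the
# date order is checked. Names and messages are assumptions.
def parse_task_dates(env)
  raise "START_DATE is required (YYYY-MM-DD)" if env["START_DATE"].to_s.empty?

  start_date = Date.parse(env["START_DATE"])
  raw_end    = env["END_DATE"].to_s.empty? ? env["START_DATE"] : env["END_DATE"]
  end_date   = Date.parse(raw_end)
  raise "END_DATE must not be before START_DATE" if end_date < start_date

  [start_date, end_date]
end
```

Calling `parse_task_dates({"START_DATE" => "2025-06-11"})` returns the same date for both bounds, matching the task's documented default.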

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title "Reindexing queueing for touched DOIs" accurately describes the main changes: adding constants for relation types, a method to identify and reindex touched DOIs across a date range, and infrastructure to queue reindexing jobs via SQS and Rake tasks.
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/models/event.rb`:
- Around line 146-157: Move the Set initialization out of the per-day loop so
DOIs are accumulated uniquely across the entire start_date..end_date range
(i.e., initialize dois = Set.new before
start_date.to_date.upto(end_date.to_date)), and when collecting DOIs from the
two queries (the distinct pluck of :source_doi and :target_doi filtered by
updated_at and RelationTypes::... constants) filter out nil/blank values (e.g.,
reject(&:blank?) or compact + reject('')) before inserting into the set; finally
compute total = dois.size after the loop instead of incrementing per day so the
count reflects unique, non-blank DOIs for the whole range.
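The fix this prompt describes (one set accumulated across the whole range, with blanks rejected) can be sketched in plain Ruby using an in-memory stand-in for the Event rows; the method name and hash keys are illustrative:

```ruby
require "date"
require "set"

# Sketch of the suggested fix: accumulate DOIs into one Set across the
# whole start_date..end_date range, rejecting nil/blank values, so a DOI
# touched on several days is counted once. `events` stands in for the
# ActiveRecord rows; keys are illustrative, not the real schema objects.
def unique_touched_dois(events, start_date, end_date)
  dois = Set.new
  start_date.upto(end_date) do |date|
    events.select { |e| e[:updated_at] == date }.each do |e|
      [e[:source_doi], e[:target_doi]].each do |doi|
        dois << doi unless doi.nil? || doi.strip.empty?
      end
    end
  end
  dois.size
end
```

With this shape, the returned count is the number of unique, non-blank DOIs for the whole range rather than a per-day sum.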
- Around line 143-150: Convert the singleton method definition to use class <<
self for reindex_touched_dois, and refactor the two long query chains that push
DOIs into the local Set (the lines invoking where(updated_at:
date.all_day).where(source_relation_type_id:
RelationTypes::SOURCE_RELATION_TYPES).distinct.pluck(:source_doi) and the
analogous target_doi line) into multi-line method chains so each chained call
sits on its own indented line and stays below 120 characters; keep the logic
identical (build dois Set, iterate dates, pluck and << each doi) but replace def
self.reindex_touched_dois with class << self / end and break the long
where/.../pluck expressions across multiple lines using one where per line
before calling distinct and pluck.

In `@lib/tasks/event.rake`:
- Line 53: The task prints "Sent #{count} unique DOIs for re-indexing." but
Event.reindex_touched_dois currently only computes/counts DOIs (enqueue calls
are commented out), so update the task to accurately reflect reality: either
enable sending by uncommenting/restoring the enqueue logic inside
Event.reindex_touched_dois (so the message "Sent ..." is truthful) or change the
message to "Computed #{count} unique DOIs for re-indexing (no messages
enqueued)" or similar; reference Event.reindex_touched_dois to locate the
enqueue calls to restore, or update the puts line in lib/tasks/event.rake to use
wording that does not imply messages were sent.
- Around line 48-51: The multiline call to Event.reindex_touched_dois with
keyword args start_date and end_date is missing a trailing comma; update the
call (the invocation of Event.reindex_touched_dois) to include a trailing comma
after the end_date argument to satisfy Style/TrailingCommaInArguments (e.g.,
ensure the argument list ends with a comma on the last line).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 7e944d6 and b9259d3.

📒 Files selected for processing (4)
  • app/constants/relation_types.rb
  • app/models/event.rb
  • app/utilities/sqs_utilities.rb
  • lib/tasks/event.rake


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
lib/tasks/event.rake (1)

43-44: Consider adding user-friendly error handling for invalid date formats.

Date.parse raises ArgumentError with a generic message on invalid input (e.g., "2025-13-01"). For operator-facing tasks, wrapping in a rescue with a clearer message can improve usability.

💡 Optional improvement
-    start_date = Date.parse(ENV["START_DATE"])
-    end_date   = Date.parse(ENV["END_DATE"].presence || ENV["START_DATE"])
+    begin
+      start_date = Date.parse(ENV["START_DATE"])
+      end_date   = Date.parse(ENV["END_DATE"].presence || ENV["START_DATE"])
+    rescue ArgumentError
+      raise "Invalid date format. Use YYYY-MM-DD (e.g., START_DATE=2025-06-11)"
+    end
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/tasks/event.rake` around lines 43 - 44, Wrap the Date.parse calls for
start_date and end_date in a rescue for ArgumentError so invalid formats produce
a clear, operator-facing message; specifically, when parsing ENV["START_DATE"]
and ENV["END_DATE"] (used to set start_date and end_date) rescue ArgumentError
and abort or raise with a user-friendly message that includes the offending
value and expected format (e.g., "Invalid date for START_DATE: '...' — expected
YYYY-MM-DD"); ensure both start_date and end_date parsing use the same handling
so errors are consistent.


📥 Commits

Reviewing files that changed from the base of the PR and between b9259d3 and 0c8913a.

📒 Files selected for processing (3)
  • app/constants/relation_types.rb
  • app/models/event.rb
  • lib/tasks/event.rake
🚧 Files skipped from review as they are similar to previous changes (2)
  • app/models/event.rb
  • app/constants/relation_types.rb

@codycooperross codycooperross changed the title Test implementation of reindexing queueing for touched DOIs Reindexing queueing for touched DOIs Mar 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
app/models/event.rb (1)

145-167: ⚠️ Potential issue | 🟠 Major

Count uniqueness across the full date range and exclude blank DOIs.

Line 148 resets the set per day, so a DOI touched on multiple days is counted multiple times. Lines 150-157 can also include nil/blank DOI values.

Proposed fix
     def reindex_touched_dois(start_date:, end_date:, threads: 10)
-      total = 0
+      all_dois = Set.new

       start_date.to_date.upto(end_date.to_date) do |date|
-        dois = Set.new
+        daily_dois = Set.new

         where(updated_at: date.all_day)
           .where(source_relation_type_id: RelationTypes::SOURCE_RELATION_TYPES)
-          .distinct.pluck(:source_doi)
-          .each { |doi| dois << doi }
+          .where.not(source_doi: [nil, ""])
+          .distinct
+          .pluck(:source_doi)
+          .each { |doi| daily_dois << doi }

         where(updated_at: date.all_day)
           .where(target_relation_type_id: RelationTypes::TARGET_RELATION_TYPES)
-          .distinct.pluck(:target_doi)
-          .each { |doi| dois << doi }
+          .where.not(target_doi: [nil, ""])
+          .distinct
+          .pluck(:target_doi)
+          .each { |doi| daily_dois << doi }

         # Test performance before enabling SQS queues.
-        # Parallel.each(dois.to_a, in_threads: threads) do |doi|
+        # Parallel.each(daily_dois.to_a, in_threads: threads) do |doi|
         #   SqsUtilities.send_events_doi_index_message(doi)
         # end

-        total += dois.size
+        all_dois.merge(daily_dois)
       end

-      total
+      all_dois.size
     end
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/models/event.rb` around lines 145 - 167, The current code resets the Set
per day and queries DOIs per-day, causing duplicates across the full range and
including nil/blank DOIs; to fix, build a single Set (dois) outside the loop and
query once for the whole range (use updated_at between start_date and end_date)
for both source_doi and target_doi, filter out nil/blank values before adding to
the Set (or use compact/reject/blank checks), and then return total as
dois.size; refer to start_date, end_date, dois, source_doi, target_doi, and
RelationTypes::SOURCE_RELATION_TYPES / RelationTypes::TARGET_RELATION_TYPES to
locate and update the logic.
🧹 Nitpick comments (1)
app/utilities/sqs_utilities.rb (1)

9-11: Add spec coverage for send_events_doi_index_message.

This new public sender should have a focused spec (queue name + shoryuken_class + body serialization), similar to existing coverage for send_events_other_doi_job_message.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utilities/sqs_utilities.rb` around lines 9 - 11, Add a focused spec for
the new public method send_events_doi_index_message: exercise
send_events_doi_index_message with sample payload and assert that it calls
send_message (or enqueues) with queue_name "lupo_background", shoryuken_class
"ReindexByDoiJob", and that the body is serialized correctly (match JSON
structure or stringified payload) — mirror the style of the existing spec for
send_events_other_doi_job_message to validate queue, class, and payload
serialization.
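The assertions that spec would make can be sketched with a hand-rolled double; FakeSqs and the send_message keyword signature below are assumptions inferred from the review comments, not the repo's real test setup:

```ruby
require "json"

# Hand-rolled double sketching what the suggested spec would assert:
# queue name, shoryuken class, and body serialization. FakeSqs and the
# send_message signature are assumptions, not the repo's real code.
class FakeSqs
  attr_reader :last_message

  def send_events_doi_index_message(data)
    send_message(data, shoryuken_class: "ReindexByDoiJob", queue_name: "lupo_background")
  end

  private

  def send_message(data, shoryuken_class:, queue_name:)
    @last_message = {
      body: JSON.generate(data),
      shoryuken_class: shoryuken_class,
      queue_name: queue_name,
    }
  end
end
```

A real spec would stub send_message on SqsUtilities the same way the existing send_events_other_doi_job_message coverage does, then assert on the captured queue, class, and serialized body.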

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1735178c-e8ad-4e34-b307-bc840ade018e

📥 Commits

Reviewing files that changed from the base of the PR and between 8097820 and 94a5543.

📒 Files selected for processing (2)
  • app/models/event.rb
  • app/utilities/sqs_utilities.rb

Comment on lines +9 to +11
def send_events_doi_index_message(data)
  send_message(data, shoryuken_class: "ReindexByDoiJob", queue_name: "lupo_background")
end

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Shoryuken queue groups =="
rg -n '^\s*groups:|^\s*[a-zA-Z_]+:|^\s*queues:|^\s*-\s*[a-zA-Z0-9_]+$' config/shoryuken.yml

echo
echo "== References to lupo_background =="
rg -nP '\blupo_background\b|_lupo_background\b' app config lib spec || true

echo
echo "== ReindexByDoiJob definitions/usages =="
rg -nP '\bclass\s+ReindexByDoiJob\b|\bReindexByDoiJob\b' app lib spec || true

Repository: datacite/events

Length of output: 638


Define lupo_background queue and ReindexByDoiJob worker, or document external handling.

Line 10 enqueues to lupo_background with ReindexByDoiJob, but neither the queue nor the worker class exist in the repository. Messages will fail silently unless this is explicitly handled by an external system. Either add the queue to config/shoryuken.yml, implement the ReindexByDoiJob class, or document that this route is managed externally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utilities/sqs_utilities.rb` around lines 9 - 11,
send_events_doi_index_message enqueues to a non-existent queue/class which will
cause silent failures; either add the "lupo_background" queue to your Shoryuken
config (config/shoryuken.yml) and register it, or implement the worker class
ReindexByDoiJob (with a perform method) to process messages, or add clear
documentation in the repo stating this queue/job are handled externally; update
send_events_doi_index_message tests if needed to reference the implemented
ReindexByDoiJob and ensure the queue name "lupo_background" is present in
shoryuken config.
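If the worker were to live in this repo, a minimal ReindexByDoiJob could look like the sketch below. The Shoryuken wiring is shown only as comments, because the queue registration and the perform contract are assumptions drawn from this review comment, not existing code:

```ruby
require "json"

# Hypothetical worker sketch for the missing ReindexByDoiJob. In the real
# app this would `include Shoryuken::Worker` and declare
# `shoryuken_options queue: "lupo_background", auto_delete: true`; the
# payload shape ({"doi" => ...}) is an assumption from the review comment.
class ReindexByDoiJob
  def perform(_sqs_msg, body)
    payload = body.is_a?(String) ? JSON.parse(body) : body
    doi = payload.fetch("doi")
    # ...trigger the actual reindex for `doi` here...
    doi
  end
end
```

Until either this worker exists or the lupo side documents that it consumes lupo_background, messages sent by send_events_doi_index_message have no registered consumer in this repo.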
