Reindexing queueing for touched DOIs #82
Conversation
📝 Walkthrough
Adds two relation-type constants, a class method that collects unique DOIs touched across an inclusive date range, a Rake task to invoke that method with date validation, and a new SQS helper method for sending DOI reindex messages.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/models/event.rb`:
- Around line 146-157: Move the Set initialization out of the per-day loop so
DOIs are accumulated uniquely across the entire start_date..end_date range
(i.e., initialize dois = Set.new before
start_date.to_date.upto(end_date.to_date)), and when collecting DOIs from the
two queries (the distinct pluck of :source_doi and :target_doi filtered by
updated_at and RelationTypes::... constants) filter out nil/blank values (e.g.,
reject(&:blank?) or compact + reject('')) before inserting into the set; finally
compute total = dois.size after the loop instead of incrementing per day so the
count reflects unique, non-blank DOIs for the whole range.
- Around line 143-150: Convert the singleton method definition to use class <<
self for reindex_touched_dois, and refactor the two long query chains that push
DOIs into the local Set (the lines invoking where(updated_at:
date.all_day).where(source_relation_type_id:
RelationTypes::SOURCE_RELATION_TYPES).distinct.pluck(:source_doi) and the
analogous target_doi line) into multi-line method chains so each chained call
sits on its own indented line and stays below 120 characters; keep the logic
identical (build dois Set, iterate dates, pluck and << each doi) but replace def
self.reindex_touched_dois with class << self / end and break the long
where/.../pluck expressions across multiple lines using one where per line
before calling distinct and pluck.
In `@lib/tasks/event.rake`:
- Line 53: The task prints "Sent #{count} unique DOIs for re-indexing." but
Event.reindex_touched_dois currently only computes/counts DOIs (enqueue calls
are commented out), so update the task to accurately reflect reality: either
enable sending by uncommenting/restoring the enqueue logic inside
Event.reindex_touched_dois (so the message "Sent ..." is truthful) or change the
message to "Computed #{count} unique DOIs for re-indexing (no messages
enqueued)" or similar; reference Event.reindex_touched_dois to locate the
enqueue calls to restore, or update the puts line in lib/tasks/event.rake to use
wording that does not imply messages were sent.
- Around line 48-51: The multiline call to Event.reindex_touched_dois with
keyword args start_date and end_date is missing a trailing comma; update the
call (the invocation of Event.reindex_touched_dois) to include a trailing comma
after the end_date argument to satisfy Style/TrailingCommaInArguments (e.g.,
ensure the argument list ends with a comma on the last line).
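Taken together, the two `event.rb` suggestions above amount to accumulating into a single set across the whole range and rejecting blank values before counting. A minimal sketch, with plain arrays standing in for the per-day `pluck` results (the method name and sample DOIs are illustrative, not the repository's code):

```ruby
require "set"

# Hedged sketch of the suggested fix: one Set accumulates DOIs across the
# whole range, and nil/blank values are rejected before insertion. Each
# element of daily_results simulates one day's [source_dois, target_dois]
# pair, which the real method would pluck from ActiveRecord.
def count_unique_dois(daily_results)
  dois = Set.new
  daily_results.each do |source_dois, target_dois|
    (source_dois + target_dois)
      .reject { |doi| doi.nil? || doi.strip.empty? }
      .each { |doi| dois << doi }
  end
  dois.size # unique, non-blank DOIs for the whole range
end

days = [
  [["10.5072/a", nil], ["10.5072/b"]], # day 1: nil is dropped
  [["10.5072/a"], ["", "10.5072/c"]],  # day 2: "a" repeats, "" is dropped
]
puts count_unique_dois(days) # => 3
```

With the per-day set of the original code, "10.5072/a" would have been counted twice; accumulating into one set yields 3.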
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (4)
- app/constants/relation_types.rb
- app/models/event.rb
- app/utilities/sqs_utilities.rb
- lib/tasks/event.rake
🧹 Nitpick comments (1)
lib/tasks/event.rake (1)
43-44: Consider adding user-friendly error handling for invalid date formats.
`Date.parse` raises `ArgumentError` with a generic message on invalid input (e.g., `"2025-13-01"`). For operator-facing tasks, wrapping in a rescue with a clearer message can improve usability.
💡 Optional improvement
```diff
-start_date = Date.parse(ENV["START_DATE"])
-end_date = Date.parse(ENV["END_DATE"].presence || ENV["START_DATE"])
+begin
+  start_date = Date.parse(ENV["START_DATE"])
+  end_date = Date.parse(ENV["END_DATE"].presence || ENV["START_DATE"])
+rescue ArgumentError
+  raise "Invalid date format. Use YYYY-MM-DD (e.g., START_DATE=2025-06-11)"
+end
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/tasks/event.rake` around lines 43 - 44, Wrap the Date.parse calls for start_date and end_date in a rescue for ArgumentError so invalid formats produce a clear, operator-facing message; specifically, when parsing ENV["START_DATE"] and ENV["END_DATE"] (used to set start_date and end_date) rescue ArgumentError and abort or raise with a user-friendly message that includes the offending value and expected format (e.g., "Invalid date for START_DATE: '...' — expected YYYY-MM-DD"); ensure both start_date and end_date parsing use the same handling so errors are consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@lib/tasks/event.rake`:
- Around line 43-44: Wrap the Date.parse calls for start_date and end_date in a
rescue for ArgumentError so invalid formats produce a clear, operator-facing
message; specifically, when parsing ENV["START_DATE"] and ENV["END_DATE"] (used
to set start_date and end_date) rescue ArgumentError and abort or raise with a
user-friendly message that includes the offending value and expected format
(e.g., "Invalid date for START_DATE: '...' — expected YYYY-MM-DD"); ensure both
start_date and end_date parsing use the same handling so errors are consistent.
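A runnable version of that guard, with a plain hash standing in for `ENV` and the `.presence` call simplified to `||` (names and the message text follow the suggestion above but are otherwise assumptions):

```ruby
require "date"

# Parse both dates, rescue the generic ArgumentError, and re-raise with an
# operator-facing message. A hash simulates ENV so this runs standalone.
def parse_task_dates(env)
  start_date = Date.parse(env["START_DATE"])
  end_date = Date.parse(env["END_DATE"] || env["START_DATE"])
  [start_date, end_date]
rescue ArgumentError
  raise ArgumentError, "Invalid date format. Use YYYY-MM-DD (e.g., START_DATE=2025-06-11)"
end

p parse_task_dates("START_DATE" => "2025-06-11")
begin
  parse_task_dates("START_DATE" => "2025-13-01") # month 13 is invalid
rescue ArgumentError => e
  puts e.message # the friendly message, not Date.parse's generic one
end
```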
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (3)
- app/constants/relation_types.rb
- app/models/event.rb
- lib/tasks/event.rake
🚧 Files skipped from review as they are similar to previous changes (2)
- app/models/event.rb
- app/constants/relation_types.rb
Actionable comments posted: 1
♻️ Duplicate comments (1)
app/models/event.rb (1)
145-167: ⚠️ Potential issue | 🟠 Major
Count uniqueness across the full date range and exclude blank DOIs.
Line 148 resets the set per day, so a DOI touched on multiple days is counted multiple times. Lines 150-157 can also include nil/blank DOI values.
Proposed fix
```diff
 def reindex_touched_dois(start_date:, end_date:, threads: 10)
-  total = 0
+  all_dois = Set.new
   start_date.to_date.upto(end_date.to_date) do |date|
-    dois = Set.new
+    daily_dois = Set.new
     where(updated_at: date.all_day)
       .where(source_relation_type_id: RelationTypes::SOURCE_RELATION_TYPES)
-      .distinct.pluck(:source_doi)
-      .each { |doi| dois << doi }
+      .where.not(source_doi: [nil, ""])
+      .distinct
+      .pluck(:source_doi)
+      .each { |doi| daily_dois << doi }
     where(updated_at: date.all_day)
       .where(target_relation_type_id: RelationTypes::TARGET_RELATION_TYPES)
-      .distinct.pluck(:target_doi)
-      .each { |doi| dois << doi }
+      .where.not(target_doi: [nil, ""])
+      .distinct
+      .pluck(:target_doi)
+      .each { |doi| daily_dois << doi }
     # Test performance before enabling SQS queues.
-    # Parallel.each(dois.to_a, in_threads: threads) do |doi|
+    # Parallel.each(daily_dois.to_a, in_threads: threads) do |doi|
     #   SqsUtilities.send_events_doi_index_message(doi)
     # end
-    total += dois.size
+    all_dois.merge(daily_dois)
   end
-  total
+  all_dois.size
 end
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/models/event.rb` around lines 145 - 167, The current code resets the Set per day and queries DOIs per-day, causing duplicates across the full range and including nil/blank DOIs; to fix, build a single Set (dois) outside the loop and query once for the whole range (use updated_at between start_date and end_date) for both source_doi and target_doi, filter out nil/blank values before adding to the Set (or use compact/reject/blank checks), and then return total as dois.size; refer to start_date, end_date, dois, source_doi, target_doi, and RelationTypes::SOURCE_RELATION_TYPES / RelationTypes::TARGET_RELATION_TYPES to locate and update the logic.
🧹 Nitpick comments (1)
app/utilities/sqs_utilities.rb (1)
9-11: Add spec coverage for `send_events_doi_index_message`.
This new public sender should have a focused spec (queue name + `shoryuken_class` + body serialization), similar to existing coverage for `send_events_other_doi_job_message`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/utilities/sqs_utilities.rb` around lines 9 - 11, Add a focused spec for the new public method send_events_doi_index_message: exercise send_events_doi_index_message with sample payload and assert that it calls send_message (or enqueues) with queue_name "lupo_background", shoryuken_class "ReindexByDoiJob", and that the body is serialized correctly (match JSON structure or stringified payload) — mirror the style of the existing spec for send_events_other_doi_job_message to validate queue, class, and payload serialization.
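The assertions such a spec should make can be sketched without RSpec, using a hand-rolled recording stand-in for the utility module (the module below mirrors the reviewed method but is not the repository's actual implementation):

```ruby
# Stand-in for the real utility module: send_message is replaced with a
# recorder so the assertions below can inspect what would be enqueued.
module SqsUtilities
  module_function

  def sent_messages
    @sent_messages ||= []
  end

  def send_message(data, shoryuken_class:, queue_name:)
    sent_messages << { data: data, shoryuken_class: shoryuken_class, queue_name: queue_name }
  end

  # Mirrors the method under review.
  def send_events_doi_index_message(data)
    send_message(data, shoryuken_class: "ReindexByDoiJob", queue_name: "lupo_background")
  end
end

# The three properties the suggested spec should pin down:
SqsUtilities.send_events_doi_index_message("10.5072/example")
msg = SqsUtilities.sent_messages.last
raise unless msg[:queue_name] == "lupo_background"
raise unless msg[:shoryuken_class] == "ReindexByDoiJob"
raise unless msg[:data] == "10.5072/example"
```

An RSpec version would express the same checks with a message expectation on `send_message`.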
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/utilities/sqs_utilities.rb`:
- Around line 9-11: send_events_doi_index_message enqueues to a non-existent
queue/class which will cause silent failures; either add the "lupo_background"
queue to your Shoryuken config (config/shoryuken.yml) and register it, or
implement the worker class ReindexByDoiJob (with a perform method) to process
messages, or add clear documentation in the repo stating this queue/job are
handled externally; update send_events_doi_index_message tests if needed to
reference the implemented ReindexByDoiJob and ensure the queue name
"lupo_background" is present in shoryuken config.
---
Duplicate comments:
In `@app/models/event.rb`:
- Around line 145-167: The current code resets the Set per day and queries DOIs
per-day, causing duplicates across the full range and including nil/blank DOIs;
to fix, build a single Set (dois) outside the loop and query once for the whole
range (use updated_at between start_date and end_date) for both source_doi and
target_doi, filter out nil/blank values before adding to the Set (or use
compact/reject/blank checks), and then return total as dois.size; refer to
start_date, end_date, dois, source_doi, target_doi, and
RelationTypes::SOURCE_RELATION_TYPES / RelationTypes::TARGET_RELATION_TYPES to
locate and update the logic.
---
Nitpick comments:
In `@app/utilities/sqs_utilities.rb`:
- Around line 9-11: Add a focused spec for the new public method
send_events_doi_index_message: exercise send_events_doi_index_message with
sample payload and assert that it calls send_message (or enqueues) with
queue_name "lupo_background", shoryuken_class "ReindexByDoiJob", and that the
body is serialized correctly (match JSON structure or stringified payload) —
mirror the style of the existing spec for send_events_other_doi_job_message to
validate queue, class, and payload serialization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1735178c-e8ad-4e34-b307-bc840ade018e
📒 Files selected for processing (2)
- app/models/event.rb
- app/utilities/sqs_utilities.rb
```ruby
def send_events_doi_index_message(data)
  send_message(data, shoryuken_class: "ReindexByDoiJob", queue_name: "lupo_background")
end
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Shoryuken queue groups =="
rg -n '^\s*groups:|^\s*[a-zA-Z_]+:|^\s*queues:|^\s*-\s*[a-zA-Z0-9_]+$' config/shoryuken.yml
echo
echo "== References to lupo_background =="
rg -nP '\blupo_background\b|_lupo_background\b' app config lib spec || true
echo
echo "== ReindexByDoiJob definitions/usages =="
rg -nP '\bclass\s+ReindexByDoiJob\b|\bReindexByDoiJob\b' app lib spec || true

Repository: datacite/events
Length of output: 638
Define lupo_background queue and ReindexByDoiJob worker, or document external handling.
Line 10 enqueues to lupo_background with ReindexByDoiJob, but neither the queue nor the worker class exist in the repository. Messages will fail silently unless this is explicitly handled by an external system. Either add the queue to config/shoryuken.yml, implement the ReindexByDoiJob class, or document that this route is managed externally.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/utilities/sqs_utilities.rb` around lines 9 - 11,
send_events_doi_index_message enqueues to a non-existent queue/class which will
cause silent failures; either add the "lupo_background" queue to your Shoryuken
config (config/shoryuken.yml) and register it, or implement the worker class
ReindexByDoiJob (with a perform method) to process messages, or add clear
documentation in the repo stating this queue/job are handled externally; update
send_events_doi_index_message tests if needed to reference the implemented
ReindexByDoiJob and ensure the queue name "lupo_background" is present in
shoryuken config.
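One way to close that gap is sketched below: a worker class named to match the `shoryuken_class` string, with `Shoryuken::Worker` stubbed so the snippet runs standalone (in the real app it would come from the shoryuken gem, and the `perform` body is only a placeholder). The `lupo_background` queue would still need registering in `config/shoryuken.yml`.

```ruby
# Minimal stub standing in for the shoryuken gem so this sketch runs on its
# own; a real app would `require "shoryuken"` instead.
module Shoryuken
  module Worker
    def self.included(base)
      base.extend(ClassMethods)
    end

    module ClassMethods
      attr_reader :options

      def shoryuken_options(opts)
        @options = opts
      end
    end
  end
end

# Hypothetical worker consuming the "lupo_background" queue. The class name
# matches the shoryuken_class sent by send_events_doi_index_message; the
# perform body is illustrative, not the real reindexing logic.
class ReindexByDoiJob
  include Shoryuken::Worker
  shoryuken_options queue: "lupo_background", auto_delete: true

  def perform(_sqs_msg, doi)
    # Placeholder for the real reindexing call (e.g. a search-index import).
    puts "Reindexing #{doi}"
  end
end
```

If the queue and worker instead live in the lupo application (as the companion PR suggests), documenting that in this repo would satisfy the review comment equally well.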
Purpose
This PR adds a rake task that retrieves events for a day or range of days that are relevant to lupo indexed documents (like views, downloads, citations, etc.). It then sends messages to lupo to queue reindexing for touched DOIs. To avoid performance impacts, events are pulled by
`updated_at` in day increments, only relevant events are selected, and DOI IDs are de-duplicated before being sent to lupo.
Companion lupo PR here: datacite/lupo#1492
Related to: datacite/datacite#2498
Approach
Open Questions and Pre-Merge TODOs
Learning
Types of changes
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Reviewer, please remember our guidelines:
Summary by CodeRabbit
New Features
Chores