Add Trending Field to Solr #10057

Merged
cdrini merged 20 commits into internetarchive:master from benbdeitch:third-try-trending
Jun 19, 2025

Conversation

@benbdeitch
Collaborator

Closes #7429

This PR adds support for trending scores to Solr, allowing us to better track which works are achieving a statistically notable increase in popularity. It adds several new fields and comes with two scripts, one run daily and the other hourly, to keep this information up to date.

It's still in draft mode, as there is currently no code to run the scripts automatically.

Technical

This implementation uses Solr's ability to update documents in place, which requires the new trending fields to be neither stored nor indexed, and instead to be backed by docValues. Essentially, they are left out of Solr's inverted index and kept as a more conventional document-to-value mapping.

This is both A) more performant than atomic updates, and B) avoids the issues that atomic updates can have with copyField values.
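As a rough illustration of what such an update can look like, the sketch below builds a JSON payload that uses Solr's "set" modifier on one trending field. The helper name build_inplace_update is hypothetical and is not taken from this PR; the unique key field "key" and the field name trending_score_hourly_0 are taken from elsewhere in this thread. Whether Solr applies the update in place depends on the schema settings described above.

```python
import json


def build_inplace_update(work_key: str, fields: dict) -> dict:
    """Build one Solr update document that sets each given field.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    doc: dict = {"key": work_key}
    for name, value in fields.items():
        # {"set": value} is Solr's partial-update modifier syntax.
        doc[name] = {"set": value}
    return doc


payload = [build_inplace_update("/works/OL54120W", {"trending_score_hourly_0": 3})]
body = json.dumps(payload)  # would be POSTed to Solr's /update handler
```

The body would then be sent to the update handler; the request itself is left out here because the client code is not shown in this description.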

The relevant cron commands are located in an added file, docker/cron.local.

Testing

  1. Delete your solr container and all related volumes.
  2. Run docker compose up.
  3. In your local Solr instance, search for a work (e.g. key:"/works/OL54120W") and check that the new fields are present.
  4. Save a work to your 'want-to-read' list.
  5. Set up a docker/cron.local file to run the cron jobs in, along with a new container. Change the schedule of the cron tasks so they run more frequently; * * * * * will make them run every minute.
  6. Make sure the container has access to both the dbnet and webnet networks, and declares depends_on: db.
  7. After a minute or so, run the search on Solr again, and see if the appropriate trending fields have updated. You can also check the logs of the cron-jobs container in Docker, to see if they're running correctly.
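The field check in steps 3 and 7 could be scripted roughly as follows. This is a sketch under assumptions: the field counts (24 hourly slots, 7 daily slots, one sum, one z-score) come from the README comment later in this thread, the trending_score_daily_N naming is assumed by analogy with trending_score_hourly_0, and both function names are hypothetical.

```python
def expected_trending_fields() -> list:
    """List the trending field names this sketch assumes the schema defines."""
    hourly = [f"trending_score_hourly_{i}" for i in range(24)]
    # Assumed naming for the daily slots; the thread does not spell it out.
    daily = [f"trending_score_daily_{i}" for i in range(7)]
    return hourly + daily + ["trending_score_hourly_sum", "trending_z_score"]


def missing_trending_fields(solr_doc: dict) -> list:
    """Return the expected trending fields that are absent from a Solr doc."""
    return [name for name in expected_trending_fields() if name not in solr_doc]
```

Passing the document returned by the search in step 3 to missing_trending_fields should give an empty list once the schema and scripts are in place.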

Screenshot

Stakeholders

@cdrini

@github-actions github-actions bot added the Priority: 2 Important, as time permits. [managed] label Nov 20, 2024
Collaborator

@cdrini cdrini left a comment

Niiiiice! Getting super close; next week after these changes, let's start adding these fields to prod solr I think!

@cdrini
Collaborator

cdrini commented Dec 11, 2024

Oh I forgot, also add a dummy override of the get_trending_scores method to

class LocalPostgresDataProvider(DataProvider):
with 0s
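A dummy override of this kind might look like the sketch below. The DataProvider stand-in, the method signature, and the exact set of returned keys are assumptions made for illustration; the real base class and signature live in openlibrary/solr/data_provider.py and may differ (for example, the real method may be async).

```python
class DataProvider:
    """Stand-in for the real base class, reduced to the method of interest."""

    def get_trending_scores(self, work_key: str) -> dict:
        raise NotImplementedError


class LocalPostgresDataProvider(DataProvider):
    def get_trending_scores(self, work_key: str) -> dict:
        # Local environments have no trending data, so report zeros everywhere.
        scores: dict = {f"trending_score_hourly_{i}": 0 for i in range(24)}
        # Assumed naming for the daily slots.
        scores.update({f"trending_score_daily_{i}": 0 for i in range(7)})
        scores["trending_score_hourly_sum"] = 0
        scores["trending_z_score"] = 0.0
        return scores
```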

<field name="ratings_count_5" type="pint"/>

<!-- Trending related values-->
<field name = "trending_score_hourly_0" type="pint" indexed="false" stored ="false"/>
Collaborator

Note: The field settings here are something we might want to investigate if we see perf issues.

@cdrini
Collaborator

cdrini commented Jan 10, 2025

We created these fields on our staging solr, and then pulled down the code to run it on ol-home0. It hit a few snags:

  1. The 10s default timeout in openlibrary/utils/solr.py was too short for the /export call; we'll need a way to extend it in this case, but not always.
  2. We hit this error:
Traceback (most recent call last):
  File "/openlibrary/scripts/calculate_trending_scores_hourly.py", line 139, in <module>
    FnToCLI(main).run()
  File "/openlibrary/scripts/solr_builder/solr_builder/fn_to_cli.py", line 89, in run
    return self.fn(**args_dicts)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/openlibrary/scripts/calculate_trending_scores_hourly.py", line 122, in main
    form_inplace_updates(
  File "/openlibrary/scripts/calculate_trending_scores_hourly.py", line 103, in form_inplace_updates
    "set": solr_doc["trending_score_hourly_sum"]
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'trending_score_hourly_sum'

Seems like a method had its types updated and a reference wasn't updated accordingly.

Next steps:

  1. Give the code another CR, we think some fixes may have gotten lost in the rebase
  2. Run it locally one more time to see if things are going smoothly

@cdrini cdrini added Needs: Special Deploy This PR will need a non-standard deploy to production Needs: Testing labels Jan 10, 2025
@mekarpeles
Member

mekarpeles commented May 29, 2025

There are ~50k unique works per 10 days:
SELECT COUNT(DISTINCT(work_id)) FROM bookshelves_books WHERE updated >= localtimestamp - interval '10 days'

Running the following on ol-db1 will get us 10 days of anonymized data for computing trending:

sudo -u postgres psql -d openlibrary -c "\copy (SELECT work_id, bookshelf_id, edition_id, updated, created FROM bookshelves_books WHERE updated >= localtimestamp - interval '10 days') TO '/tmp/trending-10day.csv' WITH CSV HEADER"

From ol-dev1:

scp ol-db1.us.archive.org:/tmp/trending-10day.csv /opt/openlibrary/static/

On local instances, this can be imported with:

curl https://testing.openlibrary.org/static/trending-10day.csv -o /tmp/trending-10day.csv
docker cp /tmp/trending-10day.csv openlibrary-db-1:/tmp
docker exec -it openlibrary-db-1 bash

And then in docker:

psql -U postgres -d openlibrary
ALTER TABLE bookshelves_books ALTER COLUMN username SET DEFAULT MD5(random()::text);
COPY bookshelves_books (work_id, bookshelf_id, edition_id, updated, created) FROM '/tmp/trending-10day.csv' WITH (FORMAT CSV, HEADER TRUE);
ALTER TABLE bookshelves_books ALTER COLUMN username DROP DEFAULT;

@cdrini
Collaborator

cdrini commented Jun 12, 2025

Note on next steps:

Update the code to add a new solr field, trending_timestamp, holding the timestamp that the trending fields are relative to. This will let us monitor whether the scripts have run correctly.

  1. Make sure the scripts run locally. Add some books to reading log, check the numbers are updating.
  2. Ensure that our new in place updating of solr records does not delete any data.
  3. Add the new fields to staging solr
  4. Get latest anonymized data using script above and load into local DB. Point the local instance to the staging solr. And then run the scripts.
    • Expect the trending numbers to populate in staging solr
    • We can also search in the local environment and sort by trending to see if it has reasonable results
  5. Ensure the cron-wiring will also work on prod
  6. Pull down this PR into solr-updater-next on ol-home0 to ensure cron works correctly.
  7. Monitor testing (make sure it's pointed at ol-solr1!) to see how the numbers are looking. Monitor for about a week to make sure the numbers are updating/moving through.
  8. Add the new fields to prod solr
  9. Merge the PR & Deploy.
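The trending_timestamp field proposed in the note above could support a staleness check along these lines. This is only a sketch: the function name and the two-hour threshold are assumptions for illustration, not values from this PR.

```python
from datetime import datetime, timedelta, timezone


def is_trending_stale(
    trending_timestamp: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=2),  # assumed threshold
) -> bool:
    """Report whether the trending fields were last refreshed too long ago."""
    return now - trending_timestamp > max_age


# Example: the hourly job last ran three hours before "now".
now = datetime(2025, 6, 12, 12, 0, tzinfo=timezone.utc)
last_run = now - timedelta(hours=3)
stale = is_trending_stale(last_run, now)
```

A monitor could raise an alert whenever such a check returns True for a sample of works.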

@mekarpeles
Member

mekarpeles commented Jun 17, 2025

Trending README

TODO:

  • Verifying Z-score?

Solr Fields

  • 24 slots of trending_score_hourly
  • 1 trending_score_hourly_sum
  • 7 slots of trending_score_daily
  • 1 overall trending_z_score - based on trending_score_hourly_sum & 7 trending_score_dailies
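The thread does not spell out the z-score formula, and the TODO above still lists verifying it. One plausible reading, shown here purely as an assumption, is a standard score of the last 24 hours' total against the seven daily values. The function name, the use of a population standard deviation, and the handling of a flat history are all choices made for this sketch.

```python
from statistics import mean, pstdev


def trending_z_score(hourly_sum: float, daily_scores: list) -> float:
    """Standard score of the 24h total against the daily history (assumed formula)."""
    baseline = mean(daily_scores)
    spread = pstdev(daily_scores)
    if spread == 0:
        # Convention chosen for this sketch only: a flat history yields 0.0.
        return 0.0
    return (hourly_sum - baseline) / spread
```

With daily values 1 through 7 (mean 4, population standard deviation 2), a 24-hour total of 8 sits two standard deviations above the baseline.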

Components

  1. Two scripts (cron):
  2. Solr Updater:
  • data_provider.py -- we need to add a new method to solr provider so that it can fetch data from itself (solr) not just the database, archive.org metadata, etc.

How it works: Crons & Updating sliding window

Every hour we query the database and write the update counts for the current hour into the currently selected hour bucket. If/when we hit the last bucket, we roll over and reuse the first bucket. It's the job of daily.py to aggregate these hourly counts for the last 24h and turn them into a sum for the current day. When we progress 1 hour, one hour falls out of the sliding window, so its contribution to the sum is subtracted and the new hour we compute is added. This avoids having to recalculate the other 22 hours, which are unchanged (i.e. only the head and tail of the 24 hours are affected at any given hour change). Daily will always clobber the head day and not have to worry about the tail day.
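The rolling-sum idea can be sketched with a small in-memory ring buffer. The class and method names are hypothetical, and the real scripts keep these values in Solr fields rather than in a Python object; the point is only that each update touches one bucket and adjusts the sum by the difference.

```python
class HourlyWindow:
    """Ring buffer of hourly counts with an incrementally maintained sum."""

    def __init__(self, slots: int = 24):
        self.buckets = [0] * slots
        self.total = 0

    def record(self, hour_index: int, count: int) -> int:
        """Overwrite the bucket for this hour and return the updated sum."""
        slot = hour_index % len(self.buckets)  # roll over after the last bucket
        # Subtract the value falling out of the window, add the new one.
        self.total += count - self.buckets[slot]
        self.buckets[slot] = count
        return self.total
```

After any number of record calls, total equals the sum of the buckets without the other buckets ever being re-read.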

@cdrini cdrini force-pushed the third-try-trending branch from 868f029 to 69f15d5 Compare June 17, 2025 22:29
@codecov-commenter

codecov-commenter commented Jun 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 17.10%. Comparing base (e027195) to head (ab41d53).
Report is 99 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10057      +/-   ##
==========================================
- Coverage   17.14%   17.10%   -0.05%     
==========================================
  Files          91       91              
  Lines        4981     5000      +19     
  Branches      867      870       +3     
==========================================
+ Hits          854      855       +1     
- Misses       3588     3603      +15     
- Partials      539      542       +3     


@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jun 18, 2025
@cdrini cdrini force-pushed the third-try-trending branch 5 times, most recently from 8dc1267 to e2c3bda Compare June 18, 2025 18:44
@cdrini cdrini requested a review from Copilot June 18, 2025 18:47
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds support for trending scores in Solr by introducing new schema fields, update-in-place support in the Solr client, data fetching in the data provider, and scheduled hourly/daily updater scripts.

  • Defines 24 hourly score fields, a daily score slot per weekday, a rolling sum, and a z-score in the Solr schema and typed dict.
  • Implements update_in_place in the Solr client and enriches the work builder to include trending data when enabled.
  • Adds standalone and integrated scheduler in scripts/utils/scheduler.py plus hourly and daily updater scripts under scripts/solr_updater/.

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/utils/scheduler.py | New OlAsyncIOScheduler with logging and optional Sentry monitoring. |
| scripts/solr_updater/*.py | Hourly/daily trending updater scripts and an orchestration entry. |
| scripts/solr_updater/solr_updater.py | Hooks in the trending scheduler into the main Solr-updater workflow. |
| openlibrary/utils/solr.py | Adds update_in_place method for partial document updates. |
| openlibrary/solr/data_provider.py | Adds methods to fetch trending scores from Solr. |
| openlibrary/solr/updater/work.py | Merges trending scores into the Solr document when get_solr_next(). |
| openlibrary/solr/solr_types.py | Declares new trending fields in the typed dictionary. |
| conf/solr/conf/managed-schema.xml | Declares new fields for trending scores (hourly, daily, sum, z-score). |

Comments suppressed due to low confidence (2)

openlibrary/utils/solr.py:76

  • Passing the Python bool commit directly in the URL may produce True/False, which Solr may not recognize. Consider converting it to lowercase strings ('true'/'false') or to int(commit).
            },

openlibrary/solr/data_provider.py:372

  • The returned dict omits the trending_z_score key, which exists in the schema and typed dict. Consider adding trending_z_score with a default (e.g., 0.0) to ensure consumers always see this field.
            f'trending_score_hourly_{index}': reply.get(
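The first suppressed suggestion can be illustrated with the standard library alone. The sketch below shows how a Python bool is rendered when URL-encoded and one way to make the value explicit; the helper name solr_bool is hypothetical and not part of the PR.

```python
from urllib.parse import urlencode


def solr_bool(value: bool) -> str:
    """Render a Python bool as a lowercase string for use in a query string."""
    return "true" if value else "false"


# urlencode calls str() on the value, so the bool keeps its capital letter.
naive = urlencode({"commit": True})
explicit = urlencode({"commit": solr_bool(True)})
```

Here naive is "commit=True" and explicit is "commit=true"; converting explicitly removes any dependence on how the server parses the capitalized form.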

@cdrini cdrini force-pushed the third-try-trending branch 2 times, most recently from 750b3f8 to f808466 Compare June 19, 2025 02:34
Collaborator

@cdrini cdrini left a comment

This will be merged but behind the get_solr_next() flag. Local environments will need to update their solr to support the new schema. Prod solr will not be affected until we change prod's get_solr_next() flag as we prepare for the next full reindex.

We also have an issue: get_trending_data makes an individual solr request for every work, and it has become the bottleneck. It should be updated to preload/batch like the other data provider methods.
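One possible shape for such batching is sketched below: split the work keys into chunks and build a single query per chunk instead of one request per work. Both helper names and any batch size are assumptions for illustration, and the way the other data provider methods actually preload is not shown in this thread.

```python
from collections.abc import Iterator


def chunked(keys: list, size: int) -> Iterator:
    """Yield consecutive slices of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def build_batch_query(keys: list) -> str:
    """Build one Solr query matching every key in the batch."""
    quoted = " OR ".join(f'"{key}"' for key in keys)
    return f"key:({quoted})"


# Three works fetched in batches of two: two requests instead of three.
queries = [
    build_batch_query(batch)
    for batch in chunked(["/works/OL1W", "/works/OL2W", "/works/OL3W"], 2)
]
```

Each query string would replace several single-work lookups, with the results then distributed back to the individual works by key.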

The results can be monitored on testing here: https://testing.openlibrary.org/search?q=trending_score_hourly_sum%3A[1+TO+*]&mode=everything&sort=trending

@cdrini cdrini force-pushed the third-try-trending branch from f808466 to f3ca7cd Compare June 19, 2025 13:13
benbdeitch and others added 19 commits June 19, 2025 15:17
Co-authored-by: Drini Cami <cdrini@gmail.com>
Co-authored-by: Michael E. Karpeles (mek) <michael.karpeles@gmail.com>
+ Fix issue with _init_path no longer being accessible due to nesting
Replace the old --solr-url CLI argument, which does not affect
other spots where solr is initialized, like get_solr()
@cdrini cdrini force-pushed the third-try-trending branch from f3ca7cd to 13c1a24 Compare June 19, 2025 13:18
@cdrini cdrini merged commit a8ccac2 into internetarchive:master Jun 19, 2025
4 checks passed

Labels

  • Needs: Response (Issues which require feedback from lead)
  • Needs: Special Deploy (This PR will need a non-standard deploy to production)
  • Needs: Testing
  • Priority: 2 (Important, as time permits. [managed])

Development

Successfully merging this pull request may close these issues.

Add trending score to solr

4 participants