Add Trending Field to Solr by benbdeitch · Pull Request #10057 · internetarchive/openlibrary

benbdeitch · 2024-11-20T23:49:02Z

This PR adds support for trending scores to Solr, allowing us to better track which works are achieving a statistically notable increase in popularity. It adds several new fields and comes with two scripts, one run daily and the other hourly, to keep this information up to date.

It's still in draft mode, as there is no code yet to run the scripts automatically.

Technical

This implementation uses Solr's ability to update documents in place, which requires the new trending fields to be neither stored nor indexed, and instead kept as docValues. Essentially, they are left out of Solr's inverted index and held in a column-oriented document-to-value mapping.

This approach is both A) more performant than atomic updates and B) free of the issues that atomic updates can have with copyField values.
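As a rough illustration of what an in-place update body can look like, here is a minimal sketch. Solr only applies a `{"set": ...}` modifier in place when every updated field is a single-valued, non-indexed, non-stored numeric docValues field; otherwise it falls back to a regular atomic update. The helper name `build_inplace_update` is hypothetical, and `key` is assumed to be the unique key field, as suggested by the `key:"/works/OL54120W"` query in the testing steps.

```python
import json


def build_inplace_update(work_key: str, fields: dict[str, float]) -> dict:
    """Build one update document that sets each field via a {"set": value} modifier."""
    doc: dict = {"key": work_key}  # assumption: "key" is the unique key field
    for name, value in fields.items():
        doc[name] = {"set": value}
    return doc


# Example: set the first hourly bucket and the rolling sum for one work.
updates = [
    build_inplace_update(
        "/works/OL54120W",
        {"trending_score_hourly_0": 3, "trending_score_hourly_sum": 3},
    )
]
payload = json.dumps(updates)  # JSON body that would be POSTed to Solr's update handler
```

Only the payload shape is shown; how it is sent (URL, commit policy, timeout) is left out on purpose.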

The relevant cron commands are located in an added file, docker/cron.local

Delete your solr container and all related volumes.
Run docker compose up.
In your local Solr instance, run a search for a work (e.g. key:"/works/OL54120W") and check that the new fields are present.
Save a work to your 'want-to-read' list.
Set up a docker/cron.local file to run the cron jobs in, along with a new container. Change the times on the cron tasks to run more frequently; (* * * * *) will make them run every minute.
Make sure the container has access to both the dbnet and webnet networks, and has depends_on: db.
After a minute or so, run the search on Solr again, and see if the appropriate trending fields have updated. You can also check the logs of the cron-jobs container in Docker, to see if they're running correctly.

Screenshot

Stakeholders

@cdrini

cdrini

Niiiiice! Getting super close; next week after these changes, let's start adding these fields to prod solr I think!

docker/cron.local

openlibrary/solr/data_provider.py

openlibrary/tests/solr/test_update.py

scripts/calculate_trending_scores_hourly.py

cdrini · 2024-12-11T18:56:31Z

Oh I forgot, also add a dummy override of the get_trending_scores method, with 0s, to LocalPostgresDataProvider:

openlibrary/scripts/solr_builder/solr_builder/solr_builder.py

Line 56 in 694bd37

class LocalPostgresDataProvider(DataProvider):
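A minimal sketch of what such a dummy override might look like. The method signature and return shape are assumptions, the `DataProvider` class here is a stand-in so the snippet runs on its own, and the field names follow the Solr fields listed in the trending README later in this thread.

```python
class DataProvider:
    """Stand-in for the real base class, included only so this sketch is runnable."""

    def get_trending_scores(self, work_key: str) -> dict[str, float]:
        raise NotImplementedError


class LocalPostgresDataProvider(DataProvider):
    def get_trending_scores(self, work_key: str) -> dict[str, float]:
        # Dummy values: every trending field is reported as 0.
        scores: dict[str, float] = {
            f"trending_score_hourly_{i}": 0 for i in range(24)
        }
        scores.update({f"trending_score_daily_{i}": 0 for i in range(7)})
        scores["trending_score_hourly_sum"] = 0
        scores["trending_z_score"] = 0.0
        return scores
```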

cdrini · 2025-01-10T16:24:10Z

conf/solr/conf/managed-schema.xml

    <field name="ratings_count_5" type="pint"/>

+    <!-- Trending related values-->
+    <field name = "trending_score_hourly_0" type="pint" indexed="false" stored ="false"/>


Note: The field settings here are something we might want to investigate if we see perf issues.

cdrini · 2025-01-10T17:58:59Z

We created these fields on our staging solr, and then pulled down the code to run it on ol-home0. It hit a few snags:

The 10s default timeout in openlibrary/utils/solr.py was too short for the /export call; we'll need a way to extend this in this case, but not always
We hit this error:

Traceback (most recent call last):
  File "/openlibrary/scripts/calculate_trending_scores_hourly.py", line 139, in <module>
    FnToCLI(main).run()
  File "/openlibrary/scripts/solr_builder/solr_builder/fn_to_cli.py", line 89, in run
    return self.fn(**args_dicts)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/openlibrary/scripts/calculate_trending_scores_hourly.py", line 122, in main
    form_inplace_updates(
  File "/openlibrary/scripts/calculate_trending_scores_hourly.py", line 103, in form_inplace_updates
    "set": solr_doc["trending_score_hourly_sum"]
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'trending_score_hourly_sum'

Seems like a method had its types updated and a reference wasn't updated accordingly.
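For the first snag (the 10s default timeout), one possible shape is a per-call override that leaves the default alone for everything else. This is only a sketch: `solr_get`, `fake_get`, and the URLs are hypothetical and not taken from openlibrary/utils/solr.py.

```python
from typing import Any, Callable, Optional

DEFAULT_TIMEOUT = 10  # seconds, mirroring the 10s default described above


def solr_get(
    http_get: Callable[..., Any],
    url: str,
    params: Optional[dict[str, Any]] = None,
    timeout: Optional[float] = None,
) -> Any:
    """Issue a GET, using the default timeout unless the caller overrides it."""
    effective_timeout = DEFAULT_TIMEOUT if timeout is None else timeout
    return http_get(url, params=params or {}, timeout=effective_timeout)


# Demo with a fake transport that simply records what it was called with.
def fake_get(url: str, params: dict[str, Any], timeout: float) -> dict[str, Any]:
    return {"url": url, "params": params, "timeout": timeout}


quick = solr_get(fake_get, "http://solr:8983/solr/openlibrary/select", {"q": "*:*"})
slow = solr_get(
    fake_get, "http://solr:8983/solr/openlibrary/export", {"q": "*:*"}, timeout=120
)
```

A real transport such as `requests.get` accepts the same `url`, `params`, and `timeout` arguments, so it could be passed in place of `fake_get`.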

Next steps:

Give the code another CR, we think some fixes may have gotten lost in the rebase
Run it locally one more time to see if things are going smoothly

mekarpeles · 2025-05-29T19:06:14Z

There are ~50k unique works per 10 days:
SELECT COUNT(DISTINCT(work_id)) FROM bookshelves_books WHERE updated >= localtimestamp - interval '10 days'

This will get us 10 days of anonymized data for computing trending on ol-db1

sudo -u postgres psql -d openlibrary -c "\copy (SELECT work_id, bookshelf_id, edition_id, updated, created FROM bookshelves_books WHERE updated >= localtimestamp - interval '10 days') TO '/tmp/trending-10day.csv' WITH CSV HEADER"

From ol-dev1:

scp ol-db1.us.archive.org:/tmp/trending-10day.csv /opt/openlibrary/static/

On local instances, this can be imported with:

curl https://testing.openlibrary.org/static/trending-10day.csv -o /tmp/trending-10day.csv
docker cp /tmp/trending-10day.csv openlibrary-db-1:/tmp
docker exec -it openlibrary-db-1 bash

And then in docker:

psql -U postgres -d openlibrary
ALTER TABLE bookshelves_books ALTER COLUMN username SET DEFAULT MD5(random()::text);
COPY bookshelves_books (work_id, bookshelf_id, edition_id, updated, created) FROM '/tmp/trending-10day.csv' WITH (FORMAT CSV, HEADER TRUE);
ALTER TABLE bookshelves_books ALTER COLUMN username DROP DEFAULT;

cdrini · 2025-06-12T19:28:50Z

Note on next steps:

Update code to have a new solr field, trending_timestamp, holding the timestamp to which the trending fields are relative. This will let us monitor if it hasn't run correctly.

Make sure the scripts run locally. Add some books to reading log, check the numbers are updating.
Ensure that our new in place updating of solr records does not delete any data.
Add the new fields to staging solr
Get latest anonymized data using script above and load into local DB. Point the local instance to the staging solr. And then run the scripts.
- Expect the trending numbers to populate in staging solr
- We can also search in the local environment and sort by trending to see if it has reasonable results
Ensure the cron-wiring will also work on prod
Pull down this PR into solr-updater-next on ol-home0 to ensure cron works correctly.
Monitor testing (make sure it's pointed to ol-solr1!) to see how the numbers are looking. Monitor for about a week to make sure the numbers are updating/moving through
Add the new fields to prod solr
Merge the PR & Deploy.
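The monitoring idea behind the trending_timestamp step could be sketched roughly as follows. This assumes the timestamp is available as a timezone-aware datetime; the function name and the two-hour threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_trending_stale(
    trending_timestamp: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=2),  # hypothetical threshold
) -> bool:
    """Return True if the trending fields were last refreshed too long ago."""
    return now - trending_timestamp > max_age


now = datetime(2025, 6, 12, 20, 0, tzinfo=timezone.utc)
fresh = is_trending_stale(datetime(2025, 6, 12, 19, 30, tzinfo=timezone.utc), now)
stale = is_trending_stale(datetime(2025, 6, 12, 15, 0, tzinfo=timezone.utc), now)
```

With an hourly job, a threshold somewhat above one hour would tolerate a single slow run while still flagging a job that has stopped.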

mekarpeles · 2025-06-17T18:24:22Z

Trending README

TODO:

Verifying Z-score?

Solr Fields

24 slots of trending_score_hourly
1 trending_score_hourly_sum
7 slots of trending_score_daily
1 overall trending_z_score - based on trending_score_hourly_sum & 7 trending_score_dailies
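The exact z-score formula is not spelled out in this thread (and is listed as a TODO to verify), so the following is only one plausible reading: compare the last-24h sum against the mean and population standard deviation of the 7 daily slots. Both the formula and the function name are assumptions.

```python
from statistics import mean, pstdev


def trending_z_score(hourly_sum: float, daily_scores: list[float]) -> float:
    """Z-score of the last-24h sum relative to the stored daily totals."""
    mu = mean(daily_scores)
    sigma = pstdev(daily_scores)
    if sigma == 0:
        # No variation in the daily history: treat the work as not trending.
        return 0.0
    return (hourly_sum - mu) / sigma


z = trending_z_score(10, [1, 2, 3, 4, 5, 6, 7])  # mean 4, population std dev 2
```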

Components

Two scripts (cron):

Solr Updater:

data_provider.py -- we need to add a new method to the data provider so that it can fetch data from Solr itself, not just the database, archive.org metadata, etc.

How it works: Crons & Updating sliding window

Every hour we query the database and add the update counts for the current hour into the currently selected hour bucket. If/when we hit the last bucket, we'll roll over and re-use the first bucket. It's the job of daily.py to aggregate these hourly counts for the last 24h and turn them into a sum for the current day. When we progress 1 hour, one hour will fall out of the sliding window, so its contribution to the sum will be subtracted, and the new hour we compute will be added. This avoids having to recalculate the other 23 hours, which are unchanged (i.e. only the head and tail of the window are affected at any given hour change). Daily will always clobber the head day and not have to worry about the tail day.
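The sliding-window bookkeeping described above can be sketched as a small ring buffer. This is an in-memory illustration only; the class and method names are hypothetical, and the real scripts keep these values in Solr fields rather than in a Python object.

```python
class HourlyWindow:
    """Fixed number of hourly buckets plus a running sum, updated in O(1) per hour."""

    def __init__(self, slots: int = 24) -> None:
        self.buckets = [0] * slots
        self.total = 0

    def record_hour(self, hour_index: int, count: int) -> int:
        """Overwrite the bucket for this hour and adjust the running sum."""
        slot = hour_index % len(self.buckets)  # roll over to the first bucket
        # Subtract the hour falling out of the window, add the new hour.
        self.total += count - self.buckets[slot]
        self.buckets[slot] = count
        return self.total


window = HourlyWindow()
for hour in range(24):
    window.record_hour(hour, 1)  # one update in each of the first 24 hours
after_rollover = window.record_hour(24, 5)  # hour 24 reuses bucket 0
```

Because only one bucket changes per hour, the sum never has to be rebuilt from all of the buckets.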

codecov-commenter · 2025-06-17T23:54:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 17.10%. Comparing base (e027195) to head (ab41d53).
Report is 99 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10057      +/-   ##
==========================================
- Coverage   17.14%   17.10%   -0.05%     
==========================================
  Files          91       91              
  Lines        4981     5000      +19     
  Branches      867      870       +3     
==========================================
+ Hits          854      855       +1     
- Misses       3588     3603      +15     
- Partials      539      542       +3


Copilot

Pull Request Overview

This PR adds support for trending scores in Solr by introducing new schema fields, update-in-place support in the Solr client, data fetching in the data provider, and scheduled hourly/daily updater scripts.

Defines 24 hourly score fields, a daily score slot per weekday, a rolling sum, and a z-score in the Solr schema and typed dict.
Implements update_in_place in the Solr client and enriches the work builder to include trending data when enabled.
Adds standalone and integrated scheduler in scripts/utils/scheduler.py plus hourly and daily updater scripts under scripts/solr_updater/.

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
scripts/utils/scheduler.py	New `OlAsyncIOScheduler` with logging and optional Sentry monitoring.
scripts/solr_updater/*.py	Hourly/daily trending updater scripts and an orchestration entry.
scripts/solr_updater/solr_updater.py	Hooks in the trending scheduler into the main Solr-updater workflow.
openlibrary/utils/solr.py	Adds `update_in_place` method for partial document updates.
openlibrary/solr/data_provider.py	Adds methods to fetch trending scores from Solr.
openlibrary/solr/updater/work.py	Merges trending scores into the Solr document when `get_solr_next()`.
openlibrary/solr/solr_types.py	Declares new trending fields in the typed dictionary.
conf/solr/conf/managed-schema.xml	Declares new fields for trending scores (hourly, daily, sum, z-score).

Comments suppressed due to low confidence (2)

openlibrary/utils/solr.py:76

Passing the Python bool commit directly in the URL may produce True/False, which Solr may not recognize. Consider converting it to lowercase strings ('true'/'false') or to int(commit).

},
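To illustrate the point about the boolean, here is a small sketch of how a Python bool ends up in a query string and one way to normalize it. The helper name `solr_bool` is hypothetical, and whether Solr accepts the capitalized form is not verified here.

```python
from urllib.parse import urlencode


def solr_bool(value: bool) -> str:
    """Render a Python bool as the lowercase string form."""
    return "true" if value else "false"


naive = urlencode({"commit": True})  # str(True) is used, giving "commit=True"
explicit = urlencode({"commit": solr_bool(True)})  # "commit=true"
```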

openlibrary/solr/data_provider.py:372

The returned dict omits the trending_z_score key, which exists in the schema and typed dict. Consider adding trending_z_score with a default (e.g., 0.0) to ensure consumers always see this field.

            f'trending_score_hourly_{index}': reply.get(

openlibrary/solr/data_provider.py

conf/solr/conf/managed-schema.xml

cdrini

This will be merged but behind the get_solr_next() flag. Local environments will need to update their solr to support the new schema. Prod solr will not be affected until we change prod's get_solr_next() flag as we prepare for the next full reindex.

We also have an issue that get_trending_data makes individual solr requests for every work; it's become the bottleneck. That should be updated to preload/batch like the other data provider methods.
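A rough sketch of the batching idea, with the actual Solr call abstracted behind an injected function. The names `chunked`, `preload_trending`, and `fake_fetch`, and the batch size of 100, are all hypothetical.

```python
from typing import Callable, Iterator


def chunked(keys: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start : start + size]


def preload_trending(
    keys: list[str],
    fetch_batch: Callable[[list[str]], dict[str, dict]],
    batch_size: int = 100,
) -> dict[str, dict]:
    """Fetch trending data with one request per batch instead of one per work."""
    cache: dict[str, dict] = {}
    for batch in chunked(keys, batch_size):
        cache.update(fetch_batch(batch))
    return cache


# Demo with a fake fetcher that counts how many requests would be made.
calls: list[int] = []


def fake_fetch(batch: list[str]) -> dict[str, dict]:
    calls.append(len(batch))
    return {key: {"trending_score_hourly_sum": 0} for key in batch}


work_keys = [f"/works/OL{i}W" for i in range(250)]
cache = preload_trending(work_keys, fake_fetch)
```

In this demo, 250 works are covered by 3 batched calls rather than 250 individual ones.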

The results can be monitored on testing here: https://testing.openlibrary.org/search?q=trending_score_hourly_sum%3A[1+TO+*]&mode=everything&sort=trending

github-actions bot assigned cdrini Nov 20, 2024

github-actions bot added the Priority: 2 Important, as time permits. [managed] label Nov 20, 2024

benbdeitch force-pushed the third-try-trending branch from 1c515e9 to 1226bbf Compare November 28, 2024 23:15

mekarpeles mentioned this pull request Dec 2, 2024

7429/feature/add trending score to solr #9878

Closed

benbdeitch force-pushed the third-try-trending branch from aecc6fd to 5990936 Compare December 4, 2024 18:01

cdrini requested changes Dec 11, 2024

cdrini reviewed Jan 10, 2025

benbdeitch force-pushed the third-try-trending branch from b16269d to 44ce878 Compare January 10, 2025 16:56

cdrini added Needs: Special Deploy This PR will need a non-standard deploy to production Needs: Testing labels Jan 10, 2025

cdrini force-pushed the third-try-trending branch from 868f029 to 69f15d5 Compare June 17, 2025 22:29

github-actions bot added the Needs: Response Issues which require feedback from lead label Jun 18, 2025

cdrini force-pushed the third-try-trending branch 5 times, most recently from 8dc1267 to e2c3bda Compare June 18, 2025 18:44

cdrini requested a review from Copilot June 18, 2025 18:47

Copilot AI reviewed Jun 18, 2025

openlibrary/solr/data_provider.py (outdated, resolved)

conf/solr/conf/managed-schema.xml (outdated, resolved)

cdrini force-pushed the third-try-trending branch 2 times, most recently from 750b3f8 to f808466 Compare June 19, 2025 02:34

cdrini approved these changes Jun 19, 2025

cdrini force-pushed the third-try-trending branch from f808466 to f3ca7cd Compare June 19, 2025 13:13

Redone commits.

d1fcda3

benbdeitch and others added 19 commits June 19, 2025 15:17

Added functionality for WorkSolrUpdater to call fields from Solr when…

bad5744

… updating.

Adjusted DataProvider to use a call to Solr when updating a work.

1e8f64f

Fixed import statements for the new precommit rules.

0de5041

Ruff linting fixes.

d7c253e

Fixed erroneous SQL, mismatched dataprovider functions, and improved …

bfaa32b

…function names.

Removed residue

32b1aaf

Delete docker/cron.local

29baaaa

Update scripts/calculate_trending_scores_daily.py

bf6dd77

Co-authored-by: Drini Cami <cdrini@gmail.com>

Code review, misc cleanups, and trending initialization script

2a4a7a0

Co-authored-by: Michael E. Karpeles (mek) <michael.karpeles@gmail.com>

Move trending and solr_updater into new dir: scripts/solr_updater

6e25510

Move trending updater scheduling into solr-updater

ea13421

+ Fix issue with _init_path no longer being accessible due to nesting

Fix wget no longer in olbase

abb6e08

Move OlAsyncIOScheduler to shared utils dir

94c67a6

Add sentry support to OlAsyncIOScheduler + trending updater

a125924

Replace solr-updater --trending with --solr-next

9b49933

Move get_solr_trending_scores off DataProvider base class; will break…

c860c78

… solr builder

Print scheduler job exception if it errors

7747dc9

Make get_trending_data async + fix query + use /get

c7a9766

Create OL_SOLR_BASE_URL env param for solr-updater

13c1a24

Replace the old --solr-url CLI argument, which does not affect other spots where solr is initialized, like get_solr()

cdrini force-pushed the third-try-trending branch from f3ca7cd to 13c1a24 Compare June 19, 2025 13:18

cdrini merged commit a8ccac2 into internetarchive:master Jun 19, 2025
4 checks passed

RayBB mentioned this pull request Jul 8, 2025

Add copilot instructions for GH code reviews #11006

Open
