Add Trending Field to Solr #10057
cdrini
left a comment
Niiiiice! Getting super close; next week after these changes, let's start adding these fields to prod solr I think!
Oh I forgot, also add a dummy override of the
conf/solr/conf/managed-schema.xml
Outdated
<field name="ratings_count_5" type="pint"/>

<!-- Trending related values-->
<field name = "trending_score_hourly_0" type="pint" indexed="false" stored ="false"/>
Note: The field settings here are something we might want to investigate if we see perf issues.
We created these fields on our staging solr, and then pulled down the code to run it on ol-home0. It hit a few snags:
Seems like a method had its types updated and a reference wasn't updated accordingly. Next steps:
There are roughly 50k unique works per 10 days. This will get us 10 days of anonymized data for computing trending on. From On local instances, this can be imported with: And then in docker:
Note on next steps: Update code to have a new solr field,
Trending README
TODO:
Solr Fields
Components
How it works: Crons & Updating sliding window
Every hour we query the database and add the counts for the current hour into the currently selected hour bucket. If/when we hit the last bucket, we roll over and reuse the first bucket. It's the job of daily.py to aggregate these hourly counts for the last 24h and turn them into a sum for the current day. When we progress 1 hour, one hour falls out of the sliding window, so its contribution to the sum is subtracted and the new hour we compute is added. This avoids having to recalculate the other 22 hours, which are unchanged (i.e. only the head and tail of the 24 hours are affected at any given hour change). Daily will always clobber the head day and not have to worry about the tail day.
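The subtract-the-tail, add-the-head idea described above can be sketched as a small ring buffer. This is an illustration only, not the code in this PR: the class name `HourlyWindow`, the method `record_hour`, and the choice of 24 in-memory buckets are assumptions made for the example, whereas the real scripts keep these values in Solr fields.

```python
from dataclasses import dataclass, field

HOURS_IN_WINDOW = 24  # assumption: one bucket per hour of the sliding window


@dataclass
class HourlyWindow:
    """Hypothetical ring buffer of hourly counts with a running sum."""

    buckets: list[int] = field(default_factory=lambda: [0] * HOURS_IN_WINDOW)
    total: int = 0

    def record_hour(self, hour_index: int, new_count: int) -> int:
        """Overwrite the bucket for this hour and adjust the sum incrementally."""
        slot = hour_index % HOURS_IN_WINDOW  # roll over to bucket 0 after the last one
        # Subtract the value that falls out of the window, add the new one.
        self.total += new_count - self.buckets[slot]
        self.buckets[slot] = new_count
        return self.total


window = HourlyWindow()
for hour in range(HOURS_IN_WINDOW):
    window.record_hour(hour, 1)
# The 25th hour reuses bucket 0, so only that bucket's old value is subtracted.
window.record_hour(HOURS_IN_WINDOW, 5)
```

Only one bucket is read and written per step, so the cost of keeping the sum current does not grow with the window length.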
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10057      +/-   ##
==========================================
- Coverage   17.14%   17.10%   -0.05%
==========================================
  Files          91       91
  Lines        4981     5000      +19
  Branches      867      870       +3
==========================================
+ Hits          854      855       +1
- Misses       3588     3603      +15
- Partials      539      542       +3
Pull Request Overview
This PR adds support for trending scores in Solr by introducing new schema fields, update-in-place support in the Solr client, data fetching in the data provider, and scheduled hourly/daily updater scripts.
- Defines 24 hourly score fields, a daily score slot per weekday, a rolling sum, and a z-score in the Solr schema and typed dict.
- Implements `update_in_place` in the Solr client and enriches the work builder to include trending data when enabled.
- Adds standalone and integrated scheduler in `scripts/utils/scheduler.py` plus hourly and daily updater scripts under `scripts/solr_updater/`.
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/utils/scheduler.py | New OlAsyncIOScheduler with logging and optional Sentry monitoring. |
| scripts/solr_updater/*.py | Hourly/daily trending updater scripts and an orchestration entry. |
| scripts/solr_updater/solr_updater.py | Hooks in the trending scheduler into the main Solr-updater workflow. |
| openlibrary/utils/solr.py | Adds update_in_place method for partial document updates. |
| openlibrary/solr/data_provider.py | Adds methods to fetch trending scores from Solr. |
| openlibrary/solr/updater/work.py | Merges trending scores into the Solr document when get_solr_next() is enabled. |
| openlibrary/solr/solr_types.py | Declares new trending fields in the typed dictionary. |
| conf/solr/conf/managed-schema.xml | Declares new fields for trending scores (hourly, daily, sum, z-score). |
Comments suppressed due to low confidence (2)
openlibrary/utils/solr.py:76
- Passing the Python bool `commit` directly in the URL may produce `True`/`False`, which Solr may not recognize. Consider converting it to lowercase strings (`'true'`/`'false'`) or to `int(commit)`.
},
openlibrary/solr/data_provider.py:372
- The returned dict omits the `trending_z_score` key, which exists in the schema and typed dict. Consider adding `trending_z_score` with a default (e.g., 0.0) to ensure consumers always see this field.
f'trending_score_hourly_{index}': reply.get(
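To make the first suppressed comment concrete, here is a minimal sketch of how a Python bool ends up in a query string and one way to normalise it. The helper names `solr_bool` and `build_update_params` are hypothetical and not part of `openlibrary/utils/solr.py`. Whether Solr actually rejects the capitalised form is not established in this thread, so the lowercase conversion is shown as a defensive choice only.

```python
from urllib.parse import urlencode


def solr_bool(value: bool) -> str:
    """Render a Python bool as the lowercase string form ('true'/'false')."""
    return 'true' if value else 'false'


def build_update_params(commit: bool) -> str:
    """Hypothetical helper: build the query string for an update request."""
    return urlencode({'commit': solr_bool(commit)})


# urlencode calls str() on the value, so a raw bool is capitalised.
print(urlencode({'commit': True}))  # commit=True
print(build_update_params(True))    # commit=true
```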
This will be merged but behind the get_solr_next() flag. Local environments will need to update their solr to support the new schema. Prod solr will not be affected until we change prod's get_solr_next() flag as we prepare for the next full reindex.
We also have an issue that get_trending_data makes an individual solr request for every work; it's become the bottleneck. That should be updated to preload/batch like the other data provider methods.
The results can be monitored on testing here: https://testing.openlibrary.org/search?q=trending_score_hourly_sum%3A[1+TO+*]&mode=everything&sort=trending
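One possible shape for the preload/batch follow-up mentioned above, as a rough sketch: group work keys into chunks and build a single query per chunk instead of one request per work. The helpers `chunked` and `build_key_query` are hypothetical, the sketch assumes work keys contain no double quotes, and it only builds query strings rather than calling the data provider.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def chunked(keys: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` keys."""
    it = iter(keys)
    while batch := list(islice(it, size)):
        yield batch


def build_key_query(keys: list[str]) -> str:
    """Build one Solr query matching any of the given work keys.

    Assumes keys such as "/works/OL54120W" that need no extra escaping.
    """
    quoted = ' OR '.join(f'"{key}"' for key in keys)
    return f'key:({quoted})'


work_keys = ['/works/OL1W', '/works/OL2W', '/works/OL3W']
for batch in chunked(work_keys, 2):
    print(build_key_query(batch))
```

With a chunk size in the hundreds, the number of round trips drops by roughly that factor. The right size would need measuring against the actual Solr instance.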
Co-authored-by: Drini Cami <cdrini@gmail.com>
Co-authored-by: Michael E. Karpeles (mek) <michael.karpeles@gmail.com>
+ Fix issue with _init_path no longer being accessible due to nesting
Replace the old --solr-url CLI argument, which does not affect other spots where solr is initialized, like get_solr()
Closes #7429
This PR adds support for trending scores to Solr, allowing us to better track which works are achieving a statistically notable increase in popularity. It adds several new fields and comes with two scripts, one run daily and the other hourly, to keep this information up to date.
It's still in draft mode, as there is currently no code to automatically run the scripts.
Technical
This implementation uses Solr's ability to update documents in place, which requires the new trending fields to be neither stored nor indexed, and instead treated as docValues. Essentially, they are left out of Solr's inverted index and kept as a more conventional document-to-value mapping.
This A) is more performant than atomic updates, and B) avoids the issues that atomic updates can have with copyField values.
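For reference, an in-place update uses the same JSON modifier syntax as an atomic update (`set` or `inc`); Solr applies it in place when the target field is a single-valued, non-indexed, non-stored numeric docValues field. Below is a minimal sketch of building such a payload. The function `build_in_place_update` is hypothetical and is not the `update_in_place` method added in this PR, and using `key` as the unique key field is an assumption based on the queries used elsewhere in this thread.

```python
import json


def build_in_place_update(work_key: str, fields: dict[str, int]) -> str:
    """Hypothetical helper: JSON body for Solr's /update handler.

    Each field gets a {"set": value} modifier, the syntax shared by
    atomic and in-place updates.
    """
    doc: dict[str, object] = {'key': work_key}  # assumption: `key` is the uniqueKey
    doc.update({name: {'set': value} for name, value in fields.items()})
    return json.dumps([doc])


payload = build_in_place_update(
    '/works/OL54120W',
    {'trending_score_hourly_0': 3},
)
print(payload)
```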
The relevant cron commands are located in an added file, `docker/cron.local`.
- [...] `docker compose up`.
- [...] `key:"/works/OL54120W"`), and check to ensure that the new fields are present.
- [...] `docker/cron.local` file to run the cron jobs in, along with a new container. Change the times on the cron tasks to run more frequently; (`* * * * *`) will make them run every minute.
- [...] `dbnet` and `webnet` networks, and has `depends on: db`.
Screenshot
Stakeholders
@cdrini