-
Notifications
You must be signed in to change notification settings - Fork 52
Typesense Backend [FC-0091] #222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thanks for the pull request, @bradenmacdonald! This repository is currently maintained by Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review. 🔘 Get product approvalIf you haven't already, check this list to see if your contribution needs to go through the product review process.
🔘 Provide contextTo help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:
🔘 Get a green buildIf one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green. DetailsWhere can I find more information?If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources: When can I expect my changes to be merged?Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:
💡 As a result it may take up to several weeks or months to complete a review and merge your PR. |
a07e40a to
058d595
Compare
I merged open-craft/tutor-contrib-typesense#4 , so we can use the |
samuelallan72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Create the new indexes ("collections"):
I think this step is something we can add to tutor-contrib-typesense. Tutor does it for meilisearch - see init/lms.sh.
Or can we make the backend here simply create the index itself on init, like the elasticsearch backend?
It's shaping up nicely! I read through the code and did some playing around, and left several comments and thoughts for consideration. :)
| @@ -0,0 +1,502 @@ | |||
| """ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I noticed is that on my devstack, there were errors indexing the demo course, for both the meilisearch and typesense backends - it was complaining about missing video srt files.
With the meilisearch backend, the demo course was still indexed and I could search content in the course. However, with the typesense backend, search in the demo course contents returned no results at all.
I'm not sure where the source of the issue is though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that too, but I think it's also affecting the Meilisearch instance, as the problematic code is in edx-platform, not edx-search.
To test this:
- Switch back to Meilisearch (
tutor plugins disable typesense,tutor dev start) - Add this code to
meilisearch.py:
def delete_indexes():
"""
Delete the indexes
"""
client = get_meilisearch_client()
for index_name in INDEX_FILTERABLES.keys():
meilisearch_index_name = get_meilisearch_index_name(index_name)
print("Deleting ", meilisearch_index_name)
try:
index = client.get_index(meilisearch_index_name)
except meilisearch.errors.MeilisearchApiError:
continue
task_info = index.delete()
wait_for_task_to_succeed(client, task_info)- Delete the indexes with
tutor dev exec lms ./manage.py lms shell -c "import search.meilisearch; search.meilisearch.delete_indexes()"and re-create them withtutor dev exec lms ./manage.py lms shell -c "import search.meilisearch; search.meilisearch.create_indexes()". - Re-populate the indexes with
tutor dev exec cms ./manage.py cms reindex_course --active
When I do this, I see the same error about .srt files, and the search index is still empty for that course. Same behavior on Meilisearch and Typesense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah interesting! Maybe the course content was indexed one object at a time during the initial import of the demo course, so it did get mostly indexed.
That's probably a useful bug to raise on edx-platform then: make full reindexing more robust.
| # Pray that there are not datetime objects inside lists. | ||
| # If there are, they will be converted to str by the DocumentEncoder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much of the schema for courseware do we know in advance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is just copied from the meilisearch implementation. It's not causing issues there so I left it here. I don't know why they didn't just make it recurse into lists though - it's very easy to fix this problem.
I believe we know the whole schema except the contents of the content object field which can vary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we know this much of the schema, what do you think about declaring the full schema up front, to avoid the extra logic in create_indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only logic left in create_indexes is for handling the "extra" fields for datetime and nullable fields, and I think that's a sensible separation of concerns, so I'm inclined to leave it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok no worries. My thought was that we could declare the entire schema up front, declaratively, including the extra nullable and utf offset fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it, and I want to do that, but I want to do that as part of rewriting edx-search once we've decided which engines we'll support.
That's something I want to postpone until/unless we decide to merge Typesense support, and then yes we should do one of those things or even better make it a migration - see openedx/openedx-platform#36868 |
dd37975 to
a82bf4f
Compare
samuelallan72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I compared search results between the Meilisearch and Typesense backends (for both course overview and course content searches). My observations:
- Currently the Typesense backend returns a lot less results than the Meilisearch backend. I think this is mostly related to only searching in the "content" field. If I add the "location" field, the more results are returned. I would argue that it's more accurate only searching the "content" field though, and that's a bug in the Meilisearch backend. Including the location field includes many results that don't seem very relevant to the search, just because the section containing the component has the keywords. Not sure how we want to handle that?
- The Typesense backend doesn't appear to be using any word stemming - eg. for the demo course, a search for "captivate" with the Typesense backend only returns one result, while the same search with the Meilisearch backend returns many results, because it pulls in content containing "captivating" too. I think word stemming is going to be important here, otherwise highly relevant content is easily missed. Looks like it needs to be manually enabled per-field for Typesense https://typesense.org/docs/29.0/api/stemming.html - can we do this here for the relevant fields?
- The search results order does differ between the backends - I guess this is expected, as both engines likely use different methods for ranking. Maybe just something we can keep an eye on and tweak if we notice the ranking seems too far off?
Otherwise in general everything seemed to work fine in testing. I think it's pretty close to being ready for use in benchmarking, but I would like to resolve the comments about differing search results above, to ensure it's a 1:1 comparison as much as possible.
| # Pray that there are not datetime objects inside lists. | ||
| # If there are, they will be converted to str by the DocumentEncoder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we know this much of the schema, what do you think about declaring the full schema up front, to avoid the extra logic in create_indexes?
|
For reference, some notes on how I managed to get the demo course content to index successfully, and how I switched quickly between the backends: tutor dev do importdemocourse
# fix the missing srt file that's breaking reindexing the demo course
tutor dev exec cms mkdir -p /openedx/media/video-transcripts/
# NOTE: this is nushell script
let missing_transcripts = [
'058f8868e8264ac889ea22f287f665e2',
'6480f116cdb54afa8272fe28fdd122eb',
'ed47688e702e467e8d4eef40dd390825',
'369f8e4bfa9d4369adcc24ba16932c2d',
'841004754c0e4d00801984618249dfbe',
'd644fb69d0414a71b6521dc25ae73bde',
]
for transcript in $missing_transcripts {
tutor dev exec cms sh -c $'echo "1\n00:00:00,000 --> 00:00:02,500\nWelcome to the Example Subtitle File!" > /openedx/media/video-transcripts/($transcript).srt'
}
# reindex
tutor dev exec lms ./manage.py lms shell -c "import search.typesense; search.typesense.delete_indexes(); search.typesense.create_indexes()"
tutor dev exec cms ./manage.py cms reindex_course --activeTo switch back to meilisearch for testing: first add this to edx-search meilisearch.py: # from Braden on https://github.com/openedx/edx-search/pull/222#discussion_r2350138484
def delete_indexes():
"""
Delete the indexes
"""
client = get_meilisearch_client()
for index_name in INDEX_FILTERABLES.keys():
meilisearch_index_name = get_meilisearch_index_name(index_name)
print("Deleting ", meilisearch_index_name)
try:
index = client.get_index(meilisearch_index_name)
except meilisearch.errors.MeilisearchApiError:
continue
task_info = index.delete()
wait_for_task_to_succeed(client, task_info)Then you can: tutor plugins disable typesense
tutor dev exec lms ./manage.py lms shell -c "import search.meilisearch; search.meilisearch.delete_indexes(); search.meilisearch.create_indexes()"
tutor dev exec cms ./manage.py cms reindex_course --activeAnd to switch back again to typesense: tutor plugins enable typesense
tutor dev exec lms ./manage.py lms shell -c "import search.typesense; search.typesense.delete_indexes(); search.typesense.create_indexes()"
tutor dev exec cms ./manage.py cms reindex_course --active |
Yeah I agree. Today I had to run some unrelated tests with Elasticsearch and I noticed the same thing: Meilisearch returned way more results than Elasticsearch (or Typesense). So I too think there is a bug with how Meilisearch is configured/used. Since the Typesense as ES results seem similar and the extra meilisearch results are likely wrong, I think it's out of scope for this PR, but one of us should look into fixing it on the Meilisearch side if there's time. BTW if you need to test Elasticsearch for any part of this project, I made it very easy: https://github.com/open-craft/tutor-contrib-elasticsearch-legacy
Sounds good.
An even easier way is to go to http://studio.local.openedx.io:8001/admin/edxval/videotranscript/ and delete all the objects. To restore it (and get the broken video subtitles again), just run
Thanks for pointing that out. Fixed in 0fc610f . |
| # included in "content", which is possible (XBlocks can return | ||
| # whatever data they want from their `index_dictionary()` method). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we convert all fields in content to a string before indexing to avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we encounter any errors when indexing content due to this, I'll add it, sure. Should be easy to add. But I'd prefer not to hard-code fixes for specific fields in the convert_doc_datatypes function if we don't have to.
Ah nice, good to know thanks! :) Thanks for adding stemming for the course content field. We may want stemming for the |
| {"name": "language", "type": "string", "facet": True}, | ||
| ], | ||
| # Which fields to use for text matches. Required by Typesense. | ||
| "query_fields": ["course", "org", "number", "content"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bradenmacdonald sorry, can we add stemming for the course and content fields here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is the course name? I think the course name should have stemming but the course portion of the org+course+number code should not, as it's more of an ID then a text field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah my bad, yeah the course field is the course ID, while the content field is an object with fields that includes the overview and display name. Maybe we can add stemming for the content field, similarly to the content field in the other index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done: 4efeab7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it's a little hard to know what is covered by "stemming" and what is covered by "typo tolerance", and what is excluded due to short word length. All three are factors in any given thing. But I do think this should be more similar to Meilisearch now.
|
@bradenmacdonald 👍 I'm happy with this for benchmarking purposes now, thanks!
|
Co-authored-by: Samuel Allan <[email protected]>
4580518 to
1b8601b
Compare
Create the typesense index on an index request, if it doesn't already exist. This ensures that the typesense indexes are always auto-created when they need to be.
|
@bradenmacdonald I added auto-creation of indexes. I implemented it by a method to check if they exist, and create if not, when there is a request to index documents. I decided not to go the route of putting it in If/when we re-architecture edx-search, it would be good to add some init/startup method that can be called once to avoid wasted calls. Alternately, we could move this index creation to Tutor in the future, similar to what it does for the meilisearch index. Please let me know what you think. :) |
|
@samuelallan72 That sounds good to me - thanks! I'l test this out tomorrow. |
|
@samuelallan72 I tested this again today with the automatic index creation (first I deleted all the data), and the API key creation and it's all working well. ✅ |
|
Since this is technically my PR, I can't review it. @samuelallan72 has already reviewed. @ormsbee would you be able to review too and give thumbs up so I can merge? Note that one thing we didn't originally implement was tests for the new Typesense code, although the new |
ormsbee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay. Thanks folks.
|
thanks @ormsbee ! :) I checked the failing test: ERROR: test_aggregating_facet_narrowed_if_single_value_search (search.tests.test_course_discovery.TestMeilisearchCourseDiscoverySearch.test_aggregating_facet_narrowed_if_single_value_search)
...
meilisearch.errors.MeilisearchApiError: MeilisearchApiError. Error code: index_not_found. Error message: Index `test_index` not found. Error documentation: https://docs.meilisearch.com/errors#index_not_found Error type: invalid_requestIt doesn't appear related to this PR, so perhaps it's just flaky. Would you mind retrying the CI? :) |
|
@bradenmacdonald also would you like me to squash the commits into one, or did you have thoughts for splitting into separate commits? |
|
@samuelallan72 No, I prefer to squash on merge, and keep the full commit history in the GitHub PR. But please do rebase this or merge master into it, so that the tests run again, and we can get a green build. |

Summary
This implements Typesense as a backend for
edx-search, specifically to index and search:Testing Instructions
courseware.mfe_courseware_searchwaffle flag is enabled at http://local.openedx.io:8000/admin/waffle/flag/a. Open any course in the Learning MFE
b. Click this button in the top right:
c. Note which courses/keywords/results that work, for comparison with Typesense later.
d. Run a reindex if you're not seeing any results (see step 7 below)
mainversion.tutor dev launch -I; or else if it was already installed just runtutor config saverestart all your devstack containers.--all)a. Go to http://local.openedx.io:8000/courses
b. Run some basic searches and make sure they seem to match the results from earlier in step 2.
c. Test the faceting and filtering using the facet UI (e.g. filter by language, by org, by mode, or all three)
Notes
&& field IS NULLor&& NOT(field EXISTS). This requires some hacks to implement optional fields (like start and end dates) properly. This seems to be a common complaint / pain point with Typesense.text_matchscores are too large to return as JSON numbers, which the other search engines do. For now scores are just ignored. We could still sort results by match score; we just can't return the match values to the MFE.More details on most of these points are in the code comments.
Private ref: BB-9976