Skip to content

Conversation

@bradenmacdonald
Copy link

@bradenmacdonald bradenmacdonald commented Aug 27, 2025

Summary

This implements Typesense as a backend for edx-search, specifically to index and search:

  • The list of courses on the instance
  • The content of those courses

Testing Instructions

  1. Make sure the courseware.mfe_courseware_search waffle flag is enabled at http://local.openedx.io:8000/admin/waffle/flag/
  2. Before activating Typesense or this PR, go to http://local.openedx.io:8000/courses and verify that the course list search is working.
  3. Before activating Typesense or this PR, Test the courseware search:
    a. Open any course in the Learning MFE
    b. Click this button in the top right:
    Screenshot 2025-09-09 at 7 12 20 PM
    c. Note which courses/keywords/results that work, for comparison with Typesense later.
    d. Run a reindex if you're not seeing any results (see step 7 below)
  4. Configure tutor-contrib-typesense - be sure to use the latest main version.
  5. If you just installed that plugin, run tutor dev launch -I; or else if it was already installed just run tutor config save restart all your devstack containers.
  6. Create the new indexes ("collections"):
    tutor dev exec lms ./manage.py lms shell -c "import search.typesense; search.typesense.create_indexes()"
    
  7. Reindex your courseware content:
    tutor dev exec cms ./manage.py cms reindex_course --active
    
    (or --all)
  8. Inspect the schemas and confirm that the indexes were populated:
    tutor dev exec lms ./manage.py lms shell
    from search import typesense as ts
    c = ts.get_typesense_client()
    
    [(f['name'], f['type']) for f in c.collections['tutor_course_info'].retrieve()['fields']]
    c.collections['tutor_course_info'].retrieve()['num_documents']
    
    [(f['name'], f['type']) for f in c.collections['tutor_courseware_content'].retrieve()['fields']]
    c.collections['tutor_courseware_content'].retrieve()['num_documents']
  9. Test the courses search:
    a. Go to http://local.openedx.io:8000/courses
    b. Run some basic searches and make sure they seem to match the results from earlier in step 2.
    c. Test the faceting and filtering using the facet UI (e.g. filter by language, by org, by mode, or all three)
  10. Test the courseware search: repeat step 3, this time using Typesense. Make sure the results look similar.

Notes

  • Typesense does not allow filtering like && field IS NULL or && NOT(field EXISTS). This requires some hacks to implement optional fields (like start and end dates) properly. This seems to be a common complaint / pain point with Typesense.
  • Typsense text_match scores are too large to return as JSON numbers, which the other search engines do. For now scores are just ignored. We could still sort results by match score; we just can't return the match values to the MFE.
  • TypeSense requires us to specify which text fields the 'q' query will search, for ALL searches, even if we're not specifying a query text (i.e. even if we're just using filters).
  • Sorting of results was added in feat: [FC-86] add sorting feature for engines + enable_course_sorting_by_star… #205 but not used in either of the two use cases we're evaluating. (Presumably it will be used in the newer version of the course discovery MFE which is in early development.)
  • This PR does not include the faceting changes from feat: [FC-86] implement multivalue faceted search using Meilisearch and Elasticsearch #213
  • Typesense requires manually specifying which text fields to do "stemming" on. Meilisearch is vague about what kind of stemming they do, but it's done automatically.

More details on most of these points are in the code comments.

Private ref: BB-9976

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Aug 27, 2025
@openedx-webhooks
Copy link

openedx-webhooks commented Aug 27, 2025

Thanks for the pull request, @bradenmacdonald!

This repository is currently maintained by @Ali-Salman29.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details
Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

@openedx-webhooks openedx-webhooks added the core contributor PR author is a Core Contributor (who may or may not have write access to this repo). label Aug 27, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in Contributions Aug 27, 2025
@mphilbrick211 mphilbrick211 moved this from Needs Triage to Waiting on Author in Contributions Sep 3, 2025
@mphilbrick211 mphilbrick211 added the FC Relates to an Axim Funded Contribution project label Sep 3, 2025
@bradenmacdonald bradenmacdonald changed the title Typesense Backend Typesense Backend [FC-0091] Sep 8, 2025
@bradenmacdonald bradenmacdonald force-pushed the typesense branch 3 times, most recently from a07e40a to 058d595 Compare September 10, 2025 02:08
@samuelallan72 samuelallan72 self-requested a review September 11, 2025 02:38
@samuelallan72
Copy link

@bradenmacdonald

Configure tutor-contrib-typesense with this PR checked out:

I merged open-craft/tutor-contrib-typesense#4 , so we can use the main branch of tutor-contrib-typesense now. :)

Copy link

@samuelallan72 samuelallan72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Create the new indexes ("collections"):

I think this step is something we can add to tutor-contrib-typesense. Tutor does it for meilisearch - see init/lms.sh.

Or can we make the backend here simply create the index itself on init, like the elasticsearch backend?


It's shaping up nicely! I read through the code and did some playing around, and left several comments and thoughts for consideration. :)

@@ -0,0 +1,502 @@
"""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I noticed is that on my devstack, there were errors indexing the demo course, for both the meilisearch and typesense backends - it was complaining about missing video srt files.

With the meilisearch backend, the demo course was still indexed and I could search content in the course. However, with the typesense backend, search in the demo course contents returned no results at all.

I'm not sure where the source of the issue is though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that too, but I think it's also affecting the Meilisearch instance, as the problematic code is in edx-platform, not edx-search.

To test this:

  1. Switch back to Meilisearch (tutor plugins disable typesense, tutor dev start)
  2. Add this code to meilisearch.py:
def delete_indexes():
    """
    Delete the indexes
    """
    client = get_meilisearch_client()
    for index_name in INDEX_FILTERABLES.keys():
        meilisearch_index_name = get_meilisearch_index_name(index_name)
        print("Deleting ", meilisearch_index_name)
        try:
            index = client.get_index(meilisearch_index_name)
        except meilisearch.errors.MeilisearchApiError:
            continue
        task_info = index.delete()
        wait_for_task_to_succeed(client, task_info)
  1. Delete the indexes with tutor dev exec lms ./manage.py lms shell -c "import search.meilisearch; search.meilisearch.delete_indexes()" and re-create them with tutor dev exec lms ./manage.py lms shell -c "import search.meilisearch; search.meilisearch.create_indexes()".
  2. Re-populate the indexes with tutor dev exec cms ./manage.py cms reindex_course --active

When I do this, I see the same error about .srt files, and the search index is still empty for that course. Same behavior on Meilisearch and Typesense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah interesting! Maybe the course content was indexed one object at a time during the initial import of the demo course, so it did get mostly indexed.

That's probably a useful bug to raise on edx-platform then: make full reindexing more robust.

Comment on lines +55 to +61
# Pray that there are not datetime objects inside lists.
# If there are, they will be converted to str by the DocumentEncoder.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much of the schema for courseware do we know in advance?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is just copied from the meilisearch implementation. It's not causing issues there so I left it here. I don't know why they didn't just make it recurse into lists though - it's very easy to fix this problem.

I believe we know the whole schema except the contents of the content object field which can vary.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know this much of the schema, what do you think about declaring the full schema up front, to avoid the extra logic in create_indexes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only logic left in create_indexes is for handling the "extra" fields for datetime and nullable fields, and I think that's a sensible separation of concerns, so I'm inclined to leave it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok no worries. My thought was that we could declare the entire schema up front, declaratively, including the extra nullable and utf offset fields.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it, and I want to do that, but I want to do that as part of rewriting edx-search once we've decided which engines we'll support.

@bradenmacdonald
Copy link
Author

I think this step is something we can add to tutor-contrib-typesense. Tutor does it for meilisearch - see init/lms.sh.

Or can we make the backend here simply create the index itself on init, like the elasticsearch backend?

That's something I want to postpone until/unless we decide to merge Typesense support, and then yes we should do one of those things or even better make it a migration - see openedx/openedx-platform#36868

@mphilbrick211 mphilbrick211 moved this from Waiting on Author to In Eng Review in Contributions Sep 12, 2025
@mphilbrick211 mphilbrick211 moved this from In Eng Review to Waiting on Author in Contributions Sep 12, 2025
Copy link

@samuelallan72 samuelallan72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I compared search results between the Meilisearch and Typesense backends (for both course overview and course content searches). My observations:

  • Currently the Typesense backend returns a lot less results than the Meilisearch backend. I think this is mostly related to only searching in the "content" field. If I add the "location" field, the more results are returned. I would argue that it's more accurate only searching the "content" field though, and that's a bug in the Meilisearch backend. Including the location field includes many results that don't seem very relevant to the search, just because the section containing the component has the keywords. Not sure how we want to handle that?
  • The Typesense backend doesn't appear to be using any word stemming - eg. for the demo course, a search for "captivate" with the Typesense backend only returns one result, while the same search with the Meilisearch backend returns many results, because it pulls in content containing "captivating" too. I think word stemming is going to be important here, otherwise highly relevant content is easily missed. Looks like it needs to be manually enabled per-field for Typesense https://typesense.org/docs/29.0/api/stemming.html - can we do this here for the relevant fields?
  • The search results order does differ between the backends - I guess this is expected, as both engines likely use different methods for ranking. Maybe just something we can keep an eye on and tweak if we notice the ranking seems too far off?

Otherwise in general everything seemed to work fine in testing. I think it's pretty close to being ready for use in benchmarking, but I would like to resolve the comments about differing search results above, to ensure it's a 1:1 comparison as much as possible.

Comment on lines +55 to +61
# Pray that there are not datetime objects inside lists.
# If there are, they will be converted to str by the DocumentEncoder.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know this much of the schema, what do you think about declaring the full schema up front, to avoid the extra logic in create_indexes?

@samuelallan72
Copy link

samuelallan72 commented Sep 16, 2025

For reference, some notes on how I managed to get the demo course content to index successfully, and how I switched quickly between the backends:

tutor dev do importdemocourse
# fix the missing srt file that's breaking reindexing the demo course
tutor dev exec cms mkdir -p /openedx/media/video-transcripts/

# NOTE: this is nushell script
let missing_transcripts = [
  '058f8868e8264ac889ea22f287f665e2',
  '6480f116cdb54afa8272fe28fdd122eb',
  'ed47688e702e467e8d4eef40dd390825',
  '369f8e4bfa9d4369adcc24ba16932c2d',
  '841004754c0e4d00801984618249dfbe',
  'd644fb69d0414a71b6521dc25ae73bde',
]

for transcript in $missing_transcripts {
  tutor dev exec cms sh -c $'echo "1\n00:00:00,000 --> 00:00:02,500\nWelcome to the Example Subtitle File!" > /openedx/media/video-transcripts/($transcript).srt'
}

# reindex
tutor dev exec lms ./manage.py lms shell -c "import search.typesense; search.typesense.delete_indexes(); search.typesense.create_indexes()"
tutor dev exec cms ./manage.py cms reindex_course --active

To switch back to meilisearch for testing:

first add this to edx-search meilisearch.py:

# from Braden on https://github.com/openedx/edx-search/pull/222#discussion_r2350138484
def delete_indexes():
    """
    Delete the indexes
    """
    client = get_meilisearch_client()
    for index_name in INDEX_FILTERABLES.keys():
        meilisearch_index_name = get_meilisearch_index_name(index_name)
        print("Deleting ", meilisearch_index_name)
        try:
            index = client.get_index(meilisearch_index_name)
        except meilisearch.errors.MeilisearchApiError:
            continue
        task_info = index.delete()
        wait_for_task_to_succeed(client, task_info)

Then you can:

tutor plugins disable typesense
tutor dev exec lms ./manage.py lms shell -c "import search.meilisearch; search.meilisearch.delete_indexes(); search.meilisearch.create_indexes()"
tutor dev exec cms ./manage.py cms reindex_course --active

And to switch back again to typesense:

tutor plugins enable typesense
tutor dev exec lms ./manage.py lms shell -c "import search.typesense; search.typesense.delete_indexes(); search.typesense.create_indexes()"
tutor dev exec cms ./manage.py cms reindex_course --active

@bradenmacdonald
Copy link
Author

Currently the Typesense backend returns a lot less results than the Meilisearch backend. I think this is mostly related to only searching in the "content" field. If I add the "location" field, the more results are returned. I would argue that it's more accurate only searching the "content" field though, and that's a bug in the Meilisearch backend. Including the location field includes many results that don't seem very relevant to the search, just because the section containing the component has the keywords. Not sure how we want to handle that?

Yeah I agree. Today I had to run some unrelated tests with Elasticsearch and I noticed the same thing: Meilisearch returned way more results than Elasticsearch (or Typesense). So I too think there is a bug with how Meilisearch is configured/used. Since the Typesense as ES results seem similar and the extra meilisearch results are likely wrong, I think it's out of scope for this PR, but one of us should look into fixing it on the Meilisearch side if there's time.

BTW if you need to test Elasticsearch for any part of this project, I made it very easy: https://github.com/open-craft/tutor-contrib-elasticsearch-legacy

The search results order does differ between the backends - I guess this is expected, as both engines likely use different methods for ranking. Maybe just something we can keep an eye on and tweak if we notice the ranking seems too far off?

Sounds good.

some notes on how I managed to get the demo course content to index successfully

An even easier way is to go to http://studio.local.openedx.io:8001/admin/edxval/videotranscript/ and delete all the objects. To restore it (and get the broken video subtitles again), just run tutor dev do importdemocourse.

The Typesense backend doesn't appear to be using any word stemming - eg. for the demo course, a search for "captivate" with the Typesense backend only returns one result, while the same search with the Meilisearch backend returns many results, because it pulls in content containing "captivating" too. I think word stemming is going to be important here, otherwise highly relevant content is easily missed. Looks like it needs to be manually enabled per-field for Typesense https://typesense.org/docs/29.0/api/stemming.html - can we do this here for the relevant fields?

Thanks for pointing that out. Fixed in 0fc610f .

Comment on lines +81 to +87
# included in "content", which is possible (XBlocks can return
# whatever data they want from their `index_dictionary()` method).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we convert all fields in content to a string before indexing to avoid this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we encounter any errors when indexing content due to this, I'll add it, sure. Should be easy to add. But I'd prefer not to hard-code fixes for specific fields in the convert_doc_datatypes function if we don't have to.

@samuelallan72
Copy link

An even easier way is to go to http://studio.local.openedx.io:8001/admin/edxval/videotranscript/ and delete all the objects. To restore it (and get the broken video subtitles again), just run tutor dev do importdemocourse.

Ah nice, good to know thanks! :)

Thanks for adding stemming for the course content field. We may want stemming for the course and content fields for course overview too.

{"name": "language", "type": "string", "facet": True},
],
# Which fields to use for text matches. Required by Typesense.
"query_fields": ["course", "org", "number", "content"],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradenmacdonald sorry, can we add stemming for the course and content fields here too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is the course name? I think the course name should have stemming but the course portion of the org+course+number code should not, as it's more of an ID then a text field.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad, yeah the course field is the course ID, while the content field is an object with fields that includes the overview and display name. Maybe we can add stemming for the content field, similarly to the content field in the other index?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done: 4efeab7

Copy link

@samuelallan72 samuelallan72 Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. :)

The search behaviour is a little strange though - a search for "captivate" or "captivating" here finds no results. A search for "captivat" on the other hand does find them.

1758500541

I think this is good for benchmarking now though. :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it's a little hard to know what is covered by "stemming" and what is covered by "typo tolerance", and what is excluded due to short word length. All three are factors in any given thing. But I do think this should be more similar to Meilisearch now.

@samuelallan72
Copy link

@bradenmacdonald 👍 I'm happy with this for benchmarking purposes now, thanks!

  • I tested this: followed test instructions
  • I read through the code
  • I checked for accessibility issues
  • Includes documentation

Create the typesense index on an index request,
if it doesn't already exist.
This ensures that the typesense indexes are always auto-created
when they need to be.
@samuelallan72
Copy link

samuelallan72 commented Jan 20, 2026

@bradenmacdonald I added auto-creation of indexes. I implemented it by a method to check if they exist, and create if not, when there is a request to index documents.

I decided not to go the route of putting it in __init__ like the elasticsearch backend, because then this would cause an extra request to typesense every time a user makes a search. This way still has some wasted requests, but indexing is a slower background task, so it shouldn't have too much impact.

If/when we re-architecture edx-search, it would be good to add some init/startup method that can be called once to avoid wasted calls. Alternately, we could move this index creation to Tutor in the future, similar to what it does for the meilisearch index.

Please let me know what you think. :)

@bradenmacdonald
Copy link
Author

@samuelallan72 That sounds good to me - thanks! I'l test this out tomorrow.

@bradenmacdonald
Copy link
Author

@samuelallan72 I tested this again today with the automatic index creation (first I deleted all the data), and the API key creation and it's all working well. ✅

@bradenmacdonald
Copy link
Author

Since this is technically my PR, I can't review it. @samuelallan72 has already reviewed. @ormsbee would you be able to review too and give thumbs up so I can merge?

Note that one thing we didn't originally implement was tests for the new Typesense code, although the new utils code is covered. I forgot to account for this when estimating the additional work to merge this. If we get a bit more time, we can do that now or in a follow-up PR.

@mphilbrick211 mphilbrick211 requested a review from ormsbee January 23, 2026 13:40
@mphilbrick211 mphilbrick211 moved this from Waiting on Author to Ready for Review in Contributions Jan 23, 2026
Copy link

@ormsbee ormsbee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay. Thanks folks.

@samuelallan72
Copy link

thanks @ormsbee ! :)

I checked the failing test:

ERROR: test_aggregating_facet_narrowed_if_single_value_search (search.tests.test_course_discovery.TestMeilisearchCourseDiscoverySearch.test_aggregating_facet_narrowed_if_single_value_search)
...
meilisearch.errors.MeilisearchApiError: MeilisearchApiError. Error code: index_not_found. Error message: Index `test_index` not found. Error documentation: https://docs.meilisearch.com/errors#index_not_found Error type: invalid_request

It doesn't appear related to this PR, so perhaps it's just flaky. Would you mind retrying the CI? :)

@samuelallan72
Copy link

@bradenmacdonald also would you like me to squash the commits into one, or did you have thoughts for splitting into separate commits?

@bradenmacdonald
Copy link
Author

@samuelallan72 No, I prefer to squash on merge, and keep the full commit history in the GitHub PR. But please do rebase this or merge master into it, so that the tests run again, and we can get a green build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core contributor PR author is a Core Contributor (who may or may not have write access to this repo). FC Relates to an Axim Funded Contribution project open-source-contribution PR author is not from Axim or 2U

Projects

Status: Ready for Review

Development

Successfully merging this pull request may close these issues.

5 participants