@pverscha pverscha commented Apr 15, 2025

This PR changes the backend used for keeping track of all UniProt accessions from MySQL (MariaDB) to OpenSearch.

OpenSearch is a document-based search engine that is optimized for quickly retrieving the documents that match a set of text-based query parameters. I've configured OpenSearch so that the UniProt accession number of each protein is used as the document ID, which means that entries can still be retrieved directly by ID, while the other fields (name, taxon ID, accession number) are also indexed so that they can be searched relatively fast.
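To make the setup above more concrete, here is a minimal sketch of what the request bodies could look like, built with the Python standard library only. It is not the configuration used in this PR: the index name `uniprot_entries`, the field name `taxon_id`, the field types, and the helper function names are all assumptions made for illustration. Only the idea of using the accession number as the document ID and searching on the name and accession number comes from the description.

```python
import json

# Hypothetical index name, not taken from the PR.
INDEX_NAME = "uniprot_entries"


def index_mapping():
    """Assumed mapping: exact-match accession, full-text name, numeric taxon ID."""
    return {
        "mappings": {
            "properties": {
                "uniprot_accession_number": {"type": "keyword"},
                "name": {"type": "text"},
                "taxon_id": {"type": "integer"},  # assumed field name
            }
        }
    }


def bulk_body(entries):
    """Build an NDJSON body for the _bulk API, using the accession as _id."""
    lines = []
    for entry in entries:
        action = {
            "index": {
                "_index": INDEX_NAME,
                "_id": entry["uniprot_accession_number"],
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(entry))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def search_body(text, taxon_id=None, size=10):
    """Build a query that matches text against name and accession number."""
    query = {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": text,
                        "fields": ["name", "uniprot_accession_number"],
                    }
                }
            ]
        }
    }
    if taxon_id is not None:
        # Optional exact filter on the (assumed) taxon_id field.
        query["bool"]["filter"] = [{"term": {"taxon_id": taxon_id}}]
    return {"size": size, "query": query}
```

Under these assumptions, the output of `bulk_body` would be sent to the `_bulk` endpoint and the output of `search_body` to `/uniprot_entries/_search`, while a single entry could be fetched by ID with `GET /uniprot_entries/_doc/<accession>`.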

We switched from MySQL to OpenSearch because the FULLTEXT index that we introduced earlier in MySQL (for the name and uniprot_accession_number columns of proteins) is still not fast enough for our use case.

I've added a new guide on the Unipept Wiki that explains how OpenSearch can be installed and deployed on our API servers.

For reference, the whole index takes up 110 GiB of disk space and works sufficiently well with 20 GiB of RAM.

@pverscha pverscha added the enhancement (New feature or request) label Apr 15, 2025
@pverscha pverscha requested a review from tibvdm April 15, 2025 09:35
@pverscha pverscha self-assigned this Apr 15, 2025
@tibvdm tibvdm merged commit 128280c into master Apr 23, 2025
6 checks passed