Migrate from MySQL to OpenSearch for proteins #76

pverscha · 2025-04-15T09:35:39Z

This PR changes the backend used for keeping track of all UniProt accessions from MySQL (MariaDB) to OpenSearch.

OpenSearch is a document-based search engine that is optimized for quickly retrieving the documents that match a set of text-based query parameters. I've configured OpenSearch so that the UniProt accession number of each protein is used as the document ID (which means that entries can still be retrieved directly by ID), while the other fields (name, taxon ID, accession number) are indexed so that they can be searched relatively fast.
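As a rough illustration of the "accession number as document ID" setup, the sketch below builds newline-delimited JSON payloads for the OpenSearch `_bulk` endpoint in batches of 1000. It is not the upload script from this PR: the index name `uniprot_entries` and the field `uniprot_accession_number` are taken from the commit titles, while the helper names, the `taxon_id` field name, and the batch handling are assumptions made for the example.

```python
import json
from itertools import islice

# Index name as it appears in the commit titles; adjust if the real index differs.
INDEX_NAME = "uniprot_entries"


def bulk_lines(proteins):
    """Yield NDJSON lines for the _bulk API, using the accession as the document ID."""
    for protein in proteins:
        action = {
            "index": {
                "_index": INDEX_NAME,
                "_id": protein["uniprot_accession_number"],
            }
        }
        yield json.dumps(action)   # action line
        yield json.dumps(protein)  # document source line


def batched(iterable, size):
    """Split an iterable into lists of at most `size` items (hypothetical helper)."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def bulk_payloads(proteins, batch_size=1000):
    """Yield one request body per batch; the _bulk API expects a trailing newline."""
    for chunk in batched(proteins, batch_size):
        yield "\n".join(bulk_lines(chunk)) + "\n"
```

Each payload could then be sent with `Content-Type: application/x-ndjson` to the `_bulk` endpoint of the OpenSearch instance. Because the `_id` is set explicitly, re-uploading the same accession overwrites the existing document instead of creating a duplicate.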

This change from MySQL to OpenSearch was introduced because the FULLTEXT index in MySQL that we introduced earlier (for the name and uniprot_accession_number columns of proteins) is still not sufficiently performant for our use case.
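For context, a text filter over the two columns that the FULLTEXT index used to cover could be expressed in the OpenSearch query DSL roughly as follows. This is a minimal sketch and not the query used by the API: the function name and the `size` default are hypothetical, and the use of a `multi_match` query is an assumption.

```python
def protein_filter_query(text, size=10):
    """Build a search request body that matches `text` against name and accession.

    Hypothetical helper: the real API may use a different query type or fields.
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["name", "uniprot_accession_number"],
            }
        },
    }
```

The returned dictionary would be serialized to JSON and posted to the `_search` endpoint of the index, while a lookup by accession number can skip the search entirely and fetch the document by its ID.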

I've added a new guide on the Unipept Wiki that explains how OpenSearch can be installed and deployed on our API-servers.

For reference, the whole index takes up 110 GiB of disk space and works sufficiently well with 20 GiB of RAM.

pverscha added 22 commits April 10, 2025 11:09

Add OpenSearch index for UniProt entries

dc9a80c

Merge branch 'master' into feature/opensearch-proteins

263c5e7

Implement logic to upload proteins to OpenSearch instance

e3da305

Update executable permissions of initialize_opensearch.sh

51debe8

Mark protein field as non-indexable

4af369d

Fix issue with counter

ee67736

Speed up tsv-to-json conversion

b7686bd

Fix jq command

605e5ac

Improve efficiency of upload command

47842e2

Send chunks of length 1000 to curl

cff937c

Split and use filter command

60d5ef4

Try with xargs

c5fc2b3

Try with xargs

e95b03f

Try with Python

75d8d26

Update PYthon code

a0d2549

updates

44d7aee

Upload in batches using Python

ffd92d4

Fixed tried cutting from binary file

e03d006

Add functional annotations to uniprot_entries index

02928fc

Use uniprot_accession_number as document ID

fa7260c

Wrong field name

8ffed51

Update index schema

ef5b97d

pverscha added the enhancement label Apr 15, 2025

pverscha requested a review from tibvdm April 15, 2025 09:35

pverscha self-assigned this Apr 15, 2025

tibvdm approved these changes Apr 23, 2025

tibvdm merged commit 128280c into master Apr 23, 2025
6 checks passed

pverscha mentioned this pull request Apr 23, 2025

Migrate to OpenSearch for better protein filter performance unipept/unipept-api#75

Merged
