Skip to content

REST service for sequence retrieval , reads subsequences from FASTA files using `htsjdk`

License

Notifications You must be signed in to change notification settings

VEuPathDB/service-sequence-retrieval

Repository files navigation

Sequence Retrieval Service

About This Service

This service is configured with a directory of reference sequences. It then receives requests containing coordinates of sequences, and replies with a .fasta.

Running Locally

The best (easiest) way to run locally is with docker compose. If you have a relatively recent version of Docker installed, this should work "out of the box" with a short wrapper script by using the following commands:

> cd docker-compose
> cp .env.sample .env
> ./runLocal.sh

This configuration will download and run the latest (tagged) image on docker hub and use the sample .fa and .fai files included in the repo.

Once up, you should be able to access the service locally on port 8080, e.g. GET http://localhost:8080/health

To point the service at your own files, change the FASTA_FILES_DIR environment variable in .env.

Example Queries

A number of queries, valid against the sample data, are provided and codified as curl commands wrapped in bash scripts. So view them and run one, execute (in a new terminal) e.g.

> ls -1 src/test/query
> bash src/test/query/query-genomic-bed.sh

Building Locally

To make and test changes to the service code, first try to build and run the service as-is. To do so, perform the following steps:

  1. Add your Github credentials into your environment.

  2. Build the code (note Java 17+ is required):

    > ./gradlew clean generate-jaxrs jar test
  3. If this succeeds, build a local docker image

    > ./gradlew build-docker
  4. Uncomment the line SEQUENCE_RETRIEVAL_IMAGE=sequence-retrieval in your .env file

  5. Run the service as above, i.e. cd docker-compose && ./runLocal.sh

Architecture Overview

Reference Files

For each reference, we provide a file with the sequences, as well as an index file describing where each sequence is in the file.

To index a fasta with the sequences called x.fa:

samtools faidx x.fa
scripts/index_to_sqlite3.sh x.fa.fai x.fa.fai.sqlite

We use SQLite because we may have so many sequences that the index is too large to keep in memory.

Query Format

The main query format is .bed. See the BED Wikipedia page for an overview.

Name column of .bed files should describe the sequences as they make sense to the user - for example, feature description. You can ask for exon sequence of a gene by using all 12 columns.

Dependencies

This service requires Java 17+ and Docker for local development.

The main library used for sequence lookups is htsjdk, which does a lot of work on the references and .bed features. It’s just a plain Java dependency; a few more methods have been copied and adapted.

About

REST service for sequence retrieval , reads subsequences from FASTA files using `htsjdk`

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 6