gladstone-institutes/brisket

brisket

Fast, Cython-powered one-hot encoding for DNA sequences.

Installation

$ pip install brisket

Usage

import numpy as np
from brisket import encode_seq

# Encode a DNA sequence to one-hot format
dna_sequence = "ATCG"
encoded = encode_seq(dna_sequence)

print(encoded)
# Output: 2D NumPy array with shape (4, seq_length), channels first (PyTorch convention)
# [[1 0 0 0]  # A channel: A at position 0
#  [0 0 1 0]  # C channel: C at position 2
#  [0 0 0 1]  # G channel: G at position 3
#  [0 1 0 0]] # T channel: T at position 1

# The encoding uses channels-first format:
# - Row 0 = A channel, Row 1 = C channel, Row 2 = G channel, Row 3 = T channel
# - Each row represents one nucleotide type across all positions
# - Each column represents one position in the sequence

# Any character other than A, C, G, or T (for example, N) yields an all-zero column
encoded_with_n = encode_seq("ATCGN")  # Last column will be [0 0 0 0]
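
To make the layout described above concrete, here is a minimal pure-Python sketch of the same channels-first encoding, plus a decoder for round-trip checks. It is not brisket's implementation (which is written in Cython), and `reference_one_hot` and `reference_decode` are hypothetical helper names, not part of the package. The sketch assumes only uppercase A, C, G, and T are valid; how brisket treats lowercase input is not stated here.

```python
# Hypothetical reference helpers for illustration; not part of brisket's API.
CHANNELS = "ACGT"  # row order used in the example above: A, C, G, T


def reference_one_hot(seq):
    """Return a 4 x len(seq) channels-first one-hot matrix as nested lists."""
    row_of = {base: i for i, base in enumerate(CHANNELS)}
    matrix = [[0] * len(seq) for _ in CHANNELS]
    for pos, base in enumerate(seq):
        row = row_of.get(base)  # None for anything other than A, C, G, T
        if row is not None:
            matrix[row][pos] = 1
    return matrix


def reference_decode(matrix, unknown="N"):
    """Invert reference_one_hot; all-zero columns become `unknown`."""
    length = len(matrix[0]) if matrix else 0
    decoded = []
    for pos in range(length):
        column = [matrix[row][pos] for row in range(len(CHANNELS))]
        decoded.append(CHANNELS[column.index(1)] if 1 in column else unknown)
    return "".join(decoded)


print(reference_one_hot("ATCG"))
print(reference_decode(reference_one_hot("ATCGN")))
```

A slow reference like this can be handy in tests: comparing its nested lists against `encode_seq(...).tolist()` should agree if the package follows the layout documented above.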

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

brisket was created by Natalie Gill and Sean Whalen. It is licensed under the terms of the MIT license.

Credits

brisket was created with cookiecutter and the py-pkgs-cookiecutter template.
