Fast, Cython-powered one-hot encoding for DNA sequences.
$ pip install brisket
import numpy as np
from brisket import encode_seq
# Encode a DNA sequence to one-hot format
dna_sequence = "ATCG"
encoded = encode_seq(dna_sequence)
print(encoded)
# Output: 2D NumPy array with shape (4, seq_length), channels-first (PyTorch convention)
# [[1 0 0 0]   # A channel: A at position 0
#  [0 0 1 0]   # C channel: C at position 2
#  [0 0 0 1]   # G channel: G at position 3
#  [0 1 0 0]]  # T channel: T at position 1
# The encoding uses channels-first format:
# - Row 0 = A channel, Row 1 = C channel, Row 2 = G channel, Row 3 = T channel
# - Each row represents one nucleotide type across all positions
# - Each column represents one position in the sequence
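
The channels-first layout described above can be reproduced in plain NumPy, which may help when checking results by hand. This is only an illustrative sketch, not brisket's Cython implementation: the function name `one_hot_reference`, the `uint8` dtype, and the treatment of lowercase letters as invalid are all assumptions made for this example.

```python
import numpy as np

# Row order follows the layout described above: A, C, G, T (channels-first).
CHANNELS = "ACGT"


def one_hot_reference(seq: str) -> np.ndarray:
    """Hypothetical pure-NumPy reference encoder (not part of brisket).

    Returns an array of shape (4, len(seq)). Any character other than
    uppercase A, C, G, or T leaves its column all zeros.
    """
    # dtype is an assumption; brisket's actual output dtype may differ.
    encoded = np.zeros((len(CHANNELS), len(seq)), dtype=np.uint8)
    for pos, base in enumerate(seq):
        row = CHANNELS.find(base)  # -1 when the character is not A/C/G/T
        if row >= 0:
            encoded[row, pos] = 1
    return encoded


reference = one_hot_reference("ATCGN")
print(reference.shape)  # four channel rows, one column per position
print(reference)
```

A loop like this is far slower than a compiled encoder, so it is best suited to spot-checking small inputs.
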
# Invalid characters (not A, T, C, G) result in all-zero columns
encoded_with_n = encode_seq("ATCGN")  # Last column will be [0 0 0 0]
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
brisket was created by Natalie Gill and Sean Whalen. It is licensed under the terms of the MIT license.
brisket was created with cookiecutter and the py-pkgs-cookiecutter template.