Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
209 changes: 209 additions & 0 deletions GAD2.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,209 @@
import os
import re
from typing import TextIO, Optional, Union
from abc import ABC, abstractmethod
from Bio import SeqIO, SeqUtils
Comment on lines +1 to +5

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Правильно остортировано по происхождению бибилиотек, но нет лексикографического порядка.

Suggested change
import os
import re
from typing import TextIO, Optional, Union
from abc import ABC, abstractmethod
from Bio import SeqIO, SeqUtils
import os
import re
from abc import ABC, abstractmethod
from typing import TextIO, Optional, Union
from Bio import SeqIO, SeqUtils


class BiologicalSequence(ABC):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Отмечу здесь, потому что касается всех классов для последовательностей.

Нет аннотации типов.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Насколько я разобрался: абстрактный класс написан прям как надо

@abstractmethod
def __len__():
pass

def __getitem__():
pass

def __str__():
pass

def check_sequence():
pass

class NucleicAcidSequence(BiologicalSequence):
"""Proccising nucleic acid sequences.
This class is parental for DNASequence and RNASequence classes.
Comment on lines +22 to +23

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Докстринги - замечательно.

"""
def __init__(self, seq) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Обратил внимание: у классов аннотация типов только для ретёрна у инитов. Кажется, инитам она как раз не очень нужна. И точно для аргументов можно добавить.

Suggested change
def __init__(self, seq) -> None:
def __init__(self, seq: str) -> None:

self.seq = seq
Comment on lines +25 to +26

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Я не очень понимаю зачем тут писать "-> None" 🤔 хотя сам концепт мне нравится (первый раз с ним сталкиваюсь и возьму себе на вооружение)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__init__ должен возвращать None по определению, иначе питон бросит ошибку, поэтому да, в случае с __init__ так делать не надо. Но когда просто функция кастомная, то там такая аннотация будет полезна


def check_sequence(self):
"""
Checkes the correctnessis of input string (DNA or RNA).
"""
for nucleotide in self.seq:
if not nucleotide in type(self).complement_rule:
raise ValueError (f'Incorrect nucleotide in {self.seq}')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Функцию проверки лучше делать только True/False, а не кидать ошибку вместо False. Мы же делаем проверку специально, и для ошибочного сиквенса ожидаем от программы не ошибки, а именно ответа, что сиквенс неверный.

Suggested change
raise ValueError (f'Incorrect nucleotide in {self.seq}')
return False

return True

def complement(self):
"""
Create complement sequences of RNA or DNA acoording to the complementary base pairing rule.
"""
if self.check_sequence():
complement_sequence = ''
for nucleotide in self.seq:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Такую проверку можно сделать не в цикле, а просто проверив подмножество:

if set(self.seq).issubset(set(complement_rule)):

if nucleotide in type(self).complement_rule:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Здесь у экземпляра класса вызывается атрибут complement_rule, но в самом классе NucleicAcidSequence его нет. Кажется, это не очень корректно.
Предлагаю такой вариант: в NucleicAcidSequence ставим всем подобным атрибутам = None. Тогда проверку для этих атрибутов в методах делаем такую:

if self.complement_rule is None:
	raise NotImplementedError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Когда пишите код в комментах PR, добавляйте python после ``` - так синтаксис будет подсвечен как надо

complement_sequence += type(self).complement_rule[nucleotide]
return type(self)(complement_sequence)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Возвращает тот же класс - отлично!


def gc_content(self):
return (sum(1 for _ in re.finditer(r'[GCgc]', self.seq)))/self.__len__()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Оригинально! Ещё одно свидетельство того, как многогранен Пайтон.
Можно ещё такой вариант:

(self.seq.upper().count('G') + self.seq.upper().count('C')) / len(self.seq)

Suggested change
return (sum(1 for _ in re.finditer(r'[GCgc]', self.seq)))/self.__len__()
return (sum(1 for _ in re.finditer(r'[GCgc]', self.seq))) / self.__len__()

Comment on lines +48 to +49

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ничего плохого в этом способе нет, однако выглядит немного неинтуитивно что ли, сложно читат 🤧


def __len__(self):
return len(self.seq)

def __getitem__(self, key):
return self.seq[key]

def __str__(self):
return self.seq
Comment on lines +57 to +58

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Всё ок.
Ещё можно дописать __repr__, чтоб Юпитер совсем красиво печатал без print():

def __repr__(self):
    return type(self)(self.sequence)


class DNASequence(NucleicAcidSequence):
"""
Procissing DNA sequences.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Procissing DNA sequences.
Processing DNA sequences.

Argumemt: seq (str) - DNA sequence, letters case do not matter.
-Example nucl_seq = DNASequence('ATGCGC' OR 'ATgCgc' OR 'atgcgc).

Valid operations:
- transcribe - return transcibed sequence of DNA coding strand;
- complement - return complementary sequence according to complementary rule
- check sequence - is input sequence correct
- gc_content - return percent of GC in sequence

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Только в текущей реализации не %, а долю

Suggested change
- gc_content - return percent of GC in sequence
- gc_content - return GC-fraction of sequence

"""
complement_rule = {'a': 't', 'A': 'T', 't': 'a', 'T': 'A',
'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'}
Comment on lines +72 to +73

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Здорово, что переменная вынесена до объявления всех методов. Она для всего класса и не предполагается, что она будет изменяться у экземпляров. Лучше записать заглавными буквами.

Comment on lines +72 to +73

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Правильно, что учтены и прописные, и строчные


def __init__(self, seq: str) -> None:
super().__init__(seq = seq)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
super().__init__(seq = seq)
super().__init__(seq)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Возможно я чего-то не понимаю, но опять же не понятно зачем писать (seq = seq) 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

В целом явное лучше чем не явное, поэтому прописывать в какие аргументы какие значения мы передаем - звучит полезно


def transcribe(self):
"""
Return transcribed sequence of DNA acoording to the complementary base pairing rule.
Function can procced only DNA seqences.
"""
transcribed_sequence = ''
transcribed_sequence = self.seq.replace('t', 'u').replace('T', 'U')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Нет проверки на корректность. Если в сиквенсе будет буква, которой нет в алфавите, то она появится и в новой последовательности.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ну в целом проверка это не зона отвественности транскрайба

return type(self)(transcribed_sequence)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Вот тут можно сделать. Тогда транскрибированная ДНК превратится в РНК, и будет вообще круто.

Suggested change
return type(self)(transcribed_sequence)
return RNASequence(transcribed_sequence)


class RNASequence(NucleicAcidSequence):
"""
Procissing DNA sequences.
Argumemt: seq (str) - RNA sequence, letters case do not matter.
-Example nucl_seq = RNASequence('AUGCGC' OR 'AUgCgc' OR 'augcgc).

Valid operations:
- complement - return complementary sequence according to complementary rule
- check sequence - is input sequence correct
- gc_content - return percent of GC in sequence
"""
complement_rule = {'a': 'u', 'A': 'U', 'u': 'a', 'U': 'A',
'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'}

def __init__(self, seq) -> None:
super().__init__(seq = seq)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
super().__init__(seq = seq)
super().__init__(seq)


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Нет класса для белковых последовательностей 😿


def check_gc(fastq_read: str, gc_params: tuple) -> bool:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Здесь и дальше аннотация типов - отлично.

"""
Filters sequences in FASTQ file by GC percentage.

Input:
- fastq_read (str): FASTQ sequence read.
- gc_params (tuple): range for filtration, accepts upper or upper/lower border. Both borders are included.

Output:
Returns boolean value is this read satisfied the criteria.
"""
gc_result = SeqUtils.GC123(fastq_read)[0]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

А в чём суть именно GC123()? Я даже так и не понял, что она считает в остальных результатах %)

Suggested change
gc_result = SeqUtils.GC123(fastq_read)[0]
gc_result = SeqUtils.GC(fastq_read)

if gc_params[0] < gc_result < gc_params[1]:
return True


def check_length(fastq_read: str, len_bound_params: tuple) -> bool:
"""
Filters sequences in FASTQ file by sequence length.

Input:
- fastq_read (str): FASTQ sequence read.
- len_bound_params (tuple): range for filtration, accepts upper or upper/lower border. Both borders are included.

Output:
Returns boolean value is this read satisfied the criteria.
"""
len_of_seq = len(fastq_read)
if len_bound_params[0] < len_of_seq < len_bound_params[1]:
return True


def check_quality(fastq_quality: str, quality_params: int) -> bool:
"""
Filters sequences in FASTQ file by mean quality score.

Input:
- fastq_quality (str): FASTQ read quality
- quality_params (int): threshold value of reads quality in phred33 scale.

Output:
Returns boolean value is this read satisfied the criteria.
"""
mean_quality = sum(fastq_quality.letter_annotations["phred_quality"])/len(fastq_quality.seq)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mean_quality = sum(fastq_quality.letter_annotations["phred_quality"])/len(fastq_quality.seq)
mean_quality = sum(fastq_quality.letter_annotations["phred_quality"]) / len(fastq_quality.seq)

if mean_quality >= quality_params:
return True


def int_to_tuple(input_parameters) -> tuple:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Вынести это в отдельную функцию - очень правильное решение, по-моему, упрощает логику кода.

"""
Converts input parameters to tuple format.
If input is already a tuple, it will be return without changes.
If input parameter is int (only upper threshold is entered), function will return a tuple like (0, 'input').

Input:
- input_parameters (tuple or int).

Output:
Lower and upper threshold for functions in tuple format.
"""
if isinstance(input_parameters, tuple):
return input_parameters
return (0, input_parameters)


def run_fastq_filter(input_path: Optional[str] = None, output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

А зачем дефолтный None для input_path? Всё равно упадёт с ошибкой при попытке прочитать None, но если у необходимого атрибута не будет дефолтного значения и его забудут указать, ошибка будет другая, понятнее.

Suggested change
def run_fastq_filter(input_path: Optional[str] = None, output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO:
def run_fastq_filter(input_path: Optional[str], output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO:

"""
FASTQ Filtraror with BIOPYTHON utilities
Performs filter of input FASTQ file according to input parameters.
Input will be filtered by:
- GC content (gc_bounds);
- reads length (length_bounds);
- reads quality score (quality_threshold).

Input:
- input_path (str): path to .fastq file; include 4 strings: 1 - read ID, 2 - sequence, 3 - comment, 4 - quality. Default - None.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Кажется, описание спецификации fastq-формата уже лишнее)

Suggested change
- input_path (str): path to .fastq file; include 4 strings: 1 - read ID, 2 - sequence, 3 - comment, 4 - quality. Default - None.
- input_path (str): path to .fastq file.

- output_filename (str): name of output file, by default, it will be saved in the directory 'fastq_filtrator_resuls'. Default name will be name of input file.
- gc_bounds (tuple or int): GC content filter parameters, it accepts lower and upper (tuple), or only upper threshold value (int). Default value (0, 100).
- length_bounds (tuple or int): read length filter parameters, it accepts lower and upper (tuple), or only upper threshold value (int). Default value (0, 2**32).
- quality_threshold (int): upper quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Только не верхняя граница, а нижняя.

Suggested change
- quality_threshold (int): upper quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0.
- quality_threshold (int): lower quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0.


Output:
Returns FASTQ only with filtered reads which satisfied all input/default conditions.
"""

"Specify input path"
if not input_path.endswith('.fastq'):
raise ValueError('Incorrect input file extension, should be .fastq')
Comment on lines +191 to +192

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Здорово, что есть контроль входного файла!


"Specify output path"
output_path = 'fastq_filtrator_resuls'
if not os.path.exists(output_path):
os.makedirs(output_path)
if output_filename is None:
output_filename = os.path.basename(input_path)
"Passed the parameters"
gc_params = int_to_tuple(gc_bounds)
len_bound_params = int_to_tuple(length_bounds)
"Filter and record results"
filtererd_fastq = open(os.path.join(output_path, output_filename), mode='w')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Работать с файлами через with open надёжнее. Если, например, произойдёт ошибка перед filtererd_fastq.close(), файл не закроется.

for seq_record in SeqIO.parse(input_path, "fastq"):
if check_gc(seq_record.seq, gc_params) and check_length(seq_record.seq, len_bound_params) and check_quality(seq_record, quality_threshold):
SeqIO.write(seq_record, filtererd_fastq, "fastq")
filtererd_fastq.close()
Comment on lines +204 to +208

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Лучше использовать контекстный менеджер с оператором with, чтобы точно знать, что все закроется и не произойдет непредвиденного

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Даже в этом случае, согласен

return filtererd_fastq