-
Notifications
You must be signed in to change notification settings - Fork 0
Review GAD2 #24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Review GAD2 #24
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,209 @@ | ||||||
| import os | ||||||
| import re | ||||||
| from typing import TextIO, Optional, Union | ||||||
| from abc import ABC, abstractmethod | ||||||
| from Bio import SeqIO, SeqUtils | ||||||
|
|
||||||
| class BiologicalSequence(ABC): | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Нет аннотации типов. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Насколько я разобрался: абстрактный класс написан прям как надо |
||||||
| @abstractmethod | ||||||
| def __len__(): | ||||||
| pass | ||||||
|
|
||||||
| def __getitem__(): | ||||||
| pass | ||||||
|
|
||||||
| def __str__(): | ||||||
| pass | ||||||
|
|
||||||
| def check_sequence(): | ||||||
| pass | ||||||
|
|
||||||
| class NucleicAcidSequence(BiologicalSequence): | ||||||
| """Proccising nucleic acid sequences. | ||||||
| This class is parental for DNASequence and RNASequence classes. | ||||||
|
Comment on lines
+22
to
+23
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Докстринги - замечательно. |
||||||
| """ | ||||||
| def __init__(self, seq) -> None: | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Обратил внимание: у классов аннотация типов только для ретёрна у инитов. Кажется, инитам она как раз не очень нужна. И точно для аргументов можно добавить.
Suggested change
|
||||||
| self.seq = seq | ||||||
|
Comment on lines
+25
to
+26
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Я не очень понимаю зачем тут писать "-> None" 🤔 хотя сам концепт мне нравится (первый раз с ним сталкиваюсь и возьму себе на вооружение)
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||||||
|
|
||||||
| def check_sequence(self): | ||||||
| """ | ||||||
| Checkes the correctnessis of input string (DNA or RNA). | ||||||
| """ | ||||||
| for nucleotide in self.seq: | ||||||
| if not nucleotide in type(self).complement_rule: | ||||||
| raise ValueError (f'Incorrect nucleotide in {self.seq}') | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Функцию проверки лучше делать только
Suggested change
|
||||||
| return True | ||||||
|
|
||||||
| def complement(self): | ||||||
| """ | ||||||
| Create complement sequences of RNA or DNA acoording to the complementary base pairing rule. | ||||||
| """ | ||||||
| if self.check_sequence(): | ||||||
| complement_sequence = '' | ||||||
| for nucleotide in self.seq: | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Такую проверку можно сделать не в цикле, а просто проверив подмножество:
|
||||||
| if nucleotide in type(self).complement_rule: | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Здесь у экземпляра класса вызывается атрибут complement_rule, но в самом классе
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Когда пишите код в комментах PR, добавляйте |
||||||
| complement_sequence += type(self).complement_rule[nucleotide] | ||||||
| return type(self)(complement_sequence) | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Возвращает тот же класс - отлично! |
||||||
|
|
||||||
| def gc_content(self): | ||||||
| return (sum(1 for _ in re.finditer(r'[GCgc]', self.seq)))/self.__len__() | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Оригинально! Ещё одно свидетельство того, как многогранен Пайтон.
Suggested change
Comment on lines
+48
to
+49
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ничего плохого в этом способе нет, однако выглядит немного неинтуитивно что ли, сложно читат 🤧 |
||||||
|
|
||||||
| def __len__(self): | ||||||
| return len(self.seq) | ||||||
|
|
||||||
| def __getitem__(self, key): | ||||||
| return self.seq[key] | ||||||
|
|
||||||
| def __str__(self): | ||||||
| return self.seq | ||||||
|
Comment on lines
+57
to
+58
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Всё ок. |
||||||
|
|
||||||
| class DNASequence(NucleicAcidSequence): | ||||||
| """ | ||||||
| Procissing DNA sequences. | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| Argumemt: seq (str) - DNA sequence, letters case do not matter. | ||||||
| -Example nucl_seq = DNASequence('ATGCGC' OR 'ATgCgc' OR 'atgcgc). | ||||||
|
|
||||||
| Valid operations: | ||||||
| - transcribe - return transcibed sequence of DNA coding strand; | ||||||
| - complement - return complementary sequence according to complementary rule | ||||||
| - check sequence - is input sequence correct | ||||||
| - gc_content - return percent of GC in sequence | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Только в текущей реализации не %, а долю
Suggested change
|
||||||
| """ | ||||||
| complement_rule = {'a': 't', 'A': 'T', 't': 'a', 'T': 'A', | ||||||
| 'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'} | ||||||
|
Comment on lines
+72
to
+73
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Здорово, что переменная вынесена до объявления всех методов. Она для всего класса и не предполагается, что она будет изменяться у экземпляров. Лучше записать заглавными буквами.
Comment on lines
+72
to
+73
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Правильно, что учтены и прописные, и строчные |
||||||
|
|
||||||
| def __init__(self, seq: str) -> None: | ||||||
| super().__init__(seq = seq) | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Возможно я чего-то не понимаю, но опять же не понятно зачем писать (seq = seq) 🤔
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. В целом явное лучше чем не явное, поэтому прописывать в какие аргументы какие значения мы передаем - звучит полезно |
||||||
|
|
||||||
| def transcribe(self): | ||||||
| """ | ||||||
| Return transcribed sequence of DNA acoording to the complementary base pairing rule. | ||||||
| Function can procced only DNA seqences. | ||||||
| """ | ||||||
| transcribed_sequence = '' | ||||||
| transcribed_sequence = self.seq.replace('t', 'u').replace('T', 'U') | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Нет проверки на корректность. Если в сиквенсе будет буква, которой нет в алфавите, то она появится и в новой последовательности.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ну в целом проверка это не зона отвественности транскрайба |
||||||
| return type(self)(transcribed_sequence) | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Вот тут можно сделать. Тогда транскрибированная ДНК превратится в РНК, и будет вообще круто.
Suggested change
|
||||||
|
|
||||||
| class RNASequence(NucleicAcidSequence): | ||||||
| """ | ||||||
| Procissing DNA sequences. | ||||||
| Argumemt: seq (str) - RNA sequence, letters case do not matter. | ||||||
| -Example nucl_seq = RNASequence('AUGCGC' OR 'AUgCgc' OR 'augcgc). | ||||||
|
|
||||||
| Valid operations: | ||||||
| - complement - return complementary sequence according to complementary rule | ||||||
| - check sequence - is input sequence correct | ||||||
| - gc_content - return percent of GC in sequence | ||||||
| """ | ||||||
| complement_rule = {'a': 'u', 'A': 'U', 'u': 'a', 'U': 'A', | ||||||
| 'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'} | ||||||
|
|
||||||
| def __init__(self, seq) -> None: | ||||||
| super().__init__(seq = seq) | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
|
||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Нет класса для белковых последовательностей 😿 |
||||||
|
|
||||||
| def check_gc(fastq_read: str, gc_params: tuple) -> bool: | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Здесь и дальше аннотация типов - отлично. |
||||||
| """ | ||||||
| Filters sequences in FASTQ file by GC percentage. | ||||||
|
|
||||||
| Input: | ||||||
| - fastq_read (str): FASTQ sequence read. | ||||||
| - gc_params (tuple): range for filtration, accepts upper or upper/lower border. Both borders are included. | ||||||
|
|
||||||
| Output: | ||||||
| Returns boolean value is this read satisfied the criteria. | ||||||
| """ | ||||||
| gc_result = SeqUtils.GC123(fastq_read)[0] | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. А в чём суть именно GC123()? Я даже так и не понял, что она считает в остальных результатах %)
Suggested change
|
||||||
| if gc_params[0] < gc_result < gc_params[1]: | ||||||
| return True | ||||||
|
|
||||||
|
|
||||||
| def check_length(fastq_read: str, len_bound_params: tuple) -> bool: | ||||||
| """ | ||||||
| Filters sequences in FASTQ file by sequence length. | ||||||
|
|
||||||
| Input: | ||||||
| - fastq_read (str): FASTQ sequence read. | ||||||
| - len_bound_params (tuple): range for filtration, accepts upper or upper/lower border. Both borders are included. | ||||||
|
|
||||||
| Output: | ||||||
| Returns boolean value is this read satisfied the criteria. | ||||||
| """ | ||||||
| len_of_seq = len(fastq_read) | ||||||
| if len_bound_params[0] < len_of_seq < len_bound_params[1]: | ||||||
| return True | ||||||
|
|
||||||
|
|
||||||
| def check_quality(fastq_quality: str, quality_params: int) -> bool: | ||||||
| """ | ||||||
| Filters sequences in FASTQ file by mean quality score. | ||||||
|
|
||||||
| Input: | ||||||
| - fastq_quality (str): FASTQ read quality | ||||||
| - quality_params (int): threshold value of reads quality in phred33 scale. | ||||||
|
|
||||||
| Output: | ||||||
| Returns boolean value is this read satisfied the criteria. | ||||||
| """ | ||||||
| mean_quality = sum(fastq_quality.letter_annotations["phred_quality"])/len(fastq_quality.seq) | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| if mean_quality >= quality_params: | ||||||
| return True | ||||||
|
|
||||||
|
|
||||||
| def int_to_tuple(input_parameters) -> tuple: | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Вынести это в отдельную функцию - очень правильное решение, по-моему, упрощает логику кода. |
||||||
| """ | ||||||
| Converts input parameters to tuple format. | ||||||
| If input is already a tuple, it will be return without changes. | ||||||
| If input parameter is int (only upper threshold is entered), function will return a tuple like (0, 'input'). | ||||||
|
|
||||||
| Input: | ||||||
| - input_parameters (tuple or int). | ||||||
|
|
||||||
| Output: | ||||||
| Lower and upper threshold for functions in tuple format. | ||||||
| """ | ||||||
| if isinstance(input_parameters, tuple): | ||||||
| return input_parameters | ||||||
| return (0, input_parameters) | ||||||
|
|
||||||
|
|
||||||
| def run_fastq_filter(input_path: Optional[str] = None, output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO: | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. А зачем дефолтный None для input_path? Всё равно упадёт с ошибкой при попытке прочитать None, но если у необходимого атрибута не будет дефолтного значения и его забудут указать, ошибка будет другая, понятнее.
Suggested change
|
||||||
| """ | ||||||
| FASTQ Filtraror with BIOPYTHON utilities | ||||||
| Performs filter of input FASTQ file according to input parameters. | ||||||
| Input will be filtered by: | ||||||
| - GC content (gc_bounds); | ||||||
| - reads length (length_bounds); | ||||||
| - reads quality score (quality_threshold). | ||||||
|
|
||||||
| Input: | ||||||
| - input_path (str): path to .fastq file; include 4 strings: 1 - read ID, 2 - sequence, 3 - comment, 4 - quality. Default - None. | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Кажется, описание спецификации fastq-формата уже лишнее)
Suggested change
|
||||||
| - output_filename (str): name of output file, by default, it will be saved in the directory 'fastq_filtrator_resuls'. Default name will be name of input file. | ||||||
| - gc_bounds (tuple or int): GC content filter parameters, it accepts lower and upper (tuple), or only upper threshold value (int). Default value (0, 100). | ||||||
| - length_bounds (tuple or int): read length filter parameters, it accepts lower and upper (tuple), or only upper threshold value (int). Default value (0, 2**32). | ||||||
| - quality_threshold (int): upper quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0. | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Только не верхняя граница, а нижняя.
Suggested change
|
||||||
|
|
||||||
| Output: | ||||||
| Returns FASTQ only with filtered reads which satisfied all input/default conditions. | ||||||
| """ | ||||||
|
|
||||||
| "Specify input path" | ||||||
| if not input_path.endswith('.fastq'): | ||||||
| raise ValueError('Incorrect input file extension, should be .fastq') | ||||||
|
Comment on lines
+191
to
+192
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Здорово, что есть контроль входного файла! |
||||||
|
|
||||||
| "Specify output path" | ||||||
| output_path = 'fastq_filtrator_resuls' | ||||||
| if not os.path.exists(output_path): | ||||||
| os.makedirs(output_path) | ||||||
| if output_filename is None: | ||||||
| output_filename = os.path.basename(input_path) | ||||||
| "Passed the parameters" | ||||||
| gc_params = int_to_tuple(gc_bounds) | ||||||
| len_bound_params = int_to_tuple(length_bounds) | ||||||
| "Filter and record results" | ||||||
| filtererd_fastq = open(os.path.join(output_path, output_filename), mode='w') | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Работать с файлами через |
||||||
| for seq_record in SeqIO.parse(input_path, "fastq"): | ||||||
| if check_gc(seq_record.seq, gc_params) and check_length(seq_record.seq, len_bound_params) and check_quality(seq_record, quality_threshold): | ||||||
| SeqIO.write(seq_record, filtererd_fastq, "fastq") | ||||||
| filtererd_fastq.close() | ||||||
|
Comment on lines
+204
to
+208
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Лучше использовать контекстный менеджер с оператором with, чтобы точно знать, что все закроется и не произойдет непредвиденного
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Даже в этом случае, согласен |
||||||
| return filtererd_fastq | ||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Правильно остортировано по происхождению бибилиотек, но нет лексикографического порядка.