-
Notifications
You must be signed in to change notification settings - Fork 0
Review RGL1 #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Review RGL1 #36
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,142 @@ | ||||||||||||
| from typing import Union | ||||||||||||
| from Bio import SeqIO, SeqUtils | ||||||||||||
| from abc import ABC, abstractmethod | ||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Здорово, что используется Union, это упрощает понимание кода |
||||||||||||
|
|
||||||||||||
| def length_filter(record, lower_length_bound: int, upper_length_bound: int): | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Немножко не хватило аннотации, как-то было не очевидно сразу, что он может выдавать и record, и None |
||||||||||||
| if lower_length_bound <= len(record.seq) <= upper_length_bound: | ||||||||||||
| return record | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Немного избыточно, можно либо True, либо False |
||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| def gc_filter(record, lower_gc_bound: float, upper_gc_bound: float): | ||||||||||||
| if lower_gc_bound <= SeqUtils.GC(record.seq) <= upper_gc_bound: | ||||||||||||
| return record | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| def quality_filter(record, quality_threshold: Union[int, float] = 0): | ||||||||||||
| if (quality_threshold < sum(record.letter_annotations['phred_quality']) / | ||||||||||||
| len(record.letter_annotations['phred_quality'])): | ||||||||||||
| return record | ||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Все три функции выше возвращают record или None. Это подход работает, но его можно упростить, изменив функции так, чтобы они возвращали True или False, что будет более понятно для проверки условий |
||||||||||||
|
|
||||||||||||
| def filter_fastq(file_path: str, output_file: str = 'input_fasta_name_filter.fastq', | ||||||||||||
| lower_gc_bound: int = 0, upper_gc_bound: int = 30, | ||||||||||||
| lower_length_bound: int = 0, upper_length_bound: int = 2**32, | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Тут тоже можно было написать с помощью Union
Comment on lines
+23
to
+24
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Возможно, стоило бы вынести эти параметры как константы вне функции |
||||||||||||
| quality_threshold: Union[int, float] = 0) -> None: | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Хорошее использование модулей There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Like! |
||||||||||||
| records = SeqIO.parse(file_path, 'fastq') | ||||||||||||
| if output_file == 'input_fasta_name_filter.fastq': | ||||||||||||
| output_filename = file_path.split('\\')[-1] | ||||||||||||
| output_file = output_filename.replace('.fastq', '_filter.fastq') | ||||||||||||
| with open(output_file, 'w') as output: | ||||||||||||
| for record in records: | ||||||||||||
| if length_filter(record, lower_length_bound, upper_length_bound) and \ | ||||||||||||
| gc_filter(record, lower_gc_bound, upper_gc_bound) and \ | ||||||||||||
| quality_filter(record, quality_threshold): | ||||||||||||
| SeqIO.write(record, output, 'fastq') | ||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Повторная запись в файл для каждой подходящей записи через SeqIO.write(record, output, 'fastq') в цикле неэффективна, так как каждая запись обрабатывается отдельно. Лучше собрать подходящие записи в список, а затем записать их одним вызовом SeqIO.write(filtered_records, output, 'fastq')
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. В данном случае есть 2 варианта
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Нет проверки корректности диапазонов для lower_gc_bound, upper_gc_bound, lower_length_bound, upper_length_bound и quality_threshold |
||||||||||||
|
|
||||||||||||
| class BiologicalSequence(ABC): | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Реализованы все необходимые абстрактные методы |
||||||||||||
| @abstractmethod | ||||||||||||
| def __len__(self): | ||||||||||||
| pass | ||||||||||||
|
|
||||||||||||
| @abstractmethod | ||||||||||||
| def __getitem__(self, item: int): | ||||||||||||
| pass | ||||||||||||
|
|
||||||||||||
| @abstractmethod | ||||||||||||
| def __str__(self): | ||||||||||||
| pass | ||||||||||||
|
|
||||||||||||
| @abstractmethod | ||||||||||||
| def check_alphabet(self): | ||||||||||||
| pass | ||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Комментарий для кода ниже: в DNASequence, RNASequence, и других подклассах используются методы check_alphabet и complement, которые просто вызывают реализацию из этого суперкласса. Это избыточно, так как наследование само по себе обеспечивает доступ к этим методам, если они не переопределены в подклассе |
||||||||||||
|
|
||||||||||||
| class SequenceFunction(BiologicalSequence): | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Крутое решение с введением вспомогательного класса |
||||||||||||
| alphabet = () | ||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Стоило бы настроить алфавит |
||||||||||||
| def __init__(self, seq: str) -> None: | ||||||||||||
| self.seq = seq | ||||||||||||
|
|
||||||||||||
| def __len__(self) -> int: | ||||||||||||
| self.length = len(self.seq) | ||||||||||||
| return self.length | ||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. В методе len класса устанавливается атрибут self.length, что является избыточным. Метод len предназначен только для возврата длины, и ему не нужно сохранять это значение в атрибуте объекта There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Обрати внимание, когда ты пишешь |
||||||||||||
| def __getitem__(self, item: int) -> str: | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Работающее решение, но было бы здорово, если бы пользователь мог получить не только отдельный элемент, но целый срез |
||||||||||||
| if 0 <= item < len(self.seq): | ||||||||||||
| return self.seq[item] | ||||||||||||
| else: | ||||||||||||
| raise IndexError('Your index is incorrect') | ||||||||||||
|
Comment on lines
+66
to
+70
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Немножко странно, что ошибка по индексу будет вылезать при отрицательных значениях индекса: это же ок для питона, мы можем обращаться к [-1] и тд |
||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Это избыточно, Python сам генерирует IndexError для индексов вне диапазона
Suggested change
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Верно |
||||||||||||
| def __str__(self) -> str: | ||||||||||||
| return str(self.seq) | ||||||||||||
|
|
||||||||||||
| def check_alphabet(self) -> bool: | ||||||||||||
| return set(self.seq).issubset(set(self.alphabet)) | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| class NucleicAcidSequence(SequenceFunction): | ||||||||||||
| complement_dict = {} | ||||||||||||
| def __init__(self, seq: str): | ||||||||||||
| super().__init__(seq) | ||||||||||||
|
|
||||||||||||
| def complement(self) -> str: | ||||||||||||
| complement_seq = self.seq.translate(str.maketrans(self.complement_dict)) | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Интересное решение! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Прикольно с maketrans:)) не знала о существовании этой функции |
||||||||||||
| return complement_seq | ||||||||||||
|
|
||||||||||||
| def gc_content(self) -> Union[int, float]: | ||||||||||||
| gc_count = self.seq.count('C') + self.seq.count('G') | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Можно попробовать так, как разбирали недавно на паре. Мне кажется, так будет более эффективно по скорости, потому что мы считаем один раз, а не два. |
||||||||||||
| return gc_count/len(self.seq)*100 | ||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Аннотация типа возвращает Union[int, float], но на практике это всегда будет float
Suggested change
|
||||||||||||
|
|
||||||||||||
| class DNASequence(NucleicAcidSequence): | ||||||||||||
| alphabet = ('A', 'T', 'G', 'C') | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Регистрозависимость, на вход будут приниматься только записи большими буквами, возможно, стоит добавить upper в check_alphabet |
||||||||||||
| complement_dict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'} | ||||||||||||
|
Comment on lines
+94
to
+95
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Я бы добавила строчные буквы или тогда делала бы upper для входной строки |
||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Стоило бы добавить вариант со строчными буквами, аналогично и для функции ниже |
||||||||||||
| def __init__(self, seq: str): | ||||||||||||
| super().__init__(seq) | ||||||||||||
|
|
||||||||||||
| def check_alphabet(self): | ||||||||||||
| return super().check_alphabet() | ||||||||||||
|
|
||||||||||||
| def complement(self): | ||||||||||||
| return super().complement() | ||||||||||||
|
|
||||||||||||
| def transcribe(self) -> str: | ||||||||||||
| transcribed_seq = self.seq.translate(str.maketrans('ATGC', 'UACG')) | ||||||||||||
| return transcribed_seq | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| class RNASequence(NucleicAcidSequence): | ||||||||||||
| alphabet = ('A', 'U', 'G', 'C') | ||||||||||||
| complement_dict = {'A': 'U', 'U': 'A', 'G': 'C', 'C': 'G'} | ||||||||||||
|
|
||||||||||||
| def __init__(self, seq: str) -> None: | ||||||||||||
| super().__init__(seq) | ||||||||||||
|
|
||||||||||||
| def complement(self): | ||||||||||||
| return super().complement() | ||||||||||||
|
|
||||||||||||
| def check_alphabet(self): | ||||||||||||
| return super().check_alphabet() | ||||||||||||
|
|
||||||||||||
|
|
||||||||||||
| class AminoAcidSequence(SequenceFunction): | ||||||||||||
| alphabet = ('A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V') | ||||||||||||
| amino_acid_weights = { | ||||||||||||
| 'A': 89, 'R': 174, 'N': 132, 'D': 133, 'C': 121, | ||||||||||||
| 'E': 147, 'Q': 146, 'G': 75, 'H': 155, 'I': 131, | ||||||||||||
| 'L': 131, 'K': 146, 'M': 149, 'F': 165, 'P': 115, | ||||||||||||
| 'S': 105, 'T': 119, 'W': 204, 'Y': 181, 'V': 117 | ||||||||||||
| } | ||||||||||||
|
|
||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Стоило бы добавить строчные буквы, чтобы подсчеты велись корректно |
||||||||||||
| def __init__(self, seq) -> None: | ||||||||||||
| super().__init__(seq) | ||||||||||||
|
|
||||||||||||
| def check_alphabet(self): | ||||||||||||
| return super().check_alphabet() | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Опять же, возможно, стоит добавить переведение последовательности в верхний регистр |
||||||||||||
|
|
||||||||||||
| def count_molecular_weight(self, amino_acid_weights: dict) -> int: | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. В аннотации указано, что метод count_molecular_weight возвращает int, но на самом деле результатом будет float |
||||||||||||
| molecular_weight = sum(self.amino_acid_weights.get(aa, 0) for aa in self.seq) | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Метод count_molecular_weight принимает параметр amino_acid_weights, который не используется. Вместо этого, используется атрибут класса There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||
| return molecular_weight | ||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Биологически при вычислении массы белка нужно вычитать воду, можно добавить эти сроки в код |
||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Классно, что есть Union! Мне самое порой лениво его использовать, так что респект