Review RGL1 by nvaulin · Pull Request #36 · Python-BI-2023/Peer_review

nvaulin · 2024-02-26T17:57:38Z

Review RGL1

nerofeeva2001

Good code. There are a few shortcomings, but nothing critical.

nerofeeva2001 · 2024-03-09T02:08:16Z

RGL1.py

+from typing import Union
+from Bio import SeqIO, SeqUtils
+from abc import ABC, abstractmethod
+


It is great that Union is used; it makes the code easier to understand.

nerofeeva2001 · 2024-03-09T02:10:32Z

RGL1.py

+    if (quality_threshold < sum(record.letter_annotations['phred_quality']) /
+            len(record.letter_annotations['phred_quality'])):
+        return record
+


All three functions above return either record or None. This approach works, but it can be simplified by having the functions return True or False, which makes the condition checks easier to read.
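A minimal sketch of the boolean variant, to make the suggestion concrete. The `Read` dataclass is a hypothetical stand-in for a Biopython record so that the example runs without Biopython; `passes_length` and `passes_quality` are illustrative names, not functions from RGL1.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    """Hypothetical stand-in for a Biopython record with phred scores."""
    seq: str
    phred_quality: List[int]


def passes_length(read: Read, lower: int, upper: int) -> bool:
    # True when the read length lies inside the inclusive bounds.
    return lower <= len(read.seq) <= upper


def passes_quality(read: Read, threshold: float) -> bool:
    # Same strict comparison as in the reviewed code: threshold < mean quality.
    mean_quality = sum(read.phred_quality) / len(read.phred_quality)
    return threshold < mean_quality


reads = [Read("ACGT", [30, 30, 30, 30]), Read("AC", [10, 10])]
kept = [r for r in reads if passes_length(r, 3, 10) and passes_quality(r, 20)]
```

With predicates like these the caller decides what to do with the record, and the `if ... and ...` chain in `filter_fastq` reads as a plain boolean expression.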

nerofeeva2001 · 2024-03-09T02:17:36Z

RGL1.py

+
+def filter_fastq(file_path: str, output_file: str = 'input_fasta_name_filter.fastq',
+                 lower_gc_bound: int = 0, upper_gc_bound: int = 30,
+                 lower_length_bound: int = 0, upper_length_bound: int = 2**32,


Union could have been used here as well.

nerofeeva2001 · 2024-03-09T02:19:32Z

RGL1.py

+                    gc_filter(record, lower_gc_bound, upper_gc_bound) and \
+                    quality_filter(record, quality_threshold):
+                SeqIO.write(record, output, 'fastq')
+


Calling SeqIO.write(record, output, 'fastq') inside the loop for every matching record is inefficient, because each record is handled separately. It is better to collect the matching records in a list and then write them with a single call: SeqIO.write(filtered_records, output, 'fastq')

There are two options here.

If everything is read at once, then yes, it is better to write everything at once too.

But the data can also be processed on the fly: read one record and, if it passes, write it. In that case it is simply better to keep the files open the whole time (do everything inside the with open block). I actually like this option slightly more.
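The on-the-fly option can be sketched with the standard library alone. This is not the Biopython-based code from the PR: `read_fastq` and `stream_filter` are hypothetical helpers that assume a plain four-line FASTQ layout, and the in-memory `io.StringIO` objects stand in for files opened in a single `with` block. With Biopython itself, `SeqIO.write` is documented as accepting an iterator of records as well as a list, so a generator of passing records could likely be handed to one call without building a list first.

```python
import io
from typing import Iterator, TextIO, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    # Yield one record at a time; assumes exactly four lines per record.
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def stream_filter(src: TextIO, dst: TextIO, min_length: int) -> int:
    # Both handles stay open for the whole pass; nothing is accumulated.
    written = 0
    for header, seq, qual in read_fastq(src):
        if len(seq) >= min_length:
            dst.write(f"{header}\n{seq}\n+\n{qual}\n")
            written += 1
    return written


source = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n")
target = io.StringIO()
n_written = stream_filter(source, target, min_length=3)
```

Memory use stays constant regardless of file size, which is the main argument for this option on large FASTQ files.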

nerofeeva2001 · 2024-03-09T02:20:23Z

RGL1.py

+                    gc_filter(record, lower_gc_bound, upper_gc_bound) and \
+                    quality_filter(record, quality_threshold):
+                SeqIO.write(record, output, 'fastq')
+


There is no validation of the ranges for lower_gc_bound, upper_gc_bound, lower_length_bound, upper_length_bound and quality_threshold.
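One possible shape for such a check, as a sketch. `validate_bounds` is a hypothetical helper, and the rules it enforces (GC bounds as percentages between 0 and 100, non-negative lengths and quality) are assumptions based on the defaults in the diff, not requirements stated in the PR.

```python
def validate_bounds(lower_gc_bound: float, upper_gc_bound: float,
                    lower_length_bound: int, upper_length_bound: int,
                    quality_threshold: float) -> None:
    """Raise ValueError if any filter parameter is out of range."""
    # Assumption: GC content is expressed as a percentage.
    if not 0 <= lower_gc_bound <= upper_gc_bound <= 100:
        raise ValueError("GC bounds must satisfy 0 <= lower <= upper <= 100")
    if not 0 <= lower_length_bound <= upper_length_bound:
        raise ValueError("length bounds must satisfy 0 <= lower <= upper")
    if quality_threshold < 0:
        raise ValueError("quality_threshold must be non-negative")
```

Calling it at the top of `filter_fastq` would fail fast on a swapped pair such as `lower_gc_bound=40, upper_gc_bound=30` instead of silently writing an empty output file.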

nerofeeva2001 · 2024-03-09T02:41:03Z

RGL1.py

+
+class SequenceFunction(BiologicalSequence):
+    alphabet = ()
+


It would be worth setting up the alphabet here.

nerofeeva2001 · 2024-03-09T02:42:28Z

RGL1.py

+            return self.seq[item]
+        else:
+            raise IndexError('Your index is incorrect')
+


This is redundant: Python raises IndexError by itself for out-of-range indices.

Suggested change

    def __getitem__(self, item: int) -> str:
        return self.seq[item]

nerofeeva2001 · 2024-03-09T02:44:03Z

RGL1.py

+    def gc_content(self) -> Union[int, float]:
+        gc_count = self.seq.count('C') + self.seq.count('G')
+        return gc_count/len(self.seq)*100
+


The return annotation is Union[int, float], but in practice the result is always a float.

Suggested change

def gc_content(self) -> float:

nerofeeva2001 · 2024-03-09T02:47:23Z

RGL1.py

+    def check_alphabet(self):
+        return super().check_alphabet()
+
+    def count_molecular_weight(self, amino_acid_weights: dict) -> int:


The annotation says that count_molecular_weight returns int, but the result will actually be a float.

nerofeeva2001 · 2024-03-09T02:49:13Z

RGL1.py

+        return super().check_alphabet()
+
+    def count_molecular_weight(self, amino_acid_weights: dict) -> int:
+        molecular_weight = sum(self.amino_acid_weights.get(aa, 0) for aa in self.seq)


The count_molecular_weight method takes an amino_acid_weights parameter that is never used; the class attribute is used instead.

Suggested change

-        molecular_weight = sum(self.amino_acid_weights.get(aa, 0) for aa in self.seq)
+    def count_molecular_weight(self) -> int:
+        molecular_weight = sum(self.amino_acid_weights.get(aa, 0) for aa in self.seq)
+        return molecular_weight
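Putting the two remarks about count_molecular_weight together (unused parameter, float result) might look like the sketch below. The class name `ProteinSketch` is hypothetical and the two weights are illustrative placeholder values, not the table from RGL1.py.

```python
class ProteinSketch:
    """Hypothetical minimal class; only the weight calculation matters."""

    # Assumed example values for illustration only.
    amino_acid_weights = {"G": 57.05, "A": 71.08}

    def __init__(self, seq: str) -> None:
        self.seq = seq

    def count_molecular_weight(self) -> float:
        # Unknown residues contribute 0, as in the reviewed code.
        # float() keeps the return type stable even for an empty sequence,
        # where sum() over nothing would otherwise give the int 0.
        return float(sum(self.amino_acid_weights.get(aa, 0) for aa in self.seq))


weight = ProteinSketch("GA").count_molecular_weight()
```

The explicit `float()` is a design choice rather than a necessity: without it the annotation `-> float` would be slightly inaccurate for the empty-sequence case.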

LinaWhite15

Overall the work is good: the code runs, and the utility can be used for data processing. The formatting is tidy, type annotations are present, indentation is consistent, and the names are clear. The minor flaws are fairly easy to fix.

LinaWhite15 · 2024-03-10T07:46:25Z

RGL1.py

+
+def length_filter(record, lower_length_bound: int, upper_length_bound: int):
+    if lower_length_bound <= len(record.seq) <= upper_length_bound:
+        return record


Slightly redundant; the function could return either True or False.

LinaWhite15 · 2024-03-10T07:47:52Z

RGL1.py

+def filter_fastq(file_path: str, output_file: str = 'input_fasta_name_filter.fastq',
+                 lower_gc_bound: int = 0, upper_gc_bound: int = 30,
+                 lower_length_bound: int = 0, upper_length_bound: int = 2**32,
+                 quality_threshold: Union[int, float] = 0) -> None:


Good use of modules.

LinaWhite15 · 2024-03-10T07:49:46Z

RGL1.py

+                SeqIO.write(record, output, 'fastq')
+
+
+class BiologicalSequence(ABC):


All the required abstract methods are implemented.

LinaWhite15 · 2024-03-10T07:52:26Z

RGL1.py

+        pass
+
+
+class SequenceFunction(BiologicalSequence):


Cool solution, introducing a helper class.

LinaWhite15 · 2024-03-10T07:54:26Z

RGL1.py

+        self.length = len(self.seq)
+        return self.length
+
+    def __getitem__(self, item: int) -> str:


A working solution, but it would be great if the user could get not only a single element but also a whole slice.
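A sketch of how slice support could come almost for free by delegating to the underlying string. `MiniSequence` is a hypothetical class used only to isolate the indexing logic; it returns plain `str` for both cases, while returning a new sequence object for slices would be an equally valid design.

```python
from typing import Union


class MiniSequence:
    """Hypothetical minimal class; only the indexing logic matters here."""

    def __init__(self, seq: str) -> None:
        self.seq = seq

    def __len__(self) -> int:
        return len(self.seq)

    def __getitem__(self, item: Union[int, slice]) -> str:
        # str already handles negative indices and slices, and raises
        # IndexError itself for an out-of-range integer index.
        return self.seq[item]


dna = MiniSequence("ATGC")
first, last, middle = dna[0], dna[-1], dna[1:3]
```

Dropping the manual `0 <= item < len(self.seq)` check is what makes both slices and negative indices work.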

LinaWhite15 · 2024-03-10T07:58:55Z

RGL1.py

+        super().__init__(seq)
+
+    def complement(self) -> str:
+        complement_seq = self.seq.translate(str.maketrans(self.complement_dict))


Interesting solution!

LinaWhite15 · 2024-03-10T08:04:03Z

RGL1.py

+
+
+class DNASequence(NucleicAcidSequence):
+    alphabet = ('A', 'T', 'G', 'C')


Case sensitivity: only uppercase input will be accepted. It may be worth adding upper to check_alphabet.
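One way to handle case, sketched under the assumption that lowercase input should simply be treated as uppercase. `SimpleDNASequence` is a hypothetical, flattened stand-in for the class hierarchy in the PR; normalising in `__init__` rather than in `check_alphabet` is a design choice so that `complement` and `gc_content` see the same letters.

```python
class SimpleDNASequence:
    """Hypothetical simplified class, without the ABC hierarchy from the PR."""

    alphabet = ("A", "T", "G", "C")

    def __init__(self, seq: str) -> None:
        # Normalise once so every later method sees uppercase letters.
        self.seq = seq.upper()

    def check_alphabet(self) -> bool:
        # True when every character of the sequence is in the alphabet.
        return set(self.seq) <= set(self.alphabet)


lower_ok = SimpleDNASequence("atgc").check_alphabet()
```

If the original case must be preserved (for example, soft-masked regions), the alternative is to extend `alphabet` and `complement_dict` with lowercase letters instead.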

LinaWhite15 · 2024-03-10T08:06:18Z

RGL1.py

+        super().__init__(seq)
+
+    def check_alphabet(self):
+        return super().check_alphabet()


Again, it may be worth converting the sequence to upper case.

sofiyaga57

Very nice work! The code was clear and easy to read. For someone encountering the module for the first time, though, it might be hard to understand what is going on without docstrings. There are some great solutions here, and I took note of them :)

sofiyaga57 · 2024-03-10T11:22:45Z

RGL1.py

@@ -0,0 +1,142 @@
+from typing import Union


Nice that Union is here! I am sometimes too lazy to use it myself, so respect.

sofiyaga57 · 2024-03-10T11:23:50Z

RGL1.py

+from abc import ABC, abstractmethod
+
+
+def length_filter(record, lower_length_bound: int, upper_length_bound: int):


A return annotation is missing here; it was not immediately obvious that the function can return either record or None.
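If the record-or-None behaviour is kept, the annotation could spell it out with `Optional`. In this sketch `Read` is a hypothetical stand-in for the Biopython record type, used so that the example runs without Biopython installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    """Hypothetical stand-in for the Biopython record type."""
    seq: str


def length_filter(record: Read, lower_length_bound: int,
                  upper_length_bound: int) -> Optional[Read]:
    """Return the record if its length is within bounds, otherwise None."""
    if lower_length_bound <= len(record.seq) <= upper_length_bound:
        return record
    return None  # explicit, so the second outcome is visible in the body too


short_read = Read("AC")
```

The explicit `return None` is optional in Python, but together with the annotation it documents both outcomes in the function itself.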

sofiyaga57 · 2024-03-10T11:24:08Z

RGL1.py

+def filter_fastq(file_path: str, output_file: str = 'input_fasta_name_filter.fastq',
+                 lower_gc_bound: int = 0, upper_gc_bound: int = 30,
+                 lower_length_bound: int = 0, upper_length_bound: int = 2**32,
+                 quality_threshold: Union[int, float] = 0) -> None:


sofiyaga57 · 2024-03-10T11:28:55Z

RGL1.py

+    def __getitem__(self, item: int) -> str:
+        if 0 <= item < len(self.seq):
+            return self.seq[item]
+        else:
+            raise IndexError('Your index is incorrect')


It is a bit odd that an IndexError is raised for negative indices: they are fine in Python, where we can access [-1] and so on.

sofiyaga57 · 2024-03-10T11:34:08Z

RGL1.py

+        return complement_seq
+
+    def gc_content(self) -> Union[int, float]:
+        gc_count = self.seq.count('C') + self.seq.count('G')


Suggested change

-        gc_count = self.seq.count('C') + self.seq.count('G')
+        gc_count = sum(1 for base in self.seq if base in 'GC')

You could try it this way, as we recently discussed in class. I think it will be faster, because we count once instead of twice.
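Whether the single-pass version is actually faster is worth measuring rather than assuming: in CPython `str.count` runs in C, while the generator expression iterates in Python, so two `count` calls are not necessarily slower than one generator pass. A small sketch for checking this locally follows; the synthetic sequence and the function names are illustrative, and no timing result is claimed here.

```python
import timeit

seq = "ATGC" * 25_000  # synthetic 100 kb sequence, purely for illustration


def gc_two_counts(s: str) -> int:
    # Two passes over the string, each done inside str.count.
    return s.count("C") + s.count("G")


def gc_generator(s: str) -> int:
    # One pass, driven by a Python-level generator expression.
    return sum(1 for base in s if base in "GC")


# Both approaches must agree before timing means anything.
assert gc_two_counts(seq) == gc_generator(seq)

for func in (gc_two_counts, gc_generator):
    elapsed = timeit.timeit(lambda: func(seq), number=20)
    print(f"{func.__name__}: {elapsed:.4f} s for 20 runs")
```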

sofiyaga57 · 2024-03-10T11:36:18Z

RGL1.py

+                 lower_gc_bound: int = 0, upper_gc_bound: int = 30,
+                 lower_length_bound: int = 0, upper_length_bound: int = 2**32,


It might be worth moving these parameters out of the function as constants.
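A sketch of what that could look like. The constant names and the `filter_settings` helper are hypothetical; the values are the defaults shown in the diff.

```python
# Module-level defaults (hypothetical names; values taken from the diff).
LOWER_GC_BOUND = 0
UPPER_GC_BOUND = 30
LOWER_LENGTH_BOUND = 0
UPPER_LENGTH_BOUND = 2**32


def filter_settings(lower_gc_bound: int = LOWER_GC_BOUND,
                    upper_gc_bound: int = UPPER_GC_BOUND,
                    lower_length_bound: int = LOWER_LENGTH_BOUND,
                    upper_length_bound: int = UPPER_LENGTH_BOUND) -> dict:
    """Hypothetical helper that just echoes the effective settings."""
    return {"gc": (lower_gc_bound, upper_gc_bound),
            "length": (lower_length_bound, upper_length_bound)}


defaults = filter_settings()
```

Named constants give the magic numbers a single place to change and make them importable from other modules or tests.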

sofiyaga57 · 2024-03-10T11:38:01Z

RGL1.py

+    alphabet = ('A', 'T', 'G', 'C')
+    complement_dict = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}


I would add lowercase letters, or else call upper on the input string.

sofiyaga57 · 2024-03-10T11:40:09Z

RGL1.py

+        super().__init__(seq)
+
+    def complement(self) -> str:
+        complement_seq = self.seq.translate(str.maketrans(self.complement_dict))


Neat use of maketrans :) I did not know this function existed.
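For readers who have not met it either, a short standalone illustration of the `str.maketrans` / `str.translate` pair, using the same `complement_dict` as in the diff. The input strings are made up for the example.

```python
complement_dict = {"A": "T", "T": "A", "G": "C", "C": "G"}

# str.maketrans turns a dict of single characters into a translation table;
# str.translate then maps every character of the string in one pass.
table = str.maketrans(complement_dict)
complement = "ATGC".translate(table)

# Characters missing from the table are left unchanged.
with_unknown = "ATGN".translate(table)
```

Because unmapped characters pass through untouched, lowercase input would come out uncomplemented unless the dict is extended or the sequence is upper-cased first.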

Add RGL1.py

66cdc21



