Review GAD2 by nvaulin · Pull Request #24 · Python-BI-2023/Peer_review

nvaulin · 2024-02-26T17:54:44Z

Review GAD2

grishchenkoira

Good work! 🐱 I really liked the filter. A few comments below.

grishchenkoira · 2024-03-04T14:14:46Z

GAD2.py

+import os
+import re
+from typing import TextIO, Optional, Union
+from abc import ABC, abstractmethod
+from Bio import SeqIO, SeqUtils


The imports are correctly grouped by origin (standard library first, then third-party), but within the groups they are not in lexicographic order.

Suggested change

-import os
-import re
-from typing import TextIO, Optional, Union
-from abc import ABC, abstractmethod
-from Bio import SeqIO, SeqUtils
+import os
+import re
+from abc import ABC, abstractmethod
+from typing import TextIO, Optional, Union
+from Bio import SeqIO, SeqUtils

grishchenkoira · 2024-03-04T14:44:46Z

GAD2.py

+    complement_rule = {'a': 't', 'A': 'T', 't': 'a', 'T': 'A',
+                       'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'}


It's great that the variable is defined before all the methods. It belongs to the whole class and is not meant to differ between instances, so it would be better to write its name in uppercase.
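A minimal sketch of the naming suggestion. It assumes the dictionary lives on the DNASequence class; the `complement` method body is illustrative only and is not the code from the PR.

```python
class DNASequence:
    # Class-level constant shared by all instances, hence the UPPER_CASE name.
    COMPLEMENT_RULE = {'a': 't', 'A': 'T', 't': 'a', 'T': 'A',
                       'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'}

    def __init__(self, seq: str) -> None:
        self.seq = seq

    def complement(self) -> str:
        # Illustrative method: look the constant up on the class, not the instance.
        return ''.join(type(self).COMPLEMENT_RULE[nt] for nt in self.seq)
```

With this sketch, `DNASequence('AtGc').complement()` gives `'TaCg'`.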

grishchenkoira · 2024-03-04T14:53:20Z

GAD2.py

+        Function can procced only DNA seqences.
+        """
+        transcribed_sequence = ''
+        transcribed_sequence = self.seq.replace('t', 'u').replace('T', 'U')


There is no validity check. If the sequence contains a letter that is not in the alphabet, it will end up in the new sequence as well.

Well, overall, validation is not the responsibility of transcribe.

grishchenkoira · 2024-03-04T18:24:48Z

GAD2.py

+from abc import ABC, abstractmethod
+from Bio import SeqIO, SeqUtils
+
+class BiologicalSequence(ABC):


I'm noting this here because it applies to all the sequence classes.

There are no type annotations.

grishchenkoira · 2024-03-04T18:28:42Z

GAD2.py

+
+    def __init__(self, seq) -> None:
+        super().__init__(seq = seq)
+


There is no class for protein sequences 😿

grishchenkoira · 2024-03-04T18:30:58Z

GAD2.py

+    if not input_path.endswith('.fastq'):
+        raise ValueError('Incorrect input file extension, should be .fastq')   


Great that the input file is checked!

grishchenkoira · 2024-03-04T18:38:53Z

GAD2.py

+    filtererd_fastq = open(os.path.join(output_path, output_filename), mode='w')
+    for seq_record in SeqIO.parse(input_path, "fastq"):
+        if check_gc(seq_record.seq, gc_params) and check_length(seq_record.seq, len_bound_params) and check_quality(seq_record, quality_threshold):
+                SeqIO.write(seq_record, filtererd_fastq, "fastq")  
+    filtererd_fastq.close()   


It is better to use a context manager (the with statement), so that the file is guaranteed to be closed and nothing unexpected happens.

Even in this case, agreed.
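A minimal sketch of the context-manager pattern suggested here. To stay self-contained it filters plain strings rather than SeqIO records; `write_filtered` and its parameters are hypothetical names, not part of the PR.

```python
from typing import Callable, Iterable


def write_filtered(records: Iterable[str], keep: Callable[[str], bool],
                   output_path: str) -> int:
    """Write the records that pass `keep` to output_path; return how many were written."""
    written = 0
    # The with block closes the file even if keep() or write() raises.
    with open(output_path, mode='w') as filtered_file:
        for record in records:
            if keep(record):
                filtered_file.write(record + '\n')
                written += 1
    return written
```

In the PR's loop the same shape would apply: the `for seq_record in SeqIO.parse(...)` loop moves inside the `with` block and the explicit `close()` call goes away.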

KirPetrikov

Hi! Good work!
AminoAcidSequence got lost somewhere :'(
I think the filtering functions no longer needed to be written as separate functions and could have been implemented directly in run_fastq_filter. But that is not critical.
The docstrings: hats off!
Take the comments into account a little and it will be just great! Good luck!

KirPetrikov · 2024-03-09T16:21:01Z

GAD2.py

+from abc import ABC, abstractmethod
+from Bio import SeqIO, SeqUtils
+
+class BiologicalSequence(ABC):


As far as I can tell, the abstract class is written exactly as it should be.

KirPetrikov · 2024-03-09T16:21:28Z

GAD2.py

+    """Proccising nucleic acid sequences.
+    This class is parental for DNASequence and RNASequence classes.


The docstrings are excellent.

KirPetrikov · 2024-03-09T16:23:07Z

GAD2.py

+    """Proccising nucleic acid sequences.
+    This class is parental for DNASequence and RNASequence classes.
+    """
+    def __init__(self, seq) -> None:


I noticed that the classes have type annotations only for the return value of __init__. It seems that __init__ is exactly where a return annotation is least needed, while annotations for the arguments can definitely be added.

Suggested change

-def __init__(self, seq) -> None:
+def __init__(self, seq: str) -> None:

KirPetrikov · 2024-03-09T16:25:02Z

GAD2.py

+        if self.check_sequence():
+            complement_sequence = ''
+            for nucleotide in self.seq:
+                if nucleotide in type(self).complement_rule:


Here the complement_rule attribute is accessed through the instance's class, but the NucleicAcidSequence class itself does not define it. That does not seem quite right.
I suggest this: in NucleicAcidSequence, set all such attributes to None. Then the check for these attributes in the methods looks like this:

if self.complement_rule is None: raise NotImplementedError

When you write code in PR comments, add python after ``` so that the syntax is highlighted properly.
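A minimal sketch of the proposal above, assuming a simplified class hierarchy. The `complement` body is illustrative: the PR's actual method may validate the sequence first and may return a new object rather than a string.

```python
class NucleicAcidSequence:
    # Placeholder: concrete subclasses are expected to override this.
    complement_rule = None

    def __init__(self, seq: str) -> None:
        self.seq = seq

    def complement(self) -> str:
        if self.complement_rule is None:
            raise NotImplementedError('complement_rule is not defined for this class')
        # Direct lookup: an unknown nucleotide raises KeyError in this sketch.
        return ''.join(self.complement_rule[nt] for nt in self.seq)


class DNASequence(NucleicAcidSequence):
    complement_rule = {'a': 't', 'A': 'T', 't': 'a', 'T': 'A',
                       'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'}
```

Calling `complement()` on a bare `NucleicAcidSequence` then fails with an explicit NotImplementedError instead of an AttributeError about a missing attribute.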

KirPetrikov · 2024-03-09T16:25:48Z

GAD2.py

+        """
+        if self.check_sequence():
+            complement_sequence = ''
+            for nucleotide in self.seq:


This check can be done without a loop, simply by testing whether the sequence's characters form a subset of the allowed ones:

if set(self.seq).issubset(type(self).complement_rule):
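A standalone sketch of the subset check, assuming the allowed alphabet is the set of keys of the complement dictionary. `is_valid` is a hypothetical helper name.

```python
COMPLEMENT_RULE = {'a': 't', 'A': 'T', 't': 'a', 'T': 'A',
                   'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'}


def is_valid(seq: str) -> bool:
    # Iterating over a dict yields its keys, so it can be passed to issubset directly.
    return set(seq).issubset(COMPLEMENT_RULE)
```

Note that an empty sequence passes this check, because the empty set is a subset of any set; whether that is desirable is a separate design decision.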

KirPetrikov · 2024-03-09T16:37:56Z

GAD2.py

+        return True
+
+
+def int_to_tuple(input_parameters) -> tuple:


Moving this into a separate function is a very good decision; in my opinion, it simplifies the logic of the code.
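For context, a sketch of what such a helper might look like. Only the signature and the final `return (0, input_parameters)` are visible in the diff; the `isinstance` branch and the type hints are assumptions based on the docstring's description of the bounds parameters.

```python
from typing import Tuple, Union


def int_to_tuple(input_parameters: Union[int, Tuple[int, int]]) -> Tuple[int, int]:
    # Assumed branch: a tuple already holds (lower, upper) and is returned unchanged.
    if isinstance(input_parameters, tuple):
        return input_parameters
    # A bare int is treated as the upper bound, with 0 as the lower bound.
    return (0, input_parameters)
```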

KirPetrikov · 2024-03-09T16:39:03Z

GAD2.py

+    return (0, input_parameters)
+
+
+def run_fastq_filter(input_path: Optional[str] = None, output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO:


Why does input_path default to None? The function will fail anyway when it tries to read None, but if a required argument has no default value and the caller forgets to pass it, the error will be a different and clearer one.

Suggested change

-def run_fastq_filter(input_path: Optional[str] = None, output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO:
+def run_fastq_filter(input_path: Optional[str], output_filename: Optional[str] = None, gc_bounds: Union[int, tuple] = (0, 100), length_bounds: Union[int, tuple] = (0, 2**32), quality_threshold: int = 0) -> TextIO:
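The difference between the two failure modes can be sketched with two toy functions (hypothetical names, reduced to the extension check quoted earlier in the review):

```python
def with_default(input_path=None):
    # Forgetting the argument fails late, inside the body:
    # AttributeError: 'NoneType' object has no attribute 'endswith'
    return input_path.endswith('.fastq')


def without_default(input_path):
    # Forgetting the argument fails at the call itself:
    # TypeError: without_default() missing 1 required positional argument: 'input_path'
    return input_path.endswith('.fastq')
```

The TypeError names the missing argument directly, which is the clearer error the comment refers to.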

KirPetrikov · 2024-03-09T16:40:31Z

GAD2.py

+        - reads quality score (quality_threshold).
+
+    Input:
+    - input_path (str): path to .fastq file; include 4 strings: 1 - read ID, 2 - sequence, 3 - comment, 4 - quality. Default - None.


The description of the FASTQ format specification seems unnecessary here :)

Suggested change

-- input_path (str): path to .fastq file; include 4 strings: 1 - read ID, 2 - sequence, 3 - comment, 4 - quality. Default - None.
+- input_path (str): path to .fastq file.

KirPetrikov · 2024-03-09T16:41:29Z

GAD2.py

+    - output_filename (str): name of output file, by default, it will be saved in the directory 'fastq_filtrator_resuls'. Default name will be name of input file.
+    - gc_bounds (tuple or int): GC content filter parameters, it accepts lower and upper (tuple), or only upper threshold value (int). Default value (0, 100).
+    - length_bounds (tuple or int): read length filter parameters, it accepts lower and upper (tuple), or only upper threshold value (int). Default value (0, 2**32).
+    - quality_threshold (int): upper quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0. 


Only this is the lower bound, not the upper one.

Suggested change

-- quality_threshold (int): upper quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0.
+- quality_threshold (int): lower quality threshold in phred33 scale. Reads with average quality below the threshold are discarded. Default value - 0.
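To illustrate why the threshold is a lower bound, here is a small sketch that works on a raw phred33 quality string. The function names are hypothetical, and the PR's own check_quality takes a SeqIO record instead, so this is not its implementation.

```python
def mean_phred33(quality_string: str) -> float:
    # In phred33 encoding, each character's score is its ASCII code minus 33.
    return sum(ord(char) - 33 for char in quality_string) / len(quality_string)


def passes_quality(quality_string: str, quality_threshold: int = 0) -> bool:
    # Lower bound: a read is kept only if its mean quality is not below the threshold.
    return mean_phred33(quality_string) >= quality_threshold
```

With the default threshold of 0 every non-empty read passes, since phred scores cannot be negative. An empty quality string would raise ZeroDivisionError in this sketch.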

KirPetrikov · 2024-03-09T16:42:22Z

GAD2.py

+    gc_params = int_to_tuple(gc_bounds)
+    len_bound_params = int_to_tuple(length_bounds)    
+    "Filter and record results"
+    filtererd_fastq = open(os.path.join(output_path, output_filename), mode='w')


Working with files via with open is more reliable. If, for example, an error occurs before filtererd_fastq.close(), the file will not be closed.

Sarsaparella

Everything is neat and clean; the nice docstrings and formatting are a pleasure to see. Some stylistic decisions were not entirely clear to me, but maybe that is my own shortcoming. There is no AminoAcidSequence class, but never mind: it is almost the same as NucleicAcidSequence, only with a different alphabet 😁
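A rough sketch of what the missing class could look like, written standalone for brevity. In the PR it would presumably inherit from BiologicalSequence and implement its abstract methods, which are not shown in this review. The alphabet of the 20 standard one-letter amino acid codes is an assumption.

```python
class AminoAcidSequence:
    # One-letter codes of the 20 standard amino acids (assumed alphabet).
    ALPHABET = frozenset('ACDEFGHIKLMNPQRSTVWY')

    def __init__(self, seq: str) -> None:
        self.seq = seq

    def __len__(self) -> int:
        return len(self.seq)

    def check_sequence(self) -> bool:
        # Case-insensitive check that every residue is in the alphabet.
        return set(self.seq.upper()).issubset(self.ALPHABET)
```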

Sarsaparella · 2024-03-10T11:31:13Z

GAD2.py

+    def __init__(self, seq) -> None:
+        self.seq = seq


I don't quite understand why "-> None" is written here 🤔 although I like the concept itself (this is the first time I have come across it, and I will adopt it).

__init__ must return None by definition, otherwise Python raises an error, so yes, in the case of __init__ there is no need to do this. But for an ordinary custom function, such an annotation is useful.
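The runtime behaviour mentioned in the reply can be checked with a toy class (the class name `Broken` is hypothetical):

```python
class Broken:
    def __init__(self):
        # Returning anything other than None from __init__ is an error.
        return 42


def try_to_build() -> str:
    try:
        Broken()
    except TypeError as error:
        # Python reports that __init__() should return None.
        return type(error).__name__
    return 'no error'
```

Here `try_to_build()` returns `'TypeError'`: the error is raised when the instance is created, not when the class is defined.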

Sarsaparella · 2024-03-10T11:36:01Z

GAD2.py

+    def gc_content(self):
+        return (sum(1 for _ in re.finditer(r'[GCgc]', self.seq)))/self.__len__()


There is nothing wrong with this approach, but it looks a bit unintuitive and is hard to read 🤧
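One possibly more readable alternative, sketched as a plain function rather than a method. It computes the same fraction as the regex version in the diff, using `str.count` and `len()` instead of `re.finditer` and a direct `__len__()` call.

```python
def gc_content(seq: str) -> float:
    # Count G and C in either case, then divide by the sequence length.
    gc_count = sum(seq.count(nucleotide) for nucleotide in 'GCgc')
    return gc_count / len(seq)
```

As in the original expression, the result is a fraction between 0 and 1, and an empty sequence raises ZeroDivisionError.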

Sarsaparella · 2024-03-10T11:44:19Z

GAD2.py

+                       'g': 'c', 'G': 'C', 'c': 'g', 'C': 'G'}
+
+    def __init__(self, seq: str) -> None:
+        super().__init__(seq = seq)


Maybe I am missing something, but again it is not clear why (seq = seq) is written 🤔

In general, explicit is better than implicit, so spelling out which values are passed to which arguments sounds useful.

Add GAD2.py

5f18a28

grishchenkoira reviewed Mar 4, 2024

KirPetrikov reviewed Mar 9, 2024

Sarsaparella reviewed Mar 10, 2024
