Review PCDHA12 by nvaulin · Pull Request #27 · Python-BI-2023/Peer_review

nvaulin · 2024-02-26T17:57:14Z

Review PCDHA12

anshtompel

Good work! I liked the filter, and some of the function implementations are very compact. Thanks for the work!

anshtompel · 2024-03-08T17:47:42Z

PCDHA12.py

+def filter_fastq(input_path: str,
+                 gc_bounds: tuple = (0, 100),
+                 length_bounds: tuple = (0, 2 ** 32),


Everything looks great! Filtering on the bool returned by each check is a good solution.

anshtompel · 2024-03-08T17:50:40Z

PCDHA12.py

+class BiologicalSequence(ABC, str):
+    @abstractmethod
+    def check_alphabet(self) -> bool:
+        pass


Your abstract class is still missing a set of methods: support for len, indexing, and printing.
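A minimal sketch of what the missing interface could look like. It assumes the sequence is stored in a `sequence` attribute (as in the `complement` hunk) and, unlike the submitted code, does not inherit from `str`. The `DNAExample` subclass and its alphabet are hypothetical, added only so the abstract class can be exercised.

```python
from abc import ABC, abstractmethod


class BiologicalSequence(ABC):
    """Abstract interface: length, indexing, printing, alphabet check."""

    @abstractmethod
    def __len__(self) -> int:
        pass

    @abstractmethod
    def __getitem__(self, index):
        pass

    @abstractmethod
    def __str__(self) -> str:
        pass

    @abstractmethod
    def check_alphabet(self) -> bool:
        pass


class DNAExample(BiologicalSequence):
    """Hypothetical concrete subclass used only to illustrate the interface."""

    ALPHABET = frozenset("ACGTN")

    def __init__(self, sequence: str):
        self.sequence = sequence

    def __len__(self) -> int:
        return len(self.sequence)

    def __getitem__(self, index):
        # Slices return a new object of the same class; single indices return a str.
        if isinstance(index, slice):
            return type(self)(self.sequence[index])
        return self.sequence[index]

    def __str__(self) -> str:
        return self.sequence

    def check_alphabet(self) -> bool:
        return set(self.sequence.upper()) <= self.ALPHABET
```

Because all four methods are abstract, `BiologicalSequence()` itself raises `TypeError`, so no explicit guard in `__init__` is needed to prevent instantiation.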

anshtompel · 2024-03-08T17:51:40Z

PCDHA12.py

+        complement_seq = ''.join(type(self).COMPLEMENT_DICT.get(base)
+                                 for base in self.sequence)


Interesting construct :) I'll make a note of it.

anshtompel · 2024-03-08T17:54:46Z

PCDHA12.py

+        pass
+
+
+class NucleicAcidSequence(BiologicalSequence):


The classes are missing docstrings.

anshtompel · 2024-03-08T18:09:12Z

PCDHA12.py

+    ONE_LETTER_ALPHABET = ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
+                           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y')
+    THREE_LETTER_ALPHABET = ('Ala', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His',
+                             'Ile', 'Lys', 'Leu', 'Met', 'Pro', 'Gln', 'Arg',
+                             'Ser', 'Thr', 'Val', 'Trp', 'Tyr')


Nice that two alphabet variants are provided.

BeskrovnaiaM

Super! Well done 🥇

BeskrovnaiaM · 2024-03-09T14:06:57Z

PCDHA12.py

+class BiologicalSequence(ABC, str):
+    @abstractmethod
+    def check_alphabet(self) -> bool:
+        pass
+


More abstract methods should have been added here, but that has already been pointed out :) (Sorry, I only see other people's reviews while I'm doing my own check.)

That's fine! There is nothing wrong with repeating points that were already made, since your first priority is to write your own review.

BeskrovnaiaM · 2024-03-09T14:07:50Z

PCDHA12.py

+class RNASequence(NucleicAcidSequence):
+    ALPHABET = ('A', 'U', 'G', 'C', 'N')
+    COMPLEMENT_DICT = {'A': 'U', 'U': 'A', 'G': 'C', 'C': 'G', 'N': 'N',
+                       'a': 'u', 'u': 'a', 'g': 'c', 'c': 'g'}


Great! Keeping these dictionaries out of init is, in my opinion, very sensible, and so is not moving them to global scope outside the classes.

Also, these are now class attributes, so they don't need to be written in all caps.

BeskrovnaiaM · 2024-03-09T14:14:52Z

PCDHA12.py

+    def __init__(self, sequence):
+        raise NotImplementedError('An instance of this class cannot be created')


Here you could convert the sequence to uppercase right in init, so the dictionary would not have to store both lowercase and uppercase letters. More generally, I would move init into this class and remove it from the child classes: in terms of the input, amino acid, RNA, and DNA sequences are all just strings of letters. So instead of raising the error I would add the following and then inherit from it to avoid duplication.

Suggested change

-    def __init__(self, sequence):
-        raise NotImplementedError('An instance of this class cannot be created')
+    def __init__(self, sequence):
+        self.seq = sequence.upper()
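A minimal sketch of that idea, assuming the attribute is named `sequence` as in the `complement` hunk (the suggestion itself uses `seq`). Normalising case once in the shared `__init__` lets the complement table hold only uppercase keys. The `complement` body here is an illustration, not the submitted code.

```python
class NucleicAcidSequence:
    """Shared behaviour for DNA and RNA; subclasses supply COMPLEMENT_DICT."""

    COMPLEMENT_DICT: dict = {}

    def __init__(self, sequence: str):
        # Normalise case once so lookup tables need only uppercase keys.
        self.sequence = sequence.upper()

    def complement(self) -> "NucleicAcidSequence":
        complement_seq = "".join(self.COMPLEMENT_DICT[base]
                                 for base in self.sequence)
        return type(self)(complement_seq)


class RNASequence(NucleicAcidSequence):
    COMPLEMENT_DICT = {"A": "U", "U": "A", "G": "C", "C": "G", "N": "N"}
```

The trade-off is that the original letter case is discarded, which matters if case carries meaning in the input (for example, lowercase used to mark soft-masked regions).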

BeskrovnaiaM · 2024-03-09T14:19:21Z

PCDHA12.py

+    """ Reads a FASTQ file, filters sequences based on GC content, sequence
+    length and quality threshold, and writes the filtered sequences to
+    a new file """


Thanks for the docstring! I thought, though, that it should also describe the function's input and output. Something like this:

Suggested change

-    """ Reads a FASTQ file, filters sequences based on GC content, sequence
-    length and quality threshold, and writes the filtered sequences to
-    a new file """
+    """
+    Reads a FASTQ file, filters sequences based on GC content, sequence
+    length and quality threshold, and writes the filtered sequences to
+    a new file
+
+    Args:
+        input_path (str): Path to the input FASTQ file.
+        gc_bounds (tuple): A tuple of two integers specifying the lower and
+            upper bounds of the GC content, in percent. Sequences with GC
+            content outside these bounds will be filtered out.
+        length_bounds (tuple): A tuple of two integers specifying the lower and
+            upper bounds of the sequence length, in base pairs. Sequences
+            with length outside these bounds will be filtered out.
+        quality_threshold (int): An integer specifying the minimum average
+            quality score for a sequence to be kept. Sequences with average
+            quality score below this threshold will be filtered out.
+    """

icalledmyselfmoon · 2024-03-10T11:15:46Z

PCDHA12.py

+
+def filter_quality(record: SeqRecord, quality_threshold: int) -> bool:
+    avg_quality = np.mean(record.letter_annotations["phred_quality"])
+    return avg_quality >= quality_threshold


I really like the implementation of these functions: short and clear (every argument is annotated, almost everything fits on one line, and there are no unnecessary variables).

However, as I recall, the assignment says the bounds for length and GC content can be given either as a single number or as an interval. Your functions only handle the interval.
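One way to accept both forms is to normalise the argument before comparing. The sketch below assumes that a single number means an upper limit with a lower limit of 0; that convention is an assumption here and should be checked against the assignment text. `normalize_bounds` and `is_within` are hypothetical helper names.

```python
from typing import Tuple, Union

Bounds = Union[int, float, Tuple[float, float]]


def normalize_bounds(bounds: Bounds) -> Tuple[float, float]:
    """Turn a single upper limit into (0, limit); pass intervals through."""
    if isinstance(bounds, (int, float)):
        return (0, bounds)
    lower, upper = bounds
    return (lower, upper)


def is_within(value: float, bounds: Bounds) -> bool:
    """Return True if value lies inside the (inclusive) bounds."""
    lower, upper = normalize_bounds(bounds)
    return lower <= value <= upper
```

With this helper, `filter_length` and the GC filter could each call `is_within` and keep their one-line bodies.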

icalledmyselfmoon · 2024-03-10T11:45:38Z

PCDHA12.py

+    a new file """
+    path, filename = os.path.split(input_path)
+    name, ext = os.path.splitext(filename)
+    output_path = path + "/" + f"{name}_filtered{ext}"


I like the way the output file name is generated. The spec also says the user can provide the file name themselves, which could have been handled here as well.
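A possible shape for that, sketched with a hypothetical helper `build_output_path` and a hypothetical `output_filename` parameter (neither is in the submitted code). It also swaps the manual `path + "/"` concatenation for `os.path.join`, which avoids producing a path that starts with `/` when the input has no directory part.

```python
import os
from typing import Optional


def build_output_path(input_path: str,
                      output_filename: Optional[str] = None) -> str:
    """Use the caller's file name if given, else derive '<name>_filtered<ext>'."""
    directory, filename = os.path.split(input_path)
    if output_filename is None:
        name, ext = os.path.splitext(filename)
        output_filename = f"{name}_filtered{ext}"
    # os.path.join skips the separator when directory is an empty string.
    return os.path.join(directory, output_filename)
```

This sketch writes the output next to the input file in both cases; whether a user-supplied name may include its own directory is left open.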

icalledmyselfmoon · 2024-03-10T11:49:35Z

PCDHA12.py

+    output_path = path + "/" + f"{name}_filtered{ext}"
+    input_seq_iterator = SeqIO.parse(input_path, 'fastq')
+    filtered_seq_iterator = (record for record in input_seq_iterator
+                             if filter_length(record, length_bounds)


A great way to check all three conditions and write the result right away; the generator expression fits in concisely and efficiently here. It works because the filter functions return bool, which is a good approach.
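The expression in the hunk above is written in parentheses, so Python evaluates it lazily as a generator expression: each record is tested only when the writer asks for the next one. A self-contained sketch of the same pattern follows. It uses plain `(id, sequence)` tuples as a stand-in for `SeqRecord` so that it runs without Biopython, and the filter bodies are illustrative rather than the submitted ones.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (read id, sequence); stand-in for a SeqRecord


def gc_percent(seq: str) -> float:
    """GC content in percent; an empty sequence counts as 0."""
    return 100 * sum(base in "GCgc" for base in seq) / len(seq) if seq else 0.0


def filter_gc(record: Record, gc_bounds: Tuple[float, float]) -> bool:
    return gc_bounds[0] <= gc_percent(record[1]) <= gc_bounds[1]


def filter_length(record: Record, length_bounds: Tuple[int, int]) -> bool:
    return length_bounds[0] <= len(record[1]) <= length_bounds[1]


def filter_records(records: Iterable[Record],
                   gc_bounds: Tuple[float, float] = (0, 100),
                   length_bounds: Tuple[int, int] = (0, 2 ** 32)
                   ) -> Iterator[Record]:
    # Generator expression: records are tested one at a time on demand,
    # so the full input never has to be held in memory.
    return (record for record in records
            if filter_length(record, length_bounds)
            and filter_gc(record, gc_bounds))
```

Because `and` short-circuits, a record that fails the length check is never passed to the GC check.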

icalledmyselfmoon · 2024-03-10T12:02:06Z

PCDHA12.py

+class RNASequence(NucleicAcidSequence):
+    ALPHABET = ('A', 'U', 'G', 'C', 'N')
+    COMPLEMENT_DICT = {'A': 'U', 'U': 'A', 'G': 'C', 'C': 'G', 'N': 'N',
+                       'a': 'u', 'u': 'a', 'g': 'c', 'c': 'g'}


Yes, agreed, it's better to keep the dictionary outside the function. Beyond that, I like the inheritance logic and the polymorphism here: the methods that validate the alphabet and build the complementary strand are implemented for each child class under the same names, but they account for the differences between DNA and RNA.

icalledmyselfmoon · 2024-03-10T12:19:45Z

PCDHA12.py

+    def transcribe(self) -> RNASequence:
+        """Transcribes DNA sequence into RNA sequence.
+        Returns RMASequence object"""
+        return RNASequence(self.sequence.replace('T', 'U').replace('t', 'u'))


I implemented transcription the same way, but it turned out not to be the most optimal option, since a double replace means two passes over the sequence.

Well done for returning an instance of the RNASequence class.
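A single-pass alternative can be sketched with `str.maketrans` and `str.translate`, both built-in `str` methods. The function below works on a plain string; in the reviewed class the result would still be wrapped in `RNASequence(...)`.

```python
# Build the translation table once; str.translate then maps every
# character of the input in a single pass.
DNA_TO_RNA = str.maketrans("Tt", "Uu")


def transcribe(sequence: str) -> str:
    """Replace T with U (and t with u), leaving all other characters as-is."""
    return sequence.translate(DNA_TO_RNA)
```

Whether this is actually faster than two chained `replace` calls depends on the input and the interpreter, so it is worth measuring with `timeit` before treating it as an optimisation.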

icalledmyselfmoon · 2024-03-10T12:20:37Z

PCDHA12.py

+
+    def get_molecular_weight(self) -> float:
+        """ calculate molecular weight for one-letter amino acid sequence"""
+        WEIGHT_DICT = {


As far as I know, it's better to move the dictionary out of the function.

icalledmyselfmoon · 2024-03-10T12:22:09Z

PCDHA12.py

+            'D': 115.087, 'Q': 128.129, 'K': 128.172, 'E': 129.114, 'M': 131.196,
+            'H': 137.139, 'F': 147.174, 'R': 156.186, 'Y': 163.173, 'W': 186.210
+        }
+        terminal_h_oh_weight = 18.02


Cool, even the mass of the terminal HOH is accounted for :)
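The earlier suggestion to move the weight dictionary out of the method could look roughly like this, with the table and the water mass as class attributes so they are built once rather than on every call. The class name `AminoAcidSequence` is hypothetical, and the table holds only the residues visible in the hunk above; the rest of the submitted table is not shown there and is not reconstructed here.

```python
class AminoAcidSequence:
    # Only the residue weights visible in the reviewed hunk; the full
    # table is elided in the diff and deliberately left incomplete here.
    WEIGHT_DICT = {
        'D': 115.087, 'Q': 128.129, 'K': 128.172, 'E': 129.114, 'M': 131.196,
        'H': 137.139, 'F': 147.174, 'R': 156.186, 'Y': 163.173, 'W': 186.210,
    }
    TERMINAL_H_OH_WEIGHT = 18.02

    def __init__(self, sequence: str):
        self.sequence = sequence

    def get_molecular_weight(self) -> float:
        """Sum residue weights and add one water for the free termini."""
        residues = sum(self.WEIGHT_DICT[aa] for aa in self.sequence.upper())
        return residues + self.TERMINAL_H_OH_WEIGHT
```

With this partial table, a sequence containing any other residue raises `KeyError`, which makes the gap visible instead of silently returning a wrong mass.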

Add PCDHA12.py

e08357a

anshtompel reviewed Mar 8, 2024


BeskrovnaiaM reviewed Mar 9, 2024


icalledmyselfmoon reviewed Mar 10, 2024




4 participants