Review TCF4 by nvaulin · Pull Request #37 · Python-BI-2023/Peer_review

nvaulin · 2024-02-26T17:57:41Z

Review TCF4

angrygeese

Оба задания сделаны, поэтому уже хорошая работа.

К первой части, FastQ-фильтратору, принципиальных замечаний нет.

Во второй, "Biological sequence world", лучше было бы:

Не наследоваться от встроенных классов.
Не прописывать какие-либо детали касательно классов-наследников в родительском классе, а задействовать атрибуты классов-наследников.

angrygeese · 2024-03-02T21:56:31Z

TCF4.py

+class InvalidInputError(ValueError):
+    pass
+
+class BiologicalSequence(ABC, str):


При таком наследовании BiologicalSequence и его наследники переймут все методы класса str, которые ты не переопределишь. Например, у экземпляров твоего DNASequence есть методы count и transtale.

Suggested change

class BiologicalSequence(ABC, str):

class BiologicalSequence(ABC):

angrygeese · 2024-03-02T22:01:16Z

TCF4.py

+            if letter in freq_dict:
+                freq_dict[letter] += 1
+            else:
+                freq_dict[letter] = 1


Логика абсолютно правильная, есть способ записать её несколько короче.

У словарей в Python есть метод get, он принимает два аргумента - ключ и любой аргумент. Если ключ в словаре есть, возвращает значение из словаря по этому ключу, если нет - возвращает второй аргумент.

Suggested change

if letter in freq_dict:

freq_dict[letter] += 1

else:

freq_dict[letter] = 1

freq_dict[letter] = freq_dict.get(letter, 0) + 1

Его используют сильно реже, но тут кажется отличный кейс для get

angrygeese · 2024-03-02T22:02:04Z

TCF4.py

+    def amino_acid_frequency(self):
+        """Calculates molecular weight of a protein


Не из той функции взята докстринга.

angrygeese · 2024-03-02T22:02:35Z

TCF4.py

+class AminoAcidSequence(BiologicalSequence):
+    def __init__(self, seq):
+        self.seq = seq


Здесь есть маленькая деталь: ни в классе-родителе, ни в самом AminoAcidSequence не реализована проверка алфавита последовательности на корректность. Поэтому я могу, например, создать экземпляр AminoAcidSequence, передав конструктору класса число 1000. И потом запросить для него частоту аминокислот.

angrygeese · 2024-03-03T07:13:19Z

TCF4.py

+    @abstractmethod
+    def __init__(self, seq):
+        self.seq = seq


Насколько я успел посмотреть, в абстрактных классах не используют связку __init__ и @abstractmethod.

Обычно делают так: в родительском, абстрактном классе объявляют переменную класса __slots__. Это такой способ зарезервировать память под набор атрибутов. Затем в __init__ класса-наследника их определяют.

Suggested change

@abstractmethod

def __init__(self, seq):

self.seq = seq

__slots__ = ('seq',)

Если вдруг станет интересно, что можно посмотреть:

Исходный код абстрактного класса для последовательностей в Python.

Особенности использования __slots__ детально разобраны тут.

upd: Я всё-таки встретил связку __init__ и @abstractmethod.

angrygeese · 2024-03-04T19:49:30Z

TCF4.py

+    filename = input_path
+    records = SeqIO.parse(filename, "fastq")
+    ###quality filter
+    good_reads = (rec for rec in records if min(rec.letter_annotations["phred_quality"]) >= quality_threshold)
+    result_quality = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
+    result_quality_GC = SeqIO.parse("good_quality.fastq", "fastq")
+    ###GC content filter
+    min_gc_content = gc_bounds[0]
+    max_gc_content = gc_bounds[1]
+    GC_quality_filt = []
+
+    for sequence in result_quality_GC:
+        if min_gc_content <= GC(sequence.seq) <= max_gc_content:
+            GC_quality_filt.append(sequence)
+
+    result_quality = SeqIO.write(GC_quality_filt, "good_quality_GC.fastq", "fastq")
+    result_quality_GC_length = SeqIO.parse("good_quality_GC.fastq", "fastq")
+
+    ##length filter
+    filtered_GC_quality_length = []
+
+    for sequence in result_quality_GC_length:
+        if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:
+            filtered_GC_quality_length.append(sequence)
+
+    result_quality = SeqIO.write(filtered_GC_quality_length, output_filename, "fastq")
+
+    print(result_quality)


Логику последовательных проверок можно реализовать чуть короче, не создавая итератор для каждой новой проверки:

Suggested change

filename = input_path

records = SeqIO.parse(filename, "fastq")

###quality filter

good_reads = (rec for rec in records if min(rec.letter_annotations["phred_quality"]) >= quality_threshold)

result_quality = SeqIO.write(good_reads, "good_quality.fastq", "fastq")

result_quality_GC = SeqIO.parse("good_quality.fastq", "fastq")

###GC content filter

min_gc_content = gc_bounds[0]

max_gc_content = gc_bounds[1]

GC_quality_filt = []

for sequence in result_quality_GC:

if min_gc_content <= GC(sequence.seq) <= max_gc_content:

GC_quality_filt.append(sequence)

result_quality = SeqIO.write(GC_quality_filt, "good_quality_GC.fastq", "fastq")

result_quality_GC_length = SeqIO.parse("good_quality_GC.fastq", "fastq")

##length filter

filtered_GC_quality_length = []

for sequence in result_quality_GC_length:

if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:

filtered_GC_quality_length.append(sequence)

result_quality = SeqIO.write(filtered_GC_quality_length, output_filename, "fastq")

print(result_quality)

def filter_fastq_corr(input_path: str, quality_threshold: int, output_filename="final_filtered_corr.fastq", gc_bounds=(40, 60), length_bounds=(50, 350)):

if isinstance(gc_bounds, int):

gc_bounds = (0, gc_bounds)

min_gc_content, max_gc_content = gc_bounds

length_min, length_max = length_bounds

records = SeqIO.parse(input_path, "fastq")

good_reads = []

for rec in records:

rec_len, rec_gc_content = len(rec.seq), GC(rec.seq)

if (

min(rec.letter_annotations["phred_quality"]) >= quality_threshold

and max_gc_content >= rec_gc_content >= min_gc_content

and length_max >= rec_len >= length_min

):

good_reads.append(rec)

result_quality = SeqIO.write(good_reads, output_filename, "fastq")

print(result_quality)

angrygeese · 2024-03-04T19:51:50Z

TCF4.py

+    for sequence in result_quality_GC_length:
+        if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:


Suggested change

for sequence in result_quality_GC_length:

if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:

length_min, length_max = length_bounds

for sequence in result_quality_GC_length:

seq_len = len(sequence.seq)

if seq_len >= length_min and seq_len <= length_max:

angrygeese · 2024-03-04T19:54:21Z

TCF4.py

+from Bio import SeqIO
+from Bio.SeqUtils import GC


Работа с SeqIO.parse реализована хорошо, единственное, он избыточен в паре мест.

angrygeese · 2024-03-06T17:23:21Z

TCF4.py

+
+class NucleicAcidSequence(BiologicalSequence):
+    def __init__(self, seq):
+        super().__init__(seq)


Благодаря этому моменту я разобрался в деталях работы super(). И что им, оказывается, можно задействовать смысловой код из абстрактного метода класса-родителя.

angrygeese · 2024-03-06T17:48:55Z

TCF4.py

+    def __len__(self):
+        return len(self.seq)
+
+    def __getitem__(self, index):
+        return self.seq[int(index)]


Я бы выполнил это задание в таком ключе, определив каждый метод как абстрактный, без конкретной реализации:

Suggested change

def __len__(self):

return len(self.seq)

def __getitem__(self, index):

return self.seq[int(index)]

@abstractmethod

def __len__(self):

raise NotImplementedError

@abstractmethod

def __getitem__(self, index):

raise NotImplementedError

Питон допускает реализацию в абстрактном методе, но это нужно для отдельных случаев с множественным наследованием.

Add TCF4.py

db1b1a6

angrygeese reviewed Mar 6, 2024

View reviewed changes

	class BiologicalSequence(ABC, str):
	class BiologicalSequence(ABC):

		def amino_acid_frequency(self):
		"""Calculates molecular weight of a protein

		for sequence in result_quality_GC_length:
		if len(sequence.seq) >= length_bounds[0] and len(sequence.seq) <= length_bounds[1]:

Conversation

nvaulin commented Feb 26, 2024

Uh oh!

angrygeese left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

angrygeese Mar 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

angrygeese Mar 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

angrygeese Mar 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

angrygeese Mar 4, 2024 •

edited

Loading

angrygeese Mar 6, 2024 •

edited

Loading

angrygeese Mar 6, 2024 •

edited

Loading