csv
Okerew edited this page Sep 13, 2025 · 1 revision
The CSVParser class is designed to parse CSV files containing biological sequence data (DNA, RNA, or Protein) and convert them into corresponding Python objects (DNA, RNA, or Protein). It leverages the pandas library to read and process CSV files, making it easy to handle large datasets efficiently.

    class CSVParser:
        def __init__(self, file_path):
            """
            Initialize a new CSVParser object.

            :param file_path: Path to the CSV file containing biological sequence data
            """
            ...

| Attribute | Type | Description |
|---|---|---|
| `file_path` | `str` | Path to the CSV file. |
| `data` | `pandas.DataFrame` | DataFrame containing the parsed CSV data. |

- `__init__(self, file_path)`: Initializes a new `CSVParser` instance with the specified CSV file path. The CSV file is read into a `pandas.DataFrame` for further processing.
- `parse_records(self) -> List[Union[DNA, RNA, Protein]]`: Parses the CSV data and converts each row into the appropriate biological object (`DNA`, `RNA`, or `Protein`).
  - Returns: A list of parsed biological objects (`DNA`, `RNA`, or `Protein`).
  - Details:
    - Iterates over each row in the CSV file.
    - Determines the type of sequence (`DNA`, `RNA`, or `Protein`) based on the `type` column.
    - Creates the corresponding object and appends it to the result list.

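
The steps listed for `parse_records` amount to a dispatch on the `type` column. The following is a minimal sketch of that idea, not the library's implementation: the `DNA`, `RNA`, and `Protein` dataclasses are hypothetical stand-ins for the real classes, `parse_rows` is an illustrative name, and the standard-library `csv` module is used in place of pandas so the example runs without extra dependencies.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Union


# Hypothetical stand-ins for the library's DNA, RNA and Protein classes.
@dataclass
class DNA:
    sequence: str


@dataclass
class RNA:
    sequence: str


@dataclass
class Protein:
    sequence: str
    id: str


def parse_rows(csv_text: str) -> List[Union[DNA, RNA, Protein]]:
    """Convert each CSV row into a DNA, RNA or Protein stand-in."""
    records: List[Union[DNA, RNA, Protein]] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        kind = row["type"]  # raises KeyError if the column is missing
        if kind == "DNA":
            records.append(DNA(row["sequence"]))
        elif kind == "RNA":
            records.append(RNA(row["sequence"]))
        elif kind == "Protein":
            records.append(Protein(row["sequence"], row.get("id") or ""))
        # Rows with any other type value are skipped.
    return records


SAMPLE = """sequence,type,id
ATGCGATCG,DNA,
AUGCCGUA,RNA,
ACDEFGHIKLMNPQRSTVWY,Protein,Protein1
"""

parsed = parse_rows(SAMPLE)
for record in parsed:
    print(record)
```

With pandas, the same loop would typically run over the rows of the `DataFrame` instead of a `csv.DictReader`; the branching on `type` stays the same.
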
The CSV file should have the following columns:

| Column Name | Description |
|---|---|
| `sequence` | The biological sequence (e.g., "ATGC" for DNA, "AUG" for RNA, "ACDEF" for Protein). |
| `type` | The type of sequence (`DNA`, `RNA`, or `Protein`). |
| `id` | (Optional) Unique identifier for the sequence (required for `Protein` objects). |

Example CSV file:

    sequence,type,id
    ATGCGATCG,DNA,
    AUGCCGUA,RNA,
    ACDEFGHIKLMNPQRSTVWY,Protein,Protein1

Example usage:

    # Initialize the CSVParser with the path to the CSV file
    parser = CSVParser(file_path="sequences.csv")

    # Parse the records
    parsed_elements = parser.parse_records()

    # Print the parsed elements
    for element in parsed_elements:
        print(element)

Output:

    DNA Sequence: ATGCGATCG
    RNA Sequence: AUGCCGUA
    Protein: Protein1, Sequence: ACDEFGHIKLMNPQRSTVWY

Dependencies:

- `pandas`: Used for reading and processing CSV files.
- `DNA`: Class representing DNA sequences.
- `RNA`: Class representing RNA sequences.
- `Protein`: Class representing protein sequences.

Error handling:

- If the CSV file does not contain the required columns (`sequence`, `type`), the `CSVParser` will raise a `KeyError`.
- If the `type` column contains an invalid value (not `DNA`, `RNA`, or `Protein`), the corresponding row will be skipped.

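
The two error-handling rules can be illustrated with a small standalone check. This is a sketch under stated assumptions: `usable_rows`, `REQUIRED_COLUMNS`, and `VALID_TYPES` are hypothetical names, the header is validated up front (the real class may raise its `KeyError` at a different point, for example on first column access), and the standard-library `csv` module stands in for pandas.

```python
import csv
import io

# Hypothetical constants mirroring the documented rules.
REQUIRED_COLUMNS = ("sequence", "type")
VALID_TYPES = {"DNA", "RNA", "Protein"}


def usable_rows(csv_text):
    """Return rows with a valid type; raise KeyError if a required column is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise KeyError(f"missing required column(s): {', '.join(missing)}")
    # Rows whose type is not DNA, RNA or Protein are dropped silently.
    return [row for row in reader if row["type"] in VALID_TYPES]


rows = usable_rows("sequence,type,id\nATGC,DNA,\nGGGG,Lipid,\n")
print(len(rows))  # 1: the Lipid row is skipped

try:
    usable_rows("sequence,id\nATGC,x\n")
except KeyError as exc:
    print("rejected:", exc)
```

Because invalid rows are skipped without an error, comparing the number of parsed objects with the number of CSV rows is a quick way to spot typos in the `type` column.
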
Notes:

- The `CSVParser` class is designed to be flexible and can be extended to support additional biological sequence types or custom parsing logic.
- The `id` column is optional and only required for `Protein` objects. If not provided, the `Protein` constructor should handle it appropriately.
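
One possible way to support additional sequence types is a lookup table that maps each `type` value to a builder function. The sketch below is illustrative only: `BUILDERS`, `build`, and the `Peptide` type are hypothetical and not part of the documented class, and the `Protein` stand-in accepts a missing `id` as `None`, which is just one way a constructor could "handle it appropriately". Note that pandas reads an empty CSV cell as `NaN` by default, so a pandas-based parser would need an equivalent check for a missing `id`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Protein:  # hypothetical stand-in for the library's Protein class
    sequence: str
    id: Optional[str] = None


@dataclass
class Peptide:  # hypothetical additional sequence type
    sequence: str


# Map each value of the type column to a function that builds the object.
BUILDERS: Dict[str, Callable[[dict], object]] = {
    "Protein": lambda row: Protein(row["sequence"], row.get("id") or None),
    "Peptide": lambda row: Peptide(row["sequence"]),
}


def build(row: dict):
    """Build the object for one row, or return None for an unknown type."""
    builder = BUILDERS.get(row["type"])
    return builder(row) if builder else None


print(build({"sequence": "ACDEF", "type": "Protein", "id": ""}))
print(build({"sequence": "ACD", "type": "Peptide"}))
```

Adding a new type then only requires a new entry in the table, with no change to the parsing loop.
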