-
Notifications
You must be signed in to change notification settings - Fork 14
Restructured project: Moved files into case_studies/CARD, removed red… #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 40 commits
cfc7bf6
9615773
9f06bb2
69916b8
0a3572e
08ed58f
a2643f1
9be2e3b
b0dbb23
8ddf883
b0c5dfa
2eb20ce
a532154
7aa8917
52ce540
4177654
444b520
e223f86
56addcc
f2af6f4
f590d94
5d174be
54e7b5b
1195e1e
2d80ab5
b709416
eca5d37
993bc09
ab67c1c
13a6e8b
9a7688d
14992a3
e105319
f6b87e7
bbb8c91
8e68be7
8afcba8
bcbd971
aee86b7
1dc5c81
4ddc8e1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,257 @@ | ||||||
| # config.R | ||||||
| url <- "https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2" | ||||||
| destfile <- "broadstreet-v3.3.0.tar.bz2" | ||||||
|
|
||||||
| # Download the file | ||||||
| download.file(url, destfile) | ||||||
|
|
||||||
| #Extract the file | ||||||
| if (!require("R.utils")) { | ||||||
| install.packages("R.utils") | ||||||
| library(R.utils) | ||||||
| } | ||||||
|
|
||||||
|
|
||||||
| # Extract the tar file | ||||||
| untar("broadstreet-v3.3.0.tar.bz2", exdir = "CARD_data") | ||||||
|
|
||||||
|
|
||||||
| # Map CARD Short Name | ||||||
|
|
||||||
| # Parse the required files using readr::read_delim | ||||||
| aro_index <- read_delim("CARD_data/aro_index.tsv", delim = "\t", col_names = TRUE) | ||||||
| antibiotics_data <- read_delim("CARD_data/shortname_antibiotics.tsv", delim = "\t", col_names = TRUE) | ||||||
| pathogens_data <- read_delim("CARD_data/shortname_pathogens.tsv", delim = "\t", col_names = TRUE) | ||||||
|
|
||||||
|
|
||||||
|
|
||||||
| # Extract pathogen, gene, drug, and include Protein.Accession from 'CARD Short Name' | ||||||
| extract_card_info <- function(card_short_name, drug_class, `Protein Accession`, `DNA Accession`) { | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. rename all colnames with spaces and special characters to now include only @AbhirupaGhosh @charmvang @awasyn @epbrenner @the-mayer -- using camelCase for colnames or snake_case (without caps)? |
||||||
| # Split the CARD Short Name by underscores | ||||||
| split_names <- unlist(strsplit(card_short_name, "_")) | ||||||
|
|
||||||
| # Initialize variables with defaults | ||||||
| pathogen <- NA | ||||||
| gene <- NA | ||||||
| drug <- drug_class # Default to Drug Class column | ||||||
|
|
||||||
| # Determine the information based on the split names and patterns | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you share an example file (snippet pre and post name cleanup)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hello @jananiravi , by this do you mean I should use the View() function in R to allow for the visual inspection of the dataset before and after processing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, I meant snapshots or example data stored locally (as part of the commit) to be able to run the code and check locally. |
||||||
| if (length(split_names) == 1) { | ||||||
| # Gene only (single part entry) | ||||||
| gene <- split_names[1] | ||||||
| pathogen <- "MULTI" # Assign MULTI as default for pathogen | ||||||
| } else if (all(toupper(split_names) == split_names)) { | ||||||
| # Gene complex (all uppercase entries) | ||||||
| gene <- card_short_name # Entire entry as gene | ||||||
| pathogen <- "MULTI" | ||||||
| } else if (length(split_names) == 2) { | ||||||
| # Pathogen-Gene scenario | ||||||
| pathogen <- split_names[1] | ||||||
| gene <- split_names[2] | ||||||
| } else if (length(split_names) == 3) { | ||||||
| # Pathogen-Gene-Drug scenario | ||||||
| pathogen <- split_names[1] | ||||||
| gene <- split_names[2] | ||||||
| drug <- split_names[3] # Assign drug from the split entry | ||||||
| } | ||||||
|
|
||||||
| # If both pathogen and gene are NA, classify as complex gene | ||||||
| if (is.na(pathogen) && is.na(gene)) { | ||||||
| gene <- card_short_name # Assign entire CARD Short Name as gene | ||||||
| pathogen <- "MULTI" # Default to MULTI for pathogen | ||||||
| } | ||||||
|
|
||||||
| # Handle Protein Accession | ||||||
| if (is.na(`Protein Accession`) || `Protein Accession` == "") { | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if renamed above, there will be no colnames with spaces |
||||||
| `Protein Accession` <- `DNA Accession` # Use DNA Accession if Protein Accession is NA | ||||||
| } | ||||||
|
|
||||||
| return(list(Pathogen = pathogen, Gene = gene, Drug = drug, Protein_Accession = `Protein Accession`)) | ||||||
| } | ||||||
|
|
||||||
| # Apply the function to the data frame | ||||||
| resistance_profile_data <- aro_index %>% | ||||||
| mutate(extracted_info = pmap(list(`CARD Short Name`, `Drug Class`, `Protein Accession`, `DNA Accession`), | ||||||
| extract_card_info)) %>% | ||||||
| unnest_wider(extracted_info) | ||||||
|
|
||||||
| # View the resulting data frame | ||||||
| print(resistance_profile_data) | ||||||
|
|
||||||
| # Define a relative path for saving the data | ||||||
| output_path <- file.path("CARD_data", "resistance_profile_data.tsv") | ||||||
|
|
||||||
| # Save resistance_profile_data to the specified path | ||||||
| write_delim(resistance_profile_data, output_path, delim = "\t") | ||||||
|
|
||||||
| # Load data | ||||||
| resistance_profile_data <- read_delim(output_path, delim = "\t", col_names = TRUE) | ||||||
| antibiotics_data <- read_delim("CARD_data/shortname_antibiotics.tsv", delim = "\t", col_names = TRUE) | ||||||
| pathogens_data <- read_delim("CARD_data/shortname_pathogens.tsv", delim = "\t", col_names = TRUE) | ||||||
|
|
||||||
|
|
||||||
| # Merge the extracted resistance profile data with antibiotics_data on Drug | ||||||
| merged_data_antibiotics <- left_join( | ||||||
| resistance_profile_data, | ||||||
| antibiotics_data, | ||||||
| by = c("Drug" = "AAC Abbreviation"), # Adjusting for abbreviations between datasets | ||||||
| relationship = "many-to-many" | ||||||
| ) | ||||||
|
|
||||||
| # Merge the result with pathogens_data on Pathogen, renaming Pathogen.y to Pathogen_Full_Name | ||||||
| merged_data_pathogens <- left_join( | ||||||
| merged_data_antibiotics, | ||||||
| pathogens_data, | ||||||
| by = c("Pathogen" = "Abbreviation") | ||||||
| ) %>% | ||||||
| rename(Pathogen_Full_Name = Pathogen.y) | ||||||
|
|
||||||
| # Assign "Multi-species" to Pathogen_Full_Name where Pathogen values are "MULTI" | ||||||
| merged_data_pathogens <- merged_data_pathogens %>% | ||||||
| mutate(Pathogen_Full_Name = if_else(Pathogen == "MULTI", "Multi-species", Pathogen_Full_Name)) | ||||||
|
|
||||||
|
|
||||||
| # Assign "Multi-class" to Molecule where Drug values are full names (not abbreviations) | ||||||
| merged_data_pathogens <- merged_data_pathogens %>% | ||||||
| mutate(Molecule = if_else(grepl(" ", Drug) | grepl("-", Drug), "Multi-class", Molecule)) | ||||||
|
|
||||||
|
|
||||||
| #FASTA sequences | ||||||
| #Install and Load required packages | ||||||
| if (!requireNamespace("rentrez", quietly = TRUE)) { | ||||||
| install.packages("rentrez") | ||||||
| } | ||||||
| if (!requireNamespace("XML", quietly = TRUE)) { | ||||||
| install.packages("XML") | ||||||
| } | ||||||
| if (!requireNamespace("stringr", quietly = TRUE)) { | ||||||
| install.packages("stringr") | ||||||
| } | ||||||
|
|
||||||
|
|
||||||
| library(rentrez) | ||||||
| library(XML) | ||||||
| library(stringr) | ||||||
|
|
||||||
| # Filter for the target drug (DAP) and pathogen (Staphylococcus aureus) | ||||||
| filter_resistance_mechanisms <- function(data, drug, bug, exclude_multiclass = FALSE, species_restricted = TRUE) { | ||||||
|
|
||||||
| # Filter by drug using partial match to include multiclass entries containing the target drug | ||||||
| filtered_data <- data %>% | ||||||
| filter(grepl(drug, Drug, ignore.case = TRUE)) | ||||||
|
|
||||||
| # Filter by pathogen, using partial match | ||||||
| filtered_data <- filtered_data %>% | ||||||
| filter(grepl(bug, Pathogen_Full_Name, ignore.case = TRUE)) | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if using snake_case, there will be no caps as in Pathogen_Full_Name. |
||||||
|
|
||||||
| # Optionally exclude multiclass resistance if exclude_multiclass = TRUE | ||||||
| if (exclude_multiclass) { | ||||||
| filtered_data <- filtered_data %>% | ||||||
| filter(!grepl(";", Drug)) # Only include entries with single drug classes | ||||||
| } | ||||||
|
|
||||||
| # Optionally restrict to species-specific mechanisms if species_restricted = TRUE | ||||||
| if (species_restricted) { | ||||||
| filtered_data <- filtered_data %>% | ||||||
| filter(Pathogen_Full_Name == bug) # Include only entries with exact match to the bug of interest | ||||||
| } | ||||||
|
|
||||||
| return(filtered_data) | ||||||
| } | ||||||
|
|
||||||
| # Usage example for Staphylococcus aureus resistant to DAP, including multispecies and multiclass resistance | ||||||
| filtered_data_saurdap <- filter_resistance_mechanisms( | ||||||
| data = merged_data_pathogens, | ||||||
| drug = "DAP", | ||||||
| bug = "Staphylococcus aureus", | ||||||
| exclude_multiclass = FALSE, | ||||||
| species_restricted = FALSE | ||||||
| ) | ||||||
|
|
||||||
| # View the filtered results | ||||||
| View(filtered_data_saurdap) | ||||||
|
|
||||||
|
|
||||||
| # Fetch FASTA sequence from Entrez | ||||||
| fetch_fasta_sequence <- function(protein_accession) { | ||||||
| tryCatch({ | ||||||
| # Fetch the FASTA sequence using Entrez | ||||||
| fasta_seq <- rentrez::entrez_fetch(db = "protein", | ||||||
| id = protein_accession, | ||||||
| rettype = "fasta", | ||||||
| retmode = "text") | ||||||
|
|
||||||
| if (!is.null(fasta_seq)) { | ||||||
| # Ensure the first line starts with ">" | ||||||
| if (!grepl("^>", fasta_seq[1])) { | ||||||
| fasta_seq[1] <- paste0(">", fasta_seq[1]) | ||||||
| } | ||||||
|
|
||||||
| # Split the sequence into lines | ||||||
| lines <- str_split(fasta_seq, "\n")[[1]] | ||||||
|
|
||||||
| # Join the lines back together | ||||||
| fasta_seq <- paste(lines, collapse = "\n") | ||||||
|
|
||||||
| return(fasta_seq) | ||||||
| } else { | ||||||
| warning(paste("Failed to retrieve FASTA sequence for protein accession:", protein_accession)) | ||||||
| return(NULL) | ||||||
| } | ||||||
| }, error = function(e) { | ||||||
| warning(paste("Error fetching FASTA sequence for protein accession:", protein_accession, ":", e$message)) | ||||||
| return(NULL) | ||||||
| }) | ||||||
| } | ||||||
|
|
||||||
|
|
||||||
| # Define the output file for the FASTA sequences | ||||||
| output_fasta_file <- "Staph_aureus_Daptomycin_sequences.fasta" | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If using short names for species (4-char) and drugs (antibiotics, 3-char).
Suggested change
|
||||||
|
|
||||||
| # Initialize an empty character vector to store the sequences | ||||||
| combined_sequences <- character() | ||||||
|
|
||||||
| # Loop through each Protein Accession in the filtered data to fetch sequences | ||||||
| for (i in 1:nrow(filtered_data_saurdap)) { | ||||||
| # Get the Protein Accession ID | ||||||
| Protein_accession <- filtered_data_saurdap$Protein_Accession[i] | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Confusing alternating use of Protein_ vs. protein_accession. 🤔 |
||||||
|
|
||||||
| cat("Fetching sequence for Protein Accession:", protein_accession, "\n") # Debugging message | ||||||
|
|
||||||
| # Fetch the FASTA sequence | ||||||
| fasta_sequence <- fetch_fasta_sequence(protein_accession) | ||||||
|
|
||||||
| # If the sequence was fetched successfully, add it to the combined_sequences vector | ||||||
| if (!is.null(fasta_sequence)) { | ||||||
| combined_sequences <- c(combined_sequences, fasta_sequence) | ||||||
| cat("Successfully fetched sequence for:", protein_accession, "\n") | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure if this is for multiple or single accession numbers. change accordingly?
Suggested change
|
||||||
| } else { | ||||||
| cat("Failed to fetch sequence for:", protein_accession, "\n") | ||||||
| } | ||||||
| } | ||||||
|
|
||||||
| # Check if there are any fetched sequences | ||||||
| if (length(combined_sequences) > 0) { | ||||||
| # Save all fetched sequences to a FASTA file | ||||||
| writeLines(combined_sequences, output_fasta_file) | ||||||
| cat("Sequences saved to", output_fasta_file, "\n") | ||||||
| } else { | ||||||
| cat("No sequences were fetched, so no FASTA file was created.\n") | ||||||
| } | ||||||
|
|
||||||
| # Read the contents of the file | ||||||
| fasta_contents <- readLines(output_fasta_file) | ||||||
|
|
||||||
| # Print the contents | ||||||
| cat(fasta_contents, sep = "\n") | ||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Cateline, thanks for adding this README. Out of curiosity, are these descriptions already paraphrased from the original source (CARD), or yet to be? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The descriptions are from the original source (CARD) and have not been paraphrased yet |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,32 @@ | ||||||
| # CARD README | ||||||
|
|
||||||
| ## Source: | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| This dataset was downloaded from the Comprehensive Antibiotic Resistance Database (CARD) in 2024-10 at https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2 | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
|
||||||
|
|
||||||
| CITATION: | ||||||
|
|
||||||
| Alcock et al. 2023. "CARD 2023: expanded curation, support for machine learning, and resistome | ||||||
| prediction at the Comprehensive Antibiotic Resistance Database" Nucleic Acids Research, | ||||||
| 51, D690-D699. https://pubmed.ncbi.nlm.nih.gov/36263822/ | ||||||
|
|
||||||
| ## CARD SHORT NAMES | ||||||
|
|
||||||
| The CARD database uses standardized abbreviations, known as CARD Short Names, for AMR gene names associated with Antibiotic Resistance Ontology terms. These names are created for compatibility across data files and outputs from the Resistance Gene Identifier (RGI). Short Names for genes with 15 or fewer characters retain the original gene name, while longer names are abbreviated to uniquely represent each gene or protein. All CARD Short Names replace whitespace with underscores. For pathogen names, CARD follows the convention of capitalizing the first letter of the genus followed by the first three letters of the species in lowercase. Where applicable, CARD Short Names adopt formats such as “pathogen_gene,” “pathogen_gene_drug,” or “gene_drug.” Full lists of these abbreviations are available in the provided files: | ||||||
|
|
||||||
| shortname_antibiotics.tsv | ||||||
| shortname_pathogens.tsv" | ||||||
|
|
||||||
|
|
||||||
| ## FASTA | ||||||
|
|
||||||
| The FASTA files included here contain retrieved sequences of antimicrobial resistance genes. | ||||||
|
|
||||||
| ## Data Files Downloaded | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| aro_index.tsv | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| This file contains an index of ARO (Antibiotic Resistance Ontology) identifiers with associated GenBank accessions. Each entry includes information used to link antibiotic resistance genes to GenBank sequences. | ||||||
| shortname_antibiotics.tsv | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| Contains standardized abbreviations for antibiotics used in CARD’s short names. These abbreviations, which follow conventions from the American Society for Microbiology (ASM) and additional custom terms, provide a uniform naming system for antibiotics referenced within CARD data. | ||||||
|
|
||||||
| shortname_pathogens.tsv | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| Lists standardized abbreviations for pathogens used in CARD. Each abbreviation represents pathogen names in a condensed format, commonly the first letter of the genus followed by the first three letters of the species. This abbreviation system simplifies pathogen referencing in CARD outputs. | ||||||
Uh oh!
There was an error while loading. Please reload this page.