Draft
41 commits
cfc7bf6
Fixes Phase 1 of Issue #27 and Issue #103
Cateline Oct 20, 2024
9615773
Added link to MolEvolvR Case Study report. Fixes Phase 2 of Issue #27
Cateline Oct 21, 2024
9f06bb2
Delete unnecessary files
Cateline Oct 22, 2024
69916b8
Remove unnecessary CARD data files
Cateline Oct 22, 2024
0a3572e
Remove unnecessary CARD data files
Cateline Oct 22, 2024
08ed58f
Remove unnecessary CARD data files
Cateline Oct 22, 2024
a2643f1
Remove unnecessary CARD data files
Cateline Oct 22, 2024
9be2e3b
Remove unnecessary CARD data files
Cateline Oct 22, 2024
b0dbb23
Remove unnecessary CARD data files
Cateline Oct 22, 2024
8ddf883
Remove unnecessary CARD data files
Cateline Oct 22, 2024
b0c5dfa
Remove unnecessary CARD data files
Cateline Oct 22, 2024
2eb20ce
Remove unnecessary CARD data files
Cateline Oct 22, 2024
a532154
Remove unnecessary CARD data files
Cateline Oct 22, 2024
7aa8917
Remove unnecessary CARD data files
Cateline Oct 22, 2024
52ce540
Update case_studies/CARD/Bug-Drug Code.R
Cateline Oct 22, 2024
4177654
Update case_studies/CARD/Bug-Drug Code.R
Cateline Oct 22, 2024
444b520
Update Bug-Drug Code.R
Cateline Oct 24, 2024
e223f86
Add HTML report file to reports folder
Cateline Oct 24, 2024
56addcc
Delete case_studies/CARD/reports/download.htm
Cateline Oct 24, 2024
f2af6f4
Add HTML Report File
Cateline Oct 24, 2024
f590d94
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
5d174be
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
54e7b5b
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
1195e1e
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
2d80ab5
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
b709416
Update CARD-Download-README.txt
Cateline Oct 25, 2024
eca5d37
Rename Staph_aureus_Daptomycin_sequences5.fasta to Staph_aureus_Dapto…
Cateline Oct 25, 2024
993bc09
Update Bug-Drug Code.R
Cateline Oct 27, 2024
ab67c1c
Update Bug-Drug Code.R
Cateline Oct 27, 2024
13a6e8b
Enhance logic for determining pathogen, gene, and drug fields
Cateline Oct 31, 2024
9a7688d
Enhance data mapping logic
Cateline Nov 1, 2024
14992a3
Add function to fetch and save protein FASTA sequences from Entrez
Cateline Nov 1, 2024
e105319
Update Bug-Drug Code.R
Cateline Nov 1, 2024
f6b87e7
Update case_studies/CARD/Bug-Drug Code.R
Cateline Nov 1, 2024
bbb8c91
Update case_studies/CARD/Bug-Drug Code.R
Cateline Nov 1, 2024
8e68be7
Update Bug-Drug Code.R
Cateline Nov 6, 2024
8afcba8
Update Bug-Drug Code.R
Cateline Nov 6, 2024
bcbd971
Refactor drug-pathogen filtering to support multiple drug classes and…
Cateline Nov 13, 2024
aee86b7
Data Cleanup Comparison
Cateline Nov 24, 2024
1dc5c81
Automate Case-Studies Issue #27
Cateline Nov 24, 2024
4ddc8e1
Rename Bug-Drug Code.R to bug_drug.R
jananiravi Nov 26, 2024
257 changes: 257 additions & 0 deletions case_studies/CARD/Bug-Drug Code.R
@@ -0,0 +1,257 @@
# config.R
url <- "https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2"
destfile <- "broadstreet-v3.3.0.tar.bz2"

# Download the file
download.file(url, destfile)

# Ensure R.utils is installed and loaded
if (!requireNamespace("R.utils", quietly = TRUE)) {
  install.packages("R.utils")
}
library(R.utils)


# Extract the tar file
untar(destfile, exdir = "CARD_data")


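# Map CARD Short Name

# Load the packages used below (read_delim, %>%, mutate, pmap, unnest_wider)
library(readr)
library(dplyr)
library(purrr)
library(tidyr)

# Parse the required files using readr::read_delim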
aro_index <- read_delim("CARD_data/aro_index.tsv", delim = "\t", col_names = TRUE)
antibiotics_data <- read_delim("CARD_data/shortname_antibiotics.tsv", delim = "\t", col_names = TRUE)
pathogens_data <- read_delim("CARD_data/shortname_pathogens.tsv", delim = "\t", col_names = TRUE)



# Extract pathogen, gene, drug, and include Protein.Accession from 'CARD Short Name'
extract_card_info <- function(card_short_name, drug_class, `Protein Accession`, `DNA Accession`) {
Member

rename all colnames with spaces and special characters to now include only _. Also avoid multiple cases.

@AbhirupaGhosh @charmvang @awasyn @epbrenner @the-mayer -- using camelCase for colnames or snake_case (without caps)?
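
A minimal sketch of the renaming suggested above, assuming snake_case is the convention the team settles on. The helper name `to_snake_case` is hypothetical and not part of this PR; it uses base R only, and `janitor::clean_names()` would be an off-the-shelf alternative.

```r
# Hypothetical helper (not part of the PR): convert column names to snake_case
to_snake_case <- function(x) {
  x <- trimws(x)
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # split camelCase boundaries
  x <- gsub("[^A-Za-z0-9]+", "_", x)            # spaces and punctuation become "_"
  x <- gsub("^_+|_+$", "", x)                   # drop leading/trailing underscores
  tolower(x)
}

# Column names as they appear in the function signature above
example_names <- c("CARD Short Name", "Protein Accession", "DNA Accession", "Drug Class")
clean_names <- to_snake_case(example_names)
print(clean_names)

# Applied to a data frame, this would be:
# names(aro_index) <- to_snake_case(names(aro_index))
```

With names cleaned once right after reading the file, the backtick-quoted arguments in `extract_card_info()` would no longer be needed.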

  # Split the CARD Short Name by underscores
  split_names <- unlist(strsplit(card_short_name, "_"))

  # Initialize variables with defaults
  pathogen <- NA
  gene <- NA
  drug <- drug_class # Default to Drug Class column

  # Determine the information based on the split names and patterns
Member

can you share an example file (snippet pre and post name cleanup)?

Collaborator Author

> can you share an example file (snippet pre and post name cleanup)?

Hello @jananiravi, by this do you mean I should use the View() function in R to allow visual inspection of the dataset before and after processing?

Member

No, I meant snapshots or example data stored locally (as part of the commit) to be able to run the code and check locally.
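
One possible shape for such a snapshot, as a sketch only: a handful of short names before cleanup and the parsed columns after. The short names below are made up for illustration and are not copied from `aro_index.tsv`; `parse_short_name` is a hypothetical base-R stand-in that mirrors the branching in `extract_card_info()`, except that the drug defaults to `NA` instead of the Drug Class column.

```r
# Illustrative short names only; real values would come from aro_index.tsv
example_short_names <- c("geneA", "ABC_DEF", "Saur_geneB", "Saur_geneC_DAP")

# Hypothetical stand-in for the parsing rules in extract_card_info()
parse_short_name <- function(short_name) {
  parts <- strsplit(short_name, "_", fixed = TRUE)[[1]]
  out <- data.frame(short_name = short_name, pathogen = "MULTI",
                    gene = short_name, drug = NA_character_,
                    stringsAsFactors = FALSE)
  is_complex <- all(toupper(parts) == parts)  # all-uppercase parts: gene complex
  if (length(parts) == 2 && !is_complex) {
    out$pathogen <- parts[1]
    out$gene <- parts[2]
  } else if (length(parts) == 3 && !is_complex) {
    out$pathogen <- parts[1]
    out$gene <- parts[2]
    out$drug <- parts[3]
  }
  out
}

# "After" snapshot: one row per short name
after <- do.call(rbind, lapply(example_short_names, parse_short_name))
print(after)

# The snapshot could be committed next to the script, for example:
# write.table(after, "example_after.tsv", sep = "\t", row.names = FALSE, quote = FALSE)
```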

  if (length(split_names) == 1) {
    # Gene only (single part entry)
    gene <- split_names[1]
    pathogen <- "MULTI" # Assign MULTI as default for pathogen
  } else if (all(toupper(split_names) == split_names)) {
    # Gene complex (all uppercase entries)
    gene <- card_short_name # Entire entry as gene
    pathogen <- "MULTI"
  } else if (length(split_names) == 2) {
    # Pathogen-Gene scenario
    pathogen <- split_names[1]
    gene <- split_names[2]
  } else if (length(split_names) == 3) {
    # Pathogen-Gene-Drug scenario
    pathogen <- split_names[1]
    gene <- split_names[2]
    drug <- split_names[3] # Assign drug from the split entry
  }

  # If both pathogen and gene are NA, classify as complex gene
  if (is.na(pathogen) && is.na(gene)) {
    gene <- card_short_name # Assign entire CARD Short Name as gene
    pathogen <- "MULTI" # Default to MULTI for pathogen
  }

  # Handle Protein Accession
  if (is.na(`Protein Accession`) || `Protein Accession` == "") {
Member

if renamed above, there will be no colnames with spaces

    `Protein Accession` <- `DNA Accession` # Use DNA Accession if Protein Accession is NA or empty
  }

  return(list(Pathogen = pathogen, Gene = gene, Drug = drug, Protein_Accession = `Protein Accession`))
}

# Apply the function to the data frame
resistance_profile_data <- aro_index %>%
  mutate(extracted_info = pmap(list(`CARD Short Name`, `Drug Class`, `Protein Accession`, `DNA Accession`),
                               extract_card_info)) %>%
  unnest_wider(extracted_info)

# View the resulting data frame
print(resistance_profile_data)

# Define a relative path for saving the data
output_path <- file.path("CARD_data", "resistance_profile_data.tsv")

# Save resistance_profile_data to the specified path
write_delim(resistance_profile_data, output_path, delim = "\t")

# Load data
resistance_profile_data <- read_delim(output_path, delim = "\t", col_names = TRUE)
antibiotics_data <- read_delim("CARD_data/shortname_antibiotics.tsv", delim = "\t", col_names = TRUE)
pathogens_data <- read_delim("CARD_data/shortname_pathogens.tsv", delim = "\t", col_names = TRUE)


# Merge the extracted resistance profile data with antibiotics_data on Drug
merged_data_antibiotics <- left_join(
  resistance_profile_data,
  antibiotics_data,
  by = c("Drug" = "AAC Abbreviation"), # Adjusting for abbreviations between datasets
  relationship = "many-to-many"
)

# Merge the result with pathogens_data on Pathogen, renaming Pathogen.y to Pathogen_Full_Name
merged_data_pathogens <- left_join(
  merged_data_antibiotics,
  pathogens_data,
  by = c("Pathogen" = "Abbreviation")
) %>%
  rename(Pathogen_Full_Name = Pathogen.y)

# Assign "Multi-species" to Pathogen_Full_Name where Pathogen values are "MULTI"
merged_data_pathogens <- merged_data_pathogens %>%
  mutate(Pathogen_Full_Name = if_else(Pathogen == "MULTI", "Multi-species", Pathogen_Full_Name))


# Assign "Multi-class" to Molecule where Drug values are full names (not abbreviations)
merged_data_pathogens <- merged_data_pathogens %>%
  mutate(Molecule = if_else(grepl(" ", Drug) | grepl("-", Drug), "Multi-class", Molecule))


# FASTA sequences
# Install and load required packages
if (!requireNamespace("rentrez", quietly = TRUE)) {
  install.packages("rentrez")
}
if (!requireNamespace("XML", quietly = TRUE)) {
  install.packages("XML")
}
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}


library(rentrez)
library(XML)
library(stringr)

# Filter resistance mechanisms by drug and pathogen (e.g. DAP and Staphylococcus aureus)
filter_resistance_mechanisms <- function(data, drug, bug, exclude_multiclass = FALSE, species_restricted = TRUE) {

  # Filter by drug using partial match to include multiclass entries containing the target drug
  filtered_data <- data %>%
    filter(grepl(drug, Drug, ignore.case = TRUE))

  # Filter by pathogen, using partial match
  filtered_data <- filtered_data %>%
    filter(grepl(bug, Pathogen_Full_Name, ignore.case = TRUE))
Member

if using snake_case, there will be no caps as in Pathogen_Full_Name.


  # Optionally exclude multiclass resistance if exclude_multiclass = TRUE
  if (exclude_multiclass) {
    filtered_data <- filtered_data %>%
      filter(!grepl(";", Drug)) # Only include entries with single drug classes
  }

  # Optionally restrict to species-specific mechanisms if species_restricted = TRUE
  if (species_restricted) {
    filtered_data <- filtered_data %>%
      filter(Pathogen_Full_Name == bug) # Include only entries with exact match to the bug of interest
  }

  return(filtered_data)
}

# Usage example for Staphylococcus aureus resistant to DAP, including multispecies and multiclass resistance
filtered_data_saurdap <- filter_resistance_mechanisms(
  data = merged_data_pathogens,
  drug = "DAP",
  bug = "Staphylococcus aureus",
  exclude_multiclass = FALSE,
  species_restricted = FALSE
)

# View the filtered results
View(filtered_data_saurdap)


# Fetch FASTA sequence from Entrez
fetch_fasta_sequence <- function(protein_accession) {
  tryCatch({
    # Fetch the FASTA sequence using Entrez
    fasta_seq <- rentrez::entrez_fetch(db = "protein",
                                       id = protein_accession,
                                       rettype = "fasta",
                                       retmode = "text")

    if (!is.null(fasta_seq)) {
      # Ensure the first line starts with ">"
      if (!grepl("^>", fasta_seq[1])) {
        fasta_seq[1] <- paste0(">", fasta_seq[1])
      }

      # Split the sequence into lines
      lines <- str_split(fasta_seq, "\n")[[1]]

      # Join the lines back together
      fasta_seq <- paste(lines, collapse = "\n")

      return(fasta_seq)
    } else {
      warning(paste("Failed to retrieve FASTA sequence for protein accession:", protein_accession))
      return(NULL)
    }
  }, error = function(e) {
    warning(paste("Error fetching FASTA sequence for protein accession:", protein_accession, ":", e$message))
    return(NULL)
  })
}


# Define the output file for the FASTA sequences
output_fasta_file <- "Staph_aureus_Daptomycin_sequences.fasta"
Member

If using short names for species (4-char) and drugs (antibiotics, 3-char).
arg = antibiotic resistance genes, for example.
Which shortnames are you planning to use?
cc: @AbhirupaGhosh @charmvang @awasyn @epbrenner

Suggested change
- output_fasta_file <- "Staph_aureus_Daptomycin_sequences.fasta"
+ output_fasta_file <- "Saur_dap_arg.fasta"
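
To make the naming scheme in this suggestion concrete, here is a small sketch that builds the file name from a 4-character species short name and a 3-character drug abbreviation. The helper `make_arg_filename` and its `suffix` argument are hypothetical; which short names to adopt is still the open question above.

```r
# Hypothetical helper: build "<species>_<drug>_<suffix>.fasta" from short names
make_arg_filename <- function(species_abbr, drug_abbr, suffix = "arg") {
  # Enforce the 4-char species / 3-char drug convention discussed in this thread
  stopifnot(nchar(species_abbr) == 4, nchar(drug_abbr) == 3)
  paste0(species_abbr, "_", tolower(drug_abbr), "_", suffix, ".fasta")
}

output_fasta_file <- make_arg_filename("Saur", "DAP")
print(output_fasta_file)
```

Deriving the name from the same `bug` and `drug` values passed to `filter_resistance_mechanisms()` would keep the file name and the filter in sync.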


# Initialize an empty character vector to store the sequences
combined_sequences <- character()

# Loop through each Protein Accession in the filtered data to fetch sequences
for (i in seq_len(nrow(filtered_data_saurdap))) {
  # Get the Protein Accession ID
  protein_accession <- filtered_data_saurdap$Protein_Accession[i]
Member

Confusing alternating use of Protein_ vs. protein_accession. 🤔


  cat("Fetching sequence for Protein Accession:", protein_accession, "\n") # Debugging message

  # Fetch the FASTA sequence
  fasta_sequence <- fetch_fasta_sequence(protein_accession)

  # If the sequence was fetched successfully, add it to the combined_sequences vector
  if (!is.null(fasta_sequence)) {
    combined_sequences <- c(combined_sequences, fasta_sequence)
    cat("Successfully fetched sequence for:", protein_accession, "\n")
Member

Not sure if this is for multiple or single accession numbers. change accordingly?

Suggested change
- cat("Successfully fetched sequence for:", protein_accession, "\n")
+ cat("Successfully fetched sequences for:", protein_accession, "\n")
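
On the single-versus-multiple question: the loop fetches one accession per iteration. A sketch of the same collection step written as a function may make that explicit. `collect_sequences` is a hypothetical name, and the fetcher is passed in as an argument so the example can run with a stub instead of calling Entrez; in the script, `fetch_fasta_sequence` would be passed instead.

```r
# Hypothetical helper: fetch one sequence per accession and record failures
collect_sequences <- function(accessions, fetch_fun) {
  # Drop missing or empty accessions and avoid fetching duplicates
  accessions <- unique(accessions[!is.na(accessions) & nzchar(accessions)])
  results <- lapply(accessions, fetch_fun)
  fetched <- !vapply(results, is.null, logical(1))
  list(
    sequences = unlist(results[fetched], use.names = FALSE),
    failed = accessions[!fetched]
  )
}

# Stub fetcher for illustration only; it makes no network request
stub_fetch <- function(acc) {
  if (acc == "BAD") NULL else paste0(">", acc, "\nMKT")
}

res <- collect_sequences(c("ACC1", "BAD", "ACC2", NA), stub_fetch)
print(res$failed)
```

Returning the failed accessions alongside the sequences gives one summary message at the end instead of one per iteration.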

  } else {
    cat("Failed to fetch sequence for:", protein_accession, "\n")
  }
}

# Check if there are any fetched sequences
if (length(combined_sequences) > 0) {
  # Save all fetched sequences to a FASTA file
  writeLines(combined_sequences, output_fasta_file)
  cat("Sequences saved to", output_fasta_file, "\n")

  # Read the file back and print its contents
  fasta_contents <- readLines(output_fasta_file)
  cat(fasta_contents, sep = "\n")
} else {
  cat("No sequences were fetched, so no FASTA file was created.\n")
}











Member

Suggested change

32 changes: 32 additions & 0 deletions case_studies/CARD/CARD_data/CARD-Download-README.txt
Member

@Cateline, thanks for adding this README. Out of curiosity, are these descriptions already paraphrased from the original source (CARD), or yet to be?

Collaborator Author

The descriptions are from the original source (CARD) and have not been paraphrased yet.

@@ -0,0 +1,32 @@
# CARD README

## Source:
Member

Suggested change
- ## Source:
+ ## Source

This dataset was downloaded from the Comprehensive Antibiotic Resistance Database (CARD) in 2024-10 at https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2
Member

Suggested change
- This dataset was downloaded from the Comprehensive Antibiotic Resistance Database (CARD) in 2024-10 at https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2
+ This dataset and associated README were downloaded from the Comprehensive Antibiotic Resistance Database (CARD) (2024-10) at https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2.



CITATION:

Alcock et al. 2023. "CARD 2023: expanded curation, support for machine learning, and resistome
prediction at the Comprehensive Antibiotic Resistance Database" Nucleic Acids Research,
51, D690-D699. https://pubmed.ncbi.nlm.nih.gov/36263822/

## CARD SHORT NAMES

The CARD database uses standardized abbreviations, known as CARD Short Names, for AMR gene names associated with Antibiotic Resistance Ontology terms. These names are created for compatibility across data files and outputs from the Resistance Gene Identifier (RGI). Short Names for genes with 15 or fewer characters retain the original gene name, while longer names are abbreviated to uniquely represent each gene or protein. All CARD Short Names replace whitespace with underscores. For pathogen names, CARD follows the convention of capitalizing the first letter of the genus followed by the first three letters of the species in lowercase. Where applicable, CARD Short Names adopt formats such as “pathogen_gene,” “pathogen_gene_drug,” or “gene_drug.” Full lists of these abbreviations are available in the provided files:

shortname_antibiotics.tsv
shortname_pathogens.tsv
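
As an illustration of the pathogen naming convention described above, the following sketch derives an abbreviation from a genus and species name. The function `abbreviate_pathogen` is hypothetical and only encodes the general rule; the authoritative abbreviations are the ones listed in shortname_pathogens.tsv, which may include exceptions.

```r
# Hypothetical helper: first letter of the genus (uppercase) followed by
# the first three letters of the species (lowercase)
abbreviate_pathogen <- function(full_name) {
  parts <- strsplit(trimws(full_name), "\\s+")[[1]]
  stopifnot(length(parts) >= 2)  # expects "Genus species"
  paste0(toupper(substr(parts[1], 1, 1)), tolower(substr(parts[2], 1, 3)))
}

abbr <- abbreviate_pathogen("Staphylococcus aureus")
print(abbr)
```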


## FASTA

The FASTA files included here contain retrieved sequences of antimicrobial resistance genes.

## Data Files Downloaded
Member

Suggested change
- ## Data Files Downloaded
+ ## Data files downloaded

aro_index.tsv
Member

Suggested change
- aro_index.tsv
+ `aro_index.tsv`

This file contains an index of ARO (Antibiotic Resistance Ontology) identifiers with associated GenBank accessions. Each entry includes information used to link antibiotic resistance genes to GenBank sequences.

shortname_antibiotics.tsv
Member

Suggested change
- shortname_antibiotics.tsv
+ `shortname_antibiotics.tsv`

Contains standardized abbreviations for antibiotics used in CARD’s short names. These abbreviations, which follow conventions from the American Society for Microbiology (ASM) and additional custom terms, provide a uniform naming system for antibiotics referenced within CARD data.

shortname_pathogens.tsv
Member

Suggested change
- shortname_pathogens.tsv
+ `shortname_pathogens.tsv`

Lists standardized abbreviations for pathogens used in CARD. Each abbreviation represents pathogen names in a condensed format, commonly the first letter of the genus followed by the first three letters of the species. This abbreviation system simplifies pathogen referencing in CARD outputs.