41 commits
cfc7bf6
Fixes Phase 1 of Issue #27 and Issue #103
Cateline Oct 20, 2024
9615773
Added link to MolEvolvR Case Study report. Fixes Phase 2 of Issue #27
Cateline Oct 21, 2024
9f06bb2
Delete unnecessary files
Cateline Oct 22, 2024
69916b8
Remove unnecessary CARD data files
Cateline Oct 22, 2024
0a3572e
Remove unnecessary CARD data files
Cateline Oct 22, 2024
08ed58f
Remove unnecessary CARD data files
Cateline Oct 22, 2024
a2643f1
Remove unnecessary CARD data files
Cateline Oct 22, 2024
9be2e3b
Remove unnecessary CARD data files
Cateline Oct 22, 2024
b0dbb23
Remove unnecessary CARD data files
Cateline Oct 22, 2024
8ddf883
Remove unnecessary CARD data files
Cateline Oct 22, 2024
b0c5dfa
Remove unnecessary CARD data files
Cateline Oct 22, 2024
2eb20ce
Remove unnecessary CARD data files
Cateline Oct 22, 2024
a532154
Remove unnecessary CARD data files
Cateline Oct 22, 2024
7aa8917
Remove unnecessary CARD data files
Cateline Oct 22, 2024
52ce540
Update case_studies/CARD/Bug-Drug Code.R
Cateline Oct 22, 2024
4177654
Update case_studies/CARD/Bug-Drug Code.R
Cateline Oct 22, 2024
444b520
Update Bug-Drug Code.R
Cateline Oct 24, 2024
e223f86
Add HTML report file to reports folder
Cateline Oct 24, 2024
56addcc
Delete case_studies/CARD/reports/download.htm
Cateline Oct 24, 2024
f2af6f4
Add HTML Report File
Cateline Oct 24, 2024
f590d94
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
5d174be
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
54e7b5b
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
1195e1e
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
2d80ab5
Update case_studies/CARD/CARD_data/CARD-Download-README.txt
Cateline Oct 25, 2024
b709416
Update CARD-Download-README.txt
Cateline Oct 25, 2024
eca5d37
Rename Staph_aureus_Daptomycin_sequences5.fasta to Staph_aureus_Dapto…
Cateline Oct 25, 2024
993bc09
Update Bug-Drug Code.R
Cateline Oct 27, 2024
ab67c1c
Update Bug-Drug Code.R
Cateline Oct 27, 2024
13a6e8b
Enhance logic for determining pathogen, gene, and drug fields
Cateline Oct 31, 2024
9a7688d
Enhance data mapping logic
Cateline Nov 1, 2024
14992a3
Add function to fetch and save protein FASTA sequences from Entrez
Cateline Nov 1, 2024
e105319
Update Bug-Drug Code.R
Cateline Nov 1, 2024
f6b87e7
Update case_studies/CARD/Bug-Drug Code.R
Cateline Nov 1, 2024
bbb8c91
Update case_studies/CARD/Bug-Drug Code.R
Cateline Nov 1, 2024
8e68be7
Update Bug-Drug Code.R
Cateline Nov 6, 2024
8afcba8
Update Bug-Drug Code.R
Cateline Nov 6, 2024
bcbd971
Refactor drug-pathogen filtering to support multiple drug classes and…
Cateline Nov 13, 2024
aee86b7
Data Cleanup Comparison
Cateline Nov 24, 2024
1dc5c81
Automate Case-Studies Issue #27
Cateline Nov 24, 2024
4ddc8e1
Rename Bug-Drug Code.R to bug_drug.R
jananiravi Nov 26, 2024
154 changes: 154 additions & 0 deletions case_studies/CARD/Bug-Drug Code.R
@@ -0,0 +1,154 @@
# config.R
url <- "https://card.mcmaster.ca/download/0/broadstreet-v3.3.0.tar.bz2"
destfile <- "broadstreet-v3.3.0.tar.bz2"

# Download the file
download.file(url, destfile)

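# Load R.utils (provides bunzip2), installing it only if it is missing
if (!requireNamespace("R.utils", quietly = TRUE)) {
  install.packages("R.utils")
}
library(R.utils)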

# Decompress the file
bunzip2("broadstreet-v3.3.0.tar.bz2", destname = "broadstreet-v3.3.0.tar", remove = FALSE)
file.rename("broadstreet-v3.3.0.tar", "broadstreet-v3.3.0_old.tar")

# Extract the tar file
untar("broadstreet-v3.3.0_old.tar", exdir = "CARD_data")

# List the contents of the extraction directory
list.files("CARD_data")

# Parse the aro_index.tsv file using read.delim
# (the file name is lowercase; paths are case-sensitive on Linux)
aro_index <- read.delim("CARD_data/aro_index.tsv", header = TRUE, sep = "\t")


# Map CARD Short Name
# Load necessary library
library(dplyr)

# Read the files
aro_index <- read.delim("CARD_data/aro_index.tsv", sep = "\t", header = TRUE)
antibiotics_data <- read.delim("CARD_data/shortname_antibiotics.tsv", sep = "\t", header = TRUE)
pathogens_data <- read.delim("CARD_data/shortname_pathogens.tsv", sep = "\t", header = TRUE)


# Mutate data
aro_index <- aro_index %>%
Contributor

This step yields many NA values, so I just want to confirm that this is intended. Multi-pathogen genes such as aadA or acrB do not parse into the "pathogen / gene / drug" pattern, so you end up with rows like:

pathogen,gene,drug
Abau,ampC,BLA
Abau,Abaf,NA
acrB,NA,NA
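
If the NA values are not intended, one possible approach is sketched below. It assigns a pathogen only when the first token of the short name is a known pathogen abbreviation, and otherwise treats the whole name as the gene. The helper `parse_short_name` and its `known_pathogens` argument are hypothetical names that are not part of this PR; in practice the abbreviations would presumably come from the `Abbreviation` column of `shortname_pathogens.tsv`.

```r
# Hypothetical helper, not part of this PR: split CARD Short Names into
# pathogen / gene / drug only when the first token is a known pathogen
# abbreviation; otherwise treat the whole name as the gene.
parse_short_name <- function(short_names, known_pathogens) {
  parts <- strsplit(short_names, "_", fixed = TRUE)
  n_parts <- lengths(parts)
  first <- vapply(parts, `[`, character(1), 1)
  has_pathogen <- n_parts >= 2 & first %in% known_pathogens

  # Take the i-th token for the selected names, NA everywhere else
  pick <- function(i, selected) {
    out <- rep(NA_character_, length(parts))
    out[selected] <- vapply(parts[selected], `[`, character(1), i)
    out
  }

  data.frame(
    short_name = short_names,
    pathogen = ifelse(has_pathogen, first, NA_character_),
    gene = ifelse(has_pathogen, pick(2, has_pathogen), short_names),
    drug = pick(3, has_pathogen & n_parts == 3),
    stringsAsFactors = FALSE
  )
}

# Short names reconstructed from the rows quoted above
parsed <- parse_short_name(c("Abau_ampC_BLA", "Abau_Abaf", "acrB"),
                           known_pathogens = "Abau")
print(parsed)
```

With this variant, acrB keeps its gene name and gets NA only for pathogen and drug, while Abau_ampC_BLA still splits into all three fields.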

  mutate(
    pathogen = sapply(strsplit(CARD.Short.Name, "_"), `[`, 1),  # First part: pathogen
    gene = sapply(strsplit(CARD.Short.Name, "_"), `[`, 2),      # Second part: gene (NA if absent)
    # Third part: drug, only when the name has exactly three parts
    drug = ifelse(lengths(strsplit(CARD.Short.Name, "_")) == 3,
                  sapply(strsplit(CARD.Short.Name, "_"), `[`, 3), NA)
    # Protein.Accession is already a column of aro_index and is carried through unchanged
  )

# View the mutated data
head(aro_index)


# aro_index already carries the pathogen, gene, and drug columns (plus Protein.Accession)
# from the mutate step above, so a second extraction is not needed
aro_index_clean <- aro_index

# Merge aro_index_clean with the antibiotics_data and pathogens_data
# For merging with antibiotics_data
merged_data_antibiotics <- left_join(aro_index_clean, antibiotics_data,
                                     by = c("drug" = "AAC.Abbreviation"))

# For merging with pathogens_data
merged_data_pathogens <- left_join(merged_data_antibiotics, pathogens_data,
                                   by = c("pathogen" = "Abbreviation"))

# View the resulting merged data
head(merged_data_pathogens)


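# Drop duplicate rows and rows whose abbreviation did not match a known pathogen
cleaned_data <- merged_data_pathogens %>%
  distinct() %>%
  filter(!is.na(Pathogen))  # 'Pathogen' is the full name joined from pathogens_data
if (interactive()) View(cleaned_data)  # View() needs an interactive session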

# Group by Pathogen, Gene, Drug, and Protein.Accession, then summarize Antibiotic information
summarized_data <- cleaned_data %>%
  group_by(Pathogen, Gene = gene, Drug = drug, Protein_Accession = Protein.Accession) %>%
  summarize(Antibiotic_Info = paste(unique(Molecule), collapse = ", "), .groups = "drop") %>%
  arrange(Pathogen, Gene, Drug, Protein_Accession)

# Filter for Staphylococcus aureus and DAP (Bug-Drug of Interest)
staph_aureus_dap_combinations <- summarized_data %>%
  filter(Pathogen == "Staphylococcus aureus", Drug == "DAP")

# View the filtered data
head(staph_aureus_dap_combinations)


# Fetch FASTA sequences from Entrez using protein accession
# Load required packages (XML is not called directly in this script)
library(rentrez)
library(stringr)


# Fetch FASTA sequence from Entrez
fetch_fasta_sequence <- function(protein_accession) {
  tryCatch({
    # Fetch the FASTA record from Entrez as a single text string
    fasta_seq <- rentrez::entrez_fetch(db = "protein",
                                       id = protein_accession,
                                       rettype = "fasta",
                                       retmode = "text")

    # Treat an empty response, or one without a FASTA header line, as a failure
    # rather than prepending ">" to text that may not be a sequence record
    if (is.null(fasta_seq) || length(fasta_seq) != 1 || !grepl("^>", fasta_seq)) {
      warning(paste("Failed to retrieve FASTA sequence for protein accession:", protein_accession))
      return(NULL)
    }

    # Drop trailing whitespace so records concatenate without blank lines
    sub("\\s+$", "", fasta_seq)
  }, error = function(e) {
    warning(paste("Error fetching FASTA sequence for protein accession:", protein_accession, ":", e$message))
    return(NULL)
  })
}

# Loop through staph_aureus_dap_combinations to fetch and save FASTA sequences
combined_sequences <- character()

# seq_len() skips the loop entirely when the filter returned no rows
for (i in seq_len(nrow(staph_aureus_dap_combinations))) {
  # Fetch FASTA sequence for each protein accession
  protein_accession <- staph_aureus_dap_combinations$Protein_Accession[i]
  fasta_sequence <- fetch_fasta_sequence(protein_accession)

  if (!is.null(fasta_sequence)) {
    combined_sequences <- c(combined_sequences, fasta_sequence)
  }
}

# Save the combined FASTA sequences
filename <- "Staph_aureus_Daptomycin_sequences5.fasta"

writeLines(combined_sequences, filename)

# Read the FASTA file
fasta_content <- readLines(filename)

# Display the contents
cat(fasta_content, sep = "\n")
70 changes: 70 additions & 0 deletions case_studies/CARD/CARD_data/CARD-Download-README.txt
Member

@Cateline, thanks for adding this README. Out of curiosity, are these descriptions already paraphrased from the original source (CARD), or yet to be?

Collaborator Author

The descriptions are from the original source (CARD) and have not been paraphrased yet.

@@ -0,0 +1,70 @@
CARD Download README

Use or reproduction of these materials, in whole or in part, by any commercial
organization whether or not for non-commercial (including research) or commercial purposes
is prohibited, except with written permission of McMaster University. Commercial uses are
offered only pursuant to a written license and user fee. To obtain permission and begin
the licensing process, see http://card.mcmaster.ca/about.

CITATION:

Alcock et al. 2023. "CARD 2023: expanded curation, support for machine learning, and resistome
prediction at the Comprehensive Antibiotic Resistance Database" Nucleic Acids Research,
51, D690-D699. https://pubmed.ncbi.nlm.nih.gov/36263822/

CARD SHORT NAMES:

A CARD-specific abbreviation for AMR gene names associated with Antibiotic Resistance
Ontology terms, often not based on the literature. This is used for programmatic and
compatibility purposes and is not ontologically relevant. Each ontology term with an
associated AMR detection model has a CARD Short Name that appears in CARD data files
and output generated by RGI. If the original gene name is less than 15 characters, the
CARD short name is identical; if the gene name is greater than 15 characters, the CARD
Short Name has been abbreviated by CARD curators specifically to identify the proper
gene or protein name. All CARD Short Names are unique and have whitespace characters
replaced by underscore characters. The convention for pathogen names is capitalized
first letter of the genus followed by the lowercase first three letters of the species
name. The antibiotic abbreviations are from https://journals.asm.org/journal/aac/abbreviations
plus some custom abbreviations by the CARD curators. Simple CARD Short Names often do not
involve either, e.g. CTX-M-15, but where applicable the CARD Short Names follow pathogen_gene
or pathogen_gene_drug. The full lists of abbreviations can be found in the enclosed files:

"shortname_antibiotics.tsv"
"shortname_pathogens.tsv"
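
As an illustration only, the pathogen naming convention described above (capitalized first letter of the genus followed by the lowercase first three letters of the species name) can be sketched in R as follows. The function name `abbreviate_pathogen` is hypothetical and is not part of CARD or RGI; curated abbreviations may deviate from the rule, so "shortname_pathogens.tsv" remains the authoritative list.

```r
# Hypothetical helper, not part of CARD: derive an abbreviation from a
# "Genus species" name using the convention described in this README.
abbreviate_pathogen <- function(binomial) {
  words <- strsplit(trimws(binomial), "\\s+")
  vapply(words, function(w) {
    # A genus and a species are both required; otherwise return NA
    if (length(w) < 2) return(NA_character_)
    paste0(toupper(substr(w[1], 1, 1)), tolower(substr(w[2], 1, 3)))
  }, character(1))
}

abbreviations <- abbreviate_pathogen(c("Staphylococcus aureus",
                                       "Acinetobacter baumannii"))
print(abbreviations)
```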

FASTA:

Nucleotide and corresponding protein FASTA downloads are available as separate files for
each model type. For example, the "protein homolog" model type contains sequences of
antimicrobial resistance genes that do not include mutation as a determinant of resistance
- these data are appropriate for BLAST analysis of metagenomic data or searches excluding
secondary screening for resistance mutations. In contrast, the "protein variant" model
includes reference wild type sequences used for mapping SNPs conferring antimicrobial
resistance - without secondary mutation screening, analyses using these data will include
false positives for antibiotic resistant gene variants or mutants.

MODELS:

The file "card.json" contains the complete data for all of CARD's AMR detection models,
including reference sequences, SNP mapping data, model parameters, and ARO classification.
"card.json" is used by the Resistance Gene Identifier software.

Values for "High Confidence TB", "Moderate Confidence TB", "Minimal Confidence TB", and
"Indeterminate Confidence TB" were obtained from https://platform.reseqtb.org.

INDEX FILES:

The file "aro_index.tsv" contains a list of ARO tagging of GenBank accessions stored in
CARD.

The file "aro_categories.tsv" contains a list of ARO terms used to categorize all entries
in CARD and results via the RGI. These categories reflect AMR gene family, target drug
class, and mechanism of resistance.

The file "aro_categories_index.tsv" contains a list of GenBank accessions stored
in CARD cross-referenced with the major categories within the ARO. These categories
reflect AMR gene family, target drug class, and mechanism of resistance, so GenBank
accessions may have more than one cross-reference. For more complex categorization of
the data, use the full ARO available at http://card.mcmaster.ca/download.

The file "snps.txt" lists the SNPs associated with specific detection models.