Skip to content

Conversation

@Cateline
Copy link
Collaborator

…undant files from R folder

Description

What kind of change(s) are included?

  • Feature (adds or updates new capabilities)
  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have searched for existing content to ensure this is not a duplicate.
  • I have performed a self-review of these additions (including spelling, grammar, and related).
  • I have added comments to my code to help provide understanding.
  • I have added a test which covers the code changes found within this PR.
  • I have deleted all non-relevant text in this pull request template.
  • Reviewer assignment: Tag a relevant team member to review and approve the changes.
    @jananiravi @falquaddoomi @epbrenner @AbhirupaGhosh

@jananiravi jananiravi added enhancement New feature or request outreachy for outreachy interns bioinfo Bioinformatics related labels Oct 20, 2024
Copy link
Contributor

@falquaddoomi falquaddoomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this looks a lot better! I see that the check is passing, too, which makes sense considering that the package itself is unchanged. I'll have to rely on someone else to review the R code in detail, but it looks reasonable to me.

If this PR replaces #103, I'd suggest closing that one without merging to reduce the potential for confusion.

@Cateline
Copy link
Collaborator Author

Cateline commented Oct 22, 2024 via email

Copy link
Contributor

@epbrenner epbrenner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking better!

If this script is meant to be generalized so you can enter the bug/drug combo of interest and get the results out, I think it needs some refactoring as I mention in line-by-line comments, but for this specific bug/drug combo, it looks like it works all right. If we want to expand this to be generalizable, I suggest we can merge this script and add a new issue to generalize it.

Another suggestion is actually to drop most of the CARD_data files from being included in this PR, though. Only aro_index.tsv, shortname_antibiotics.tsv, and shortname_pathogens.tsv are used from the set, so no point in adding the much larger .fasta sequence files to the PR as well, at least unless you want to refactor much more broadly to search those files for your sequences instead of using rentrez. That would be a major rewrite, and isn't what I'm suggesting here, so I'd just omit the extra CARD data files.



# Mutate data
aro_index <- aro_index %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step yields a lot of NA values, so just clarifying that that's intended. Multi-pathogen genes like aadA or acrB don't parse into the "pathogen / gene / drug" pattern successfully, so you have things like

pathogen,gene,drug
Abau,ampC,BLA
Abau,Abaf,NA
acrB, NA, NA

Copy link

@AbhirupaGhosh AbhirupaGhosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest you can let go of the files like nucleotide...fasta, protein..fasta that you don't need for your R script.

@falquaddoomi
Copy link
Contributor

Hey @Cateline, for future reference you can delete multiple files in a single commit; you don't need to create one commit per deletion.

@Cateline
Copy link
Collaborator Author

Hey @Cateline, for future reference you can delete multiple files in a single commit; you don't need to create one commit per deletion.

Oh, okay. Didn't know that. I was using the Github API to delete them

Cateline and others added 2 commits October 22, 2024 13:15
Change combined FASTA sequences file name

Co-authored-by: Evan Pierce Brenner <[email protected]>
@jananiravi
Copy link
Member

Agree with the broad comments of:

  1. Cleaning up for said bug-drug combo
  2. Rm unnecessary large files
  3. Pooling related commits
  4. Creating a new but related issue (tagging this issue + pr) to generalize for any new bug-drug combo (while bearing in mind companion file requirements and which of these GitHub friendly in terms of size). Could @AbhirupaGhosh @epbrenner create this new issue based on how this structured?
  5. Guess this is the reason Fixes Phase 1 of Issue #27 #103 is closed?

Any other outstanding Qs to fix/merge this issue? If so, I can look at it more carefully later this week.

Thanks, @Cateline !

Updated package loading to use require() for conditional installation. 
Renamed fasta file and removed redundant lines (35-44).
Removed decompression step along with renaming of the zipped file
Copy link
Member

@jananiravi jananiravi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the contribution, @Cateline. With feedback from @AbhirupaGhosh and others, I think this should be ready to go soon in an iteration or more. how are you planning on extending it beyond Saur and DAP?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Cateline, thanks for adding this README. Out of curiosity, are these descriptions already paraphrased from the original source (CARD), or yet to be?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The descriptions are from the original source (CARD) and have not been paraphrased yet

Cateline and others added 9 commits October 25, 2024 00:01
Changed data import function from read.delim to read_delim
- Standardize 'Protein_Accession' naming conventions
- Switch 'sapply' to 'purrr::map' functions
- Rename 'aro_index' to 'resistance_profile' for better context
- Use explicit column names instead of positional arguments where applicable
@Cateline Cateline marked this pull request as draft October 31, 2024 22:34
Update `extract_card_info` function to correctly categorize complex gene entries
-Improve merging process between extracted resistance profile data, antibiotics data, and pathogens data
-Add logic to handle multi-species pathogens and multi-class drugs
- Implement `fetch_fasta_sequence` to retrieve FASTA sequences from Entrez using protein accession IDs.
- Add loop to iterate over `filtered_data`
Copy link
Member

@jananiravi jananiravi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving minor comments. Defer to @AbhirupaGhosh @epbrenner & FA/DM for a full review.

gene <- NA
drug <- drug_class # Default to Drug Class column

# Determine the information based on the split names and patterns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you share an example file (snippet pre and post name cleanup)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you share an example file (snippet pre and post name cleanup)?

Hello @jananiravi , by this do you mean I should use the View() function in R to allow for the visual inspection of the dataset before and after processing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I meant snapshots or example data stored locally (as part of the commit) to be able to run the code and check locally.

Cateline and others added 8 commits November 1, 2024 12:42
Fixed tar file extraction by adding .bz2 suffix to line 16 for proper file handling
 Update file paths for antibiotics and pathogens data in Lines 89-90 for proper loading.
… options for multiclass exclusion and species restriction

This update introduces a filter_resistance_mechanisms function with customizable options for partial drug matches, exclusion of multiclass resistance, and species-specific filtering.
 -Compared original dataset (`aro_index.tsv`) with cleaned dataset (`resistance_profile_data.tsv`).
-Saved snippets of the pre-cleanup data (`aro_index.tsv`) and post-cleanup data (`resistance_profile_data.tsv`) for comparison.
Expanded Bug-Drug.R code to retrieve and save FASTA sequences for ESKAPE pathogens resistant to DAP (Daptomycin)
Copy link
Member

@jananiravi jananiravi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some quick thoughts. Also, look through recent PRs from @awasyn for cross-checks. Feel free to add a code review for that PR as well (related to CARD #111).

gene <- NA
drug <- drug_class # Default to Drug Class column

# Determine the information based on the split names and patterns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I meant snapshots or example data stored locally (as part of the commit) to be able to run the code and check locally.

Comment on lines 247 to 257











Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

# Loop through each Protein Accession in the filtered data to fetch sequences
for (i in 1:nrow(filtered_data_saurdap)) {
# Get the Protein Accession ID
Protein_accession <- filtered_data_saurdap$Protein_Accession[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confusing alternating use of Protein_ vs. protein_accession. 🤔



# Define the output file for the FASTA sequences
output_fasta_file <- "Staph_aureus_Daptomycin_sequences.fasta"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using short names for species (4-char) and drugs (antibiotics, 3-char).
arg = antibiotic resistance genes, for example.
Which shortnames are you planning to use?
cc: @AbhirupaGhosh @charmvang @awasyn @epbrenner

Suggested change
output_fasta_file <- "Staph_aureus_Daptomycin_sequences.fasta"
output_fasta_file <- "Saur_dap_arg.fasta"



# Extract pathogen, gene, drug, and include Protein.Accession from 'CARD Short Name'
extract_card_info <- function(card_short_name, drug_class, `Protein Accession`, `DNA Accession`) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename all colnames with spaces and special characters to now include only _. Also avoid multiple cases.

@AbhirupaGhosh @charmvang @awasyn @epbrenner @the-mayer -- using camelCase for colnames or snake_case (without caps)?

@@ -0,0 +1,76 @@
AAC Abbreviation Molecule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use this for short nomenclature, e.g., spp_dru_...

Comment on lines +20 to +30
# Install and Load dplyr and readr
packages <- c("dplyr", "readr")

for (pkg in packages) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg)
library(pkg, character.only = TRUE)
} else {
library(pkg, character.only = TRUE)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needed?

Comment on lines +16 to +22

# View the pre-cleanup snippet
View(aro_index_snippet)

# View the post-cleanup snippet
View(resistance_profile_data_snippet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this was used to look through the dataset -- e.g., with glimpse. But I meant actual example input/output data to run and check.

@Cateline
Copy link
Collaborator Author

Cateline commented Nov 26, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bioinfo Bioinformatics related enhancement New feature or request outreachy for outreachy interns

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants