Skip to content

use SummarizedExperiment instead of loose data structures #95

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

tnaake
Copy link

@tnaake tnaake commented Mar 26, 2025

Hello @ChristinaSchmidt1

This PR adds to (some extent, see below) SummarizedExperiment functionality to your package (issue #70). The implementation is still missing partly for VizVolcano functions - I didn't fully understand what is going on here and it would be good to discuss this more in detail. Also, most of the examples and the vignettes need to be adjusted for the new data structure. I have also reviewed most of the code and added several comments (via "EDIT:") to the code base where I think the code could be improved. Furthermore, I have adjusted the code style to some extent to BioC style.

Here are some further thoughts on the package (also in light of above):

  • the code should be simplified as much as possible, I find it difficult to follow what is going on for some steps. You could also think of adding more comments to the code
  • the code could be more robust, e.g. for many arguments with several options you could name them in the function argument and use match.arg instead of the tests written by you (it is much simpler and a standard). Also, write helper functions as needed for repetitive code snippets. Test the functions via testthat or alike. In general, add test functions for all functions you used
  • functions should not be defined in other functions, move them outside
  • improve documentation, add information on the type of the argument, e.g. character(n), numeric(n), logical(n). It seems difficult to understand what the structure should be for some of the arguments
  • do not create and use column names that require quoting, e.g. instead of Log2(Distance) just use log2_distance
  • currently the naming of objects is not consistent. You could use (ONLY) camel case for object names and (ONLY) snake case names for function names or differently.
  • Instead of %>% use just |> to remove dependencies. Also what is the added benefit of using operators such as %<>%>. I would rather focus on easy-to-read code than complex code and would not assume that every user that looks at the package knows what these operators mean
  • for some arguments / fcts the naming could be improved, e.g. InputData is quite generic and does not really imply what is the input
  • I have harmonized for most functions the function output (list), this will be a list that has entries "data" and "plot", which are again lists. I have used "data" instead of the "DF" naming since many fcts do not return a "data.frame"

I am happy to continue with the integration of SummarizedExperiment but thought it would be good to quickly check with you if this goes in the right direction and get green lights before investing too much time :)

@tnaake
Copy link
Author

tnaake commented Mar 26, 2025

P.S. maybe it is also a good idea, to have a separate branch SummarizedExperiment before merging it into the main branch.

@ChristinaSchmidt1
Copy link
Member

Hi Thomas,

Thanks for looking into this and making all these comments and changes!

In regard to the SummarizedExperiment file:

  • I would want to maintain both options, the classical passing of DFs as well as SummarizedExperiment files and either detect automatically what was passed or create a paramter se= TRUE/FLASE to leave the choice to the user.
  • At the moment we have many changes on the development branch that we will need to merged into main, so we may need to do this first.
  • In terms of function structure and names: We have planned to reduce dependencies, restructure code and documentation, and I will add the comments of yours to the list (if I understood correctly, you did not make any changes in this regard, but pointed out your observations?)

@tnaake
Copy link
Author

tnaake commented Mar 26, 2025

Hi @ChristinaSchmidt1,

sounds good. Out of curiosity, why would you want to maintain both? SE objects are quite easy to construct, are well structured, are encouraged by BioC since it's a BioC S4 class, and will simplify many steps in the code (e.g. merging)?

Yes, I pointed many out, but also made some changes :D So, whenever there is a "EDIT:" I didn't do any changes, but for other instances I did. Sorry, for being inconsistent here.

@tnaake
Copy link
Author

tnaake commented Mar 26, 2025

P.S. (again and in addition to the points above): Also, looking through your functions, it seems to me that many functions are not specific to metabolite analysis, but could also be used for other data modalities. I was wondering if you would like to reflect this in your naming and documentation that the package is more universal than for metabolite analysis :)

@ChristinaSchmidt1
Copy link
Member

Yes, I completely understand your point. Yet many of the facilities I have talked with, whilst designing the package, wanted DFs as input. Moreover, for the biological clustering functions we need differential analysis results, but can still pass other information. To my knowledge this will not work with se. For differential analysis results and some of the mapping ambiguities I am also not sure how we would link this with the input matrix, sample and feature metadata information. But maybe we can discuss about those specific cases.

For your P.S. - yes I have designed them like this on purpose. They work with any data modality and I will extend the vignettes to other omics and multi-omics in the future. On the dev branch we already have proteomics and transcriptomics and we will write some vignettes for those too. I have not considered to change the package name tough, but to extend the parameter descriptions.

@deeenes
Copy link
Member

deeenes commented Mar 26, 2025

Hi @tnaake, Many thanks for your PR! It takes big steps in directions that we anyway wanted to pursue, namely, the code quality improvements, the BioC compliance, and the use of suitable objects for a more streamlined implementation and API, are all essential for the consolidation of this package. We'll review and merge the PR in the next days. One little note though: please base contributions on the development branch, merge the upstream daily, and open the PR in a state when it has no conflict with the development branch (yes, we should also write the contrib guidelines).

@tnaake
Copy link
Author

tnaake commented Mar 27, 2025

Should I then create a new PR against the development branch and close this one?

@ChristinaSchmidt1
Copy link
Member

ChristinaSchmidt1 commented Mar 27, 2025

We will come back to you about this beginning of next week as we are currently in the process of merging some bigger changes into main and we also need some time to review the current work you have done (and maybe we can already use part of it as is).

In addition, as mentioned, we planned to keep the standard input too by packing all the inputs into an se file as the first step of each function and then proceed (to alow both se input or dfs). This would also involve offering the example data in both formats and having both options under the @example documentation amongst other things we have noted for adding se. So it may make sense to take these needs into account when making further changes.

@ChristinaSchmidt1
Copy link
Member

Hi Thomas,

As promised I am coming back to you about the se PR. We think it would be great if we could properly plan the addition of se input as you have already done a big chunk of this. For this we have a few requirements:

  • We want to keep the current input option as is, but as the first step within the function merge the input into an se and run all downstream analysis on se
  • Additionally, the user can of course provide an se file as input instead of the dfs and we will automatically detect this
  • We want to provide the example data as both, dfs or se file by adding an additional paramter. @examples should include both examples
  • The vignettes will not be changed to se file input at this stage of the project (later we can do this)
  • There are some functions were we would need to discuss how we can adapt se for those, as they are not classical matrix, feature- and sample metadata

Since, we are currently also conduction changes for code readibility, to remove unnessesary dependencies and prepare for Bioconductor checks, we are continuosly working on the development branch. This will also include changes in parameter names. Hence, it would be important to decide on initial goals and proceed step-wise, to ensure no conflicts are generated. As we are meeting next week, we can probably discuss this as it will be easier than in writing.

@deeenes will take care of communicating the details on contrib guidelines for continuous delivery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants