Extending mzR
The mzR R/Bioconductor package provides a unified API to the common open and community-driven file formats and parsers available for mass spectrometry data, namely mzXML, mzML and mzData (see the vignette for details). It uses C and C++ code from third-party open-source projects and relies heavily on the Rcpp package, notably to provide a direct mapping from R to the C++ infrastructure.
Currently, mzR provides two backends for reading raw mass spectrometry data:
- netCDF, which reads, as the name implies, netCDF data
- RAMP, to read mzData and mzXML via the ISB RAMP parser. This backend can also read mzML through the proteowizard RAMPadapter around the proteowizard infrastructure, but this interface is limited to the lowest common denominator between the mzXML/mzData/mzML formats.
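The split between the two backends can be pictured as a simple dispatch on file type. The following is a minimal, self-contained sketch and not mzR's actual code: the `Backend` enum, the helper names and the netCDF extensions (`cdf`, `nc`) are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical backend identifiers; mzR's real backend names may differ.
enum class Backend { NetCDF, Ramp, Unknown };

// Return the lower-cased extension of `path` (the text after the last '.'),
// or an empty string if the path contains no '.'.
std::string lowerExtension(const std::string& path) {
    const std::string::size_type dot = path.find_last_of('.');
    if (dot == std::string::npos) return "";
    std::string ext = path.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}

// Map a file name to the backend that can read it.
// The extensions "cdf" and "nc" for netCDF are an assumption.
Backend chooseBackend(const std::string& path) {
    const std::string ext = lowerExtension(path);
    if (ext == "cdf" || ext == "nc") return Backend::NetCDF;
    // RAMP covers mzXML and mzData directly, and mzML through RAMPadapter.
    if (ext == "mzxml" || ext == "mzdata" || ext == "mzml") return Backend::Ramp;
    return Backend::Unknown;
}
```

Under these assumptions, `chooseBackend("run01.mzML")` yields `Backend::Ramp`. A dedicated pwiz backend would add a further enumerator, so that mzML files are no longer restricted to what RAMP exposes.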
This project is intended to add several related backends to mzR, by providing a direct wrapper around -- and full access to -- the proteowizard msdata object. The candidate will interact closely with Laurent Gatto and Steffen Neumann, and the proteowizard and Rcpp communities.
The pwiz/mzML backend should be a drop-in replacement and should also pass the unit tests of the Bioconductor XCMS and MSnbase packages. Any XCMS and MSnbase modifications required will be done by Steffen Neumann and Laurent Gatto respectively. Secondly, the pwiz/mzML backend should provide access to the <chromatogram>s stored in an mzML file (Martens et al. 2011).
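To make the <chromatogram> requirement concrete, the sketch below lists the `id` of every <chromatogram> element found in a piece of mzML text. It is a deliberately naive text scan written for illustration, not an XML parser and not the proteowizard API; the function name `chromatogramIds` is hypothetical, and the scan assumes the `id` attribute is preceded by a single space. A real backend would obtain the same information from the proteowizard msdata object.

```cpp
#include <string>
#include <vector>

// Collect the id attribute of every <chromatogram ...> start tag in `xml`.
// Naive scan for illustration only; a real backend would rely on a parser.
std::vector<std::string> chromatogramIds(const std::string& xml) {
    const std::string tag = "<chromatogram";
    const std::string attr = " id=\"";
    std::vector<std::string> ids;
    std::string::size_type pos = 0;
    while ((pos = xml.find(tag, pos)) != std::string::npos) {
        const std::string::size_type after = pos + tag.size();
        pos = after;  // continue the next search past this match
        // Skip longer element names such as <chromatogramList>.
        if (after >= xml.size() || (xml[after] != ' ' && xml[after] != '>')) continue;
        // The start tag ends at the next '>'.
        const std::string::size_type close = xml.find('>', after);
        if (close == std::string::npos) break;
        // Look for the id attribute inside this start tag only.
        const std::string::size_type start = xml.find(attr, after);
        if (start == std::string::npos || start > close) continue;
        const std::string::size_type valueStart = start + attr.size();
        const std::string::size_type valueEnd = xml.find('"', valueStart);
        if (valueEnd == std::string::npos || valueEnd > close) continue;
        ids.push_back(xml.substr(valueStart, valueEnd - valueStart));
    }
    return ids;
}
```

The RAMP interface has no counterpart for this listing, which is why chromatogram access depends on the direct pwiz wrapper.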
The project also aims to facilitate access to identification data in the mzIdentML data format (Jones et al. 2012) through the proteowizard framework. A backend similar to the one currently available for raw mass spectrometry files (mzXML, mzML, mzData) will be developed for mzIdentML files.
At the end of the project, the candidate will be familiar with the major mass-spectrometry data formats and main MS toolkits used in proteomics and metabolomics. After successful completion of the project, the candidate will be added to the list of mzR contributors.
- Difficulty: medium to difficult, depending on experience and C++ fluency.
- Skills needed: intermediate R programming, knowledge of package development helpful, good knowledge of C and especially C++ essential. The candidate will have to familiarise herself with the mass-spectrometry data, the respective data formats and the proteowizard code base.
- Deliverable: pwiz and identification backends to be added to the mzR package.
- Mentors: Laurent Gatto and Steffen Neumann, with additional Rcpp support from Dirk Eddelbuettel.
- References: see project description.