A Proposed Workflow to Robustly Analyze Bacterial Transcripts in RNAseq Data from Extracellular Vesicles
This project was created to analyze bacterial (or non-human in general) transcripts from RNA-seq data. The project is organized as follows:
data/ | FASTQ files (raw, or processed where relevant). The FASTQs used for the project are available from GEO (GSE255317).
database/ | Files necessary for database creation.
results/ | Directory where (1) profiling outputs and (2) results from the analysis are stored.
|
-- counts/ | It contains the number of reads mapped/unmapped in each iteration of the host-mapping and profiling process.
These files are generated by notebook `0_get_fastq_read_counts.ipynb`, which requires large profiling output files that cannot be stored.
-- differencial_abundance/ | Outcome of differential abundance analysis for each S and mode.
-- figures/ | Figures (most of those used in the paper).
-- merged_counts/ | Stats related to the normalization process.
-- profiling/ | Output of profiling for each mode, sample and profiler. The output is provided as TAXPASTA output for genus and species, plus additional txt files from the profiler.
-- summary/ | Count tables for each sample, mode, S and normalization combination.
src/ | Source files.
|
-- version_2/ | This is the latest version.
| |
| -- install/ | Files to prepare the environment and run the profiling.
| | |
| | -- build_pys/ | Python scripts to create the files necessary for generating the profiler DBs.
| | -- run_profilers/ | Functions used to run profilers.
| | -- X_*.sh | bash scripts to run all the profilers.
| | -- list_vars.sh | List of variables used for the rest of .sh files in the folder.
| | -- sample_processing_pipeline.sh | This is the general sh file to run X_*.sh files from 0 to 2 (3 is excluded).
| -- sh_funcs/ | Derivative bash functions.
| -- table_artificial_taxid.csv | File with all information to create the in silico sample.
| -- list_vars.py | List of variables used for the rest of .ipynb files.
| -- X_*.ipynb | Notebook files to run the analysis of the profiling.
- Clone the Repository:
    git clone https://github.com/NanoNeuro/EV_taxprofiling.git
    cd EV_taxprofiling/src/version_2/install
- Set Up the Environment: Install the necessary dependencies for the project. You can create a conda environment using the provided `environment.yml` file (if available) or install the necessary packages manually.
    conda env create -f environment.yml  # Replace with your environment setup command if different
    conda activate EV_taxprofiling
- Download the files: You can download the files used for this experiment from the GEO database (GSE255317). Download the FASTQ files into `data/`. You may need to rename them; you can follow the naming in `data/samples_rnaseq.csv`. Feel free to adapt the code to your own samples.
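The renaming step can be scripted. The sketch below is only an illustration: it assumes a CSV with two columns named `original_name` and `new_name`, which are hypothetical and may not match the real header of `data/samples_rnaseq.csv`, so adjust them before use. It defaults to a dry run that only reports what would be renamed.

```python
import csv
from pathlib import Path


def rename_fastqs(data_dir, mapping_csv,
                  old_col="original_name", new_col="new_name", dry_run=True):
    """Rename FASTQ files in data_dir following a two-column CSV mapping.

    The column names are hypothetical placeholders; adapt them to the
    actual CSV header. Returns the (old_path, new_path) pairs that were
    renamed (or would be renamed, when dry_run is True).
    """
    data_dir = Path(data_dir)
    planned = []
    with open(mapping_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            old_path = data_dir / row[old_col]
            new_path = data_dir / row[new_col]
            if old_path == new_path or not old_path.exists():
                continue  # nothing to do for this row
            if new_path.exists():
                raise FileExistsError(f"refusing to overwrite {new_path}")
            planned.append((old_path, new_path))
    if not dry_run:
        for old_path, new_path in planned:
            old_path.rename(new_path)
    return planned
```

Calling `rename_fastqs("data", "data/samples_rnaseq.csv")` first and inspecting the returned pairs before rerunning with `dry_run=False` avoids accidental renames.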
To create the profiler database, navigate to the build_pys directory and run the relevant Python scripts. This step prepares the required files for database generation.
You can adapt your scripts to include the profilers you deem necessary.
- Navigate to the `build_pys` directory:
    cd build_pys
- Run the Scripts: Execute the Python scripts to generate the necessary database files. Replace `<script_name>` with the actual script names provided in the folder.
    python <script_name>.py
Stuff to consider
- The species are downloaded using this command (shown here for archaea):
    ncbi-genome-download -F protein-fasta,fasta -p ${CPUS} -r 10 -P -l complete,chromosome -o ${BASEDIR_PROFILER_DB}/GENERAL/archaea --flat-output -m ${BASEDIR_PROFILER_DB}/GENERAL/archaea/table_archaea.txt archaea
  This will download ~50k bacteria and ~15k viruses, which may be too much. You can change the `-l` parameter to include fewer species.
- The process to create certain DBs is resource intensive! You may need a powerful computer or an HPC to run this part. Unfortunately, the databases are so big that I cannot store them anywhere.
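To illustrate how the `-l` parameter narrows the download, here is a minimal sketch that rebuilds the command above with a configurable list of assembly levels. The function name, the default CPU count and the `/path/to/profiler_db` placeholder (standing in for `${BASEDIR_PROFILER_DB}`) are assumptions made for the example and are not part of the repository.

```python
import shlex


def build_download_command(group, db_dir,
                           levels=("complete", "chromosome"), cpus=4):
    """Return the ncbi-genome-download argument list for one taxonomic group.

    Mirrors the command shown above; only the assembly levels passed to -l
    change. Fewer levels (e.g. only "complete") means fewer genomes.
    """
    out_dir = f"{db_dir}/GENERAL/{group}"
    return [
        "ncbi-genome-download",
        "-F", "protein-fasta,fasta",
        "-p", str(cpus),
        "-r", "10",
        "-P",
        "-l", ",".join(levels),
        "-o", out_dir,
        "--flat-output",
        "-m", f"{out_dir}/table_{group}.txt",
        group,
    ]


# Restrict the archaea download to complete genomes only (placeholder path).
cmd = build_download_command("archaea", "/path/to/profiler_db",
                             levels=("complete",))
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the printed command looks right.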
- Set Up Variables: Edit the `list_vars.sh` file to set any necessary variables for your analysis. Open it in a text editor and customize it as needed.
    nano list_vars.sh
  Save and exit.
- Run the Bash Scripts: Execute the `.sh` scripts in the `run_profilers` directory to initiate the profiling process.
    bash X_<script_name>.sh
  If you run a script directly (`./X_<script_name>.sh`) instead of through `bash`, make it executable first:
    chmod +x X_<script_name>.sh
- Monitor Progress: Profiling may take some time depending on the size of your dataset. Check the output logs for progress and troubleshoot any errors.
- Select the desired notebook (e.g., `X_<notebook_name>.ipynb`).
- Follow the instructions within the notebook cells. Make sure to execute them in sequence.
- Edit the `list_vars.py` file as necessary to set variables for analysis, similar to the Bash workflow.
How to replicate the analysis
All required intermediate files are downloadable from Zenodo (10.5281/zenodo.14887264). You can download them and decompress them following the directory structure indicated at the top.
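After decompressing, a quick check can confirm that the folders match the layout described at the top. The sketch below is an assumption-laden helper, not part of the repository: the list of expected directories is copied from the directory tree in this README, and the Zenodo archive may not contain every one of them.

```python
from pathlib import Path

# Directories taken from the project tree described at the top of this README.
EXPECTED_DIRS = [
    "data",
    "database",
    "results/counts",
    "results/differencial_abundance",
    "results/figures",
    "results/merged_counts",
    "results/profiling",
    "results/summary",
]


def missing_directories(project_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

For example, `missing_directories(".")` run from the repository root returns an empty list when every folder is in place.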
A proposed workflow to analyze bacterial transcripts in RNAseq from blood extracellular vesicles of people with Multiple Sclerosis
Alex M. Ascensión, Miriam Gorostidi-Aicua, Ane Otaegui-Chivite, Ainhoa Alberro, Rocio del Carmen Bravo-Miana, Tamara Castillo-Trivino, Laura Moles, David Otaegui
Frontiers in Microbiology; doi: https://doi.org/10.3389/fmicb.2025.1486661