Comparative Genomics Exercise 7: Comparing taxonomic profiles among samples
## 1. Introduction & Objectives
In this exercise, we will perform a comparative analysis of the rhizosphere microbiome of tomato plants under two different conditions:
- Condition C0: Plants inoculated with a consortium of Plant Growth-Promoting Rhizobacteria (PGPR), specifically nitrogen-fixing bacteria (Rhizobium spp.).
- Condition C1: Control plants or plants under dysbiotic conditions where these specific promoters are absent or depleted.
We will use mOTUs taxonomic profiles generated from metagenomic shotgun sequencing. Our goal is to move beyond simple spreadsheets and apply Compositional Data Analysis (CoDA) to statistically validate which bacteria differ between these conditions.
Steps:
- Set up a Python-based bioinformatics environment.
- Execute a robust analysis pipeline that handles relative abundance data correctly (CLR transformation).
- Interpret Alpha and Beta diversity plots.
- Correlate statistical findings (Volcano plots) with "Ground Truth" biological data.
We are working with a "Mock Dataset", which means the sequencing data was synthetically generated based on a known biological composition.
- Input Data: 10 samples total (5 replicates for C0, 5 replicates for C1).
- Sequencing: Illumina PE150 (approx. 120 Mbp per sample).
- Location: /data/csic_taller/04.taxonomic_profiling/motus/g1/
### The Nature of Simulated Replicates
It is important to understand that the 5 replicates (R0-R4) for each condition are not identical copies.
While they are all derived from the "Ground Truth" profiles (see table below), the simulation process introduces stochastic variation to mimic real-world biological and technical noise. This means:
- Biological Variation: Just like in nature, the abundance of specific bacteria varies slightly from plant to plant.
- Sequencing Depth & Bias: The simulation mimics the randomness of the sequencing machine. Some species might be slightly over- or under-represented in different replicates due to random sampling, just as they would be in a real experiment.
Therefore, while the Ground Truth table represents the target or mean composition, the actual mOTUs files you analyze will contain realistic fluctuations around these values.
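To make the idea of "fluctuations around a target" concrete, here is a minimal sketch of one source of that noise: random sampling of reads from a fixed composition. It is not the simulator used to build this dataset. The taxon names, the target proportions, and the read count are all hypothetical, and biological plant-to-plant variation is not modeled.

```python
import random


def simulate_replicate(profile, n_reads, rng):
    """Draw n_reads from a relative-abundance profile; return observed proportions."""
    taxa = list(profile)
    weights = [profile[t] for t in taxa]
    # Each read is assigned to a taxon with probability equal to its abundance.
    draws = rng.choices(taxa, weights=weights, k=n_reads)
    counts = {t: 0 for t in taxa}
    for t in draws:
        counts[t] += 1
    return {t: counts[t] / n_reads for t in taxa}


# Hypothetical target composition (NOT the course's Ground Truth table).
target = {"taxon_A": 0.50, "taxon_B": 0.30, "taxon_C": 0.19, "taxon_D": 0.01}

rng = random.Random(42)  # fixed seed so the run is reproducible
replicates = [simulate_replicate(target, n_reads=2000, rng=rng) for _ in range(5)]

for i, rep in enumerate(replicates):
    print(f"R{i}", {t: round(p, 3) for t, p in rep.items()})
```

Each replicate lands close to the target but not exactly on it, and rare taxa such as the hypothetical `taxon_D` fluctuate the most in relative terms.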
### The "Ground Truth" Design
Before running the code, familiarize yourself with the biological design. This is what we expect our code to statistically confirm despite the variation among replicates.
- Condition_0 (C0 - Inoculated): Designed to have a high relative abundance of the PGPR consortium, specifically Rhizobium species (R. binae, R. etli, R. leguminosarum, etc.).
- Condition_1 (C1 - Dysbiotic/Control): In these samples, the Rhizobium species are absent (0 abundance). Because relative abundances always sum to 100%, the relative abundance of the other organisms, such as fungal pathogens (Colletotrichum) and neutral environmental bacteria, appears higher even if their absolute amounts have not changed: they occupy the "space" left empty by the missing symbionts.
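The effect described for Condition_1 can be checked with a small worked example. The numbers below are hypothetical absolute abundances chosen for easy arithmetic; they are not taken from the Ground Truth table.

```python
# Hypothetical absolute abundances (arbitrary units); not the course's table.
c0_absolute = {"Rhizobium": 60.0, "Colletotrichum": 10.0, "Neutral bacteria": 30.0}
# Same community with Rhizobium removed; everything else is unchanged.
c1_absolute = {"Rhizobium": 0.0, "Colletotrichum": 10.0, "Neutral bacteria": 30.0}


def to_relative(absolute):
    """Rescale absolute abundances so that they sum to 1."""
    total = sum(absolute.values())
    return {taxon: value / total for taxon, value in absolute.items()}


c0_rel = to_relative(c0_absolute)
c1_rel = to_relative(c1_absolute)

for taxon in c0_absolute:
    print(f"{taxon:18s} C0={c0_rel[taxon]:.2f}  C1={c1_rel[taxon]:.2f}")
```

In this toy case Colletotrichum goes from 0.10 to 0.25 in relative terms although its absolute amount is identical in both conditions. This is the reason a compositional approach is needed before interpreting such changes.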
To run the analysis script, we need specific Python libraries (scikit-bio, pandas, seaborn, statsmodels). These are pre-packaged in the gdav23 environment.
Step 1: Open your terminal and activate the environment:
eval "$(/home/miniforge3/bin/conda shell.bash hook)"
conda activate gdav23
Step 2: Verify that the input files exist:
ls /data/csic_taller/04.taxonomic_profiling/motus/g1/
You should see files named C0.R0.motus_g1 through C1.R4.motus_g1.
We will use the microbiome.py script. This script automates the loading of mOTUs files, calculates diversity metrics, performs Centered Log-Ratio (CLR) transformations, and runs statistical tests.
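For readers who have not met the CLR transformation before, the following is a minimal standard-library sketch of it. It is not the code inside microbiome.py. In particular, the pseudocount used here to avoid taking the log of zero is an assumption, and the script may handle zeros differently.

```python
import math


def clr(composition, pseudocount=1e-6):
    """Centered log-ratio: log of each part minus the mean log over all parts."""
    # Assumption: zeros are replaced by adding a small pseudocount to every part.
    shifted = [x + pseudocount for x in composition]
    logs = [math.log(x) for x in shifted]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical relative abundances for one sample, including an absent taxon.
sample = [0.6, 0.1, 0.3, 0.0]
print([round(v, 3) for v in clr(sample)])

# CLR depends only on ratios between parts, so rescaling the sample changes nothing.
print(clr([1, 2, 7], pseudocount=0))
print(clr([10, 20, 70], pseudocount=0))
```

CLR values always sum to zero within a sample. A positive value means the taxon is above the sample's geometric mean and a negative value means it is below.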
Step 3: Run the command below.
- Note: We use wildcards (*) to group all 5 replicates of Condition 0 into "Group 1" and all replicates of Condition 1 into "Group 2".
python /data/csic_taller/microbiome.py \
--group1 /data/csic_taller/04.taxonomic_profiling/motus/g1/C0* \
--group2 /data/csic_taller/04.taxonomic_profiling/motus/g1/C1* \
--output my_results.png
- --group1: Defines the first condition (C0 - The PGPR Inoculated group).
- --group2: Defines the second condition (C1 - The Control/Depleted group).
- --output: The name of the resulting image file.
Once the script finishes, open the generated image (my_results.png). It should look similar to the figure below. Let's analyze it panel by panel.
- Observed OTUs: Shows the total count of distinct species detected.
- Shannon Entropy: Measures diversity considering both richness (count) and evenness (balance).
- Chao1: Estimates total richness (including rare species we might have missed).
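The three metrics can be written out in a few lines. This is a sketch under stated assumptions, not the implementation the script uses: Shannon entropy is computed with the natural logarithm (other tools may use base 2), Chao1 uses the bias-corrected formula, and the count vector is hypothetical. Chao1 is defined on integer counts because it relies on the number of taxa seen exactly once and exactly twice.

```python
import math


def observed_taxa(counts):
    """Number of taxa with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon entropy (natural log) of the proportions derived from counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: observed + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    singletons = sum(1 for c in counts if c == 1)  # F1
    doubletons = sum(1 for c in counts if c == 2)  # F2
    return observed_taxa(counts) + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical read counts per taxon for one sample.
counts = [50, 30, 10, 5, 2, 2, 1, 1, 1, 0]

print("Observed:", observed_taxa(counts))
print("Shannon :", round(shannon(counts), 3))
print("Chao1   :", chao1(counts))
```

Here 9 taxa are observed and Chao1 estimates 10, because the three singletons suggest that at least one more rare taxon was missed.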
Analysis Questions:
- Look at the Observed OTUs boxplot. Group 1 (Red, C0) has ~35 species, while Group 2 (Blue, C1) has ~22. Why is C0 richer?
Hint: Look at the "Ground Truth" table. Count how many Rhizobium species are present in Condition_0 vs Condition_1.
- Does the Shannon Entropy confirm that C0 is more diverse?
- Aitchison Distance (Left): The Euclidean distance between CLR-transformed samples. Because it is built on log-ratios rather than raw proportions, it is the appropriate distance for compositional data.
- Bray-Curtis (Right): The classical ecological metric.
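The two distances can be compared side by side on a pair of toy samples. This is a standard-library sketch, not the script's code. The sample vectors are hypothetical, and the pseudocount used before taking logs is an assumption.

```python
import math


def clr(composition, pseudocount=1e-6):
    """Centered log-ratio transform (pseudocount added to avoid log(0))."""
    logs = [math.log(x + pseudocount) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison(x, y, pseudocount=1e-6):
    """Euclidean distance between the CLR-transformed samples."""
    cx, cy = clr(x, pseudocount), clr(y, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


def bray_curtis(x, y):
    """Sum of absolute differences divided by the sum of all abundances."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


# Hypothetical relative abundances: [Rhizobium, Colletotrichum, neutral bacteria].
c0_sample = [0.60, 0.10, 0.30]
c1_sample = [0.00, 0.25, 0.75]

print("Bray-Curtis:", round(bray_curtis(c0_sample, c1_sample), 3))
print("Aitchison  :", round(aitchison(c0_sample, c1_sample), 3))
```

In this toy pair the two non-Rhizobium taxa keep the same 1:3 ratio in both samples. Bray-Curtis still counts their changed proportions as a difference, while the Aitchison distance is driven by the ratios involving the taxon that disappeared. Its value here depends strongly on the chosen pseudocount.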
Analysis Questions:
- Look at the Aitchison PCA. Do the Red dots (C0) and Blue dots (C1) overlap, or are they completely separated?
- PC1 (the X-axis) explains 72% of the variance. This is a huge number. What does this tell you about the difference between the two conditions?
Interpretation: The difference is not subtle. The microbial communities are fundamentally different.
This plot shows the results of the statistical testing (a t-test on CLR-transformed data).
- X-axis (Log Fold Change):
  - Positive (Right): Bacteria more abundant in Group 1 (C0).
  - Negative (Left): Bacteria more abundant in Group 2 (C1).
- Y-axis (-Log10 P-value): Higher points are more statistically significant.
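The sketch below shows how one point of such a plot could be computed for a single taxon. Note that it swaps in an exact permutation test for the t-test, purely so that it runs with the standard library alone; in practice a t-test p-value could come from a function such as `scipy.stats.ttest_ind`. The CLR values are hypothetical, and the function name `volcano_point` is introduced here for illustration only.

```python
import math
from itertools import combinations


def volcano_point(group1, group2):
    """Return (difference in mean CLR, -log10 p) for one taxon.

    The p-value comes from an exact two-sided permutation test, used here
    in place of a t-test so that no external library is required.
    """
    n1, n2 = len(group1), len(group2)
    pooled = group1 + group2
    total = sum(pooled)
    observed = sum(group1) / n1 - sum(group2) / n2

    extreme = 0
    n_splits = 0
    # Enumerate every way of relabeling n1 of the pooled values as "group 1".
    for idx in combinations(range(n1 + n2), n1):
        s1 = sum(pooled[i] for i in idx)
        diff = s1 / n1 - (total - s1) / n2
        n_splits += 1
        if abs(diff) >= abs(observed) - 1e-12:  # tolerance for float rounding
            extreme += 1

    p_value = extreme / n_splits  # never 0: the observed labeling is counted
    return observed, -math.log10(p_value)


# Hypothetical CLR values, 5 replicates per group.
enriched_in_c0 = ([4.1, 3.8, 4.4, 4.0, 3.9], [-2.0, -2.2, -1.9, -2.1, -2.0])
unchanged = ([0.1, -0.2, 0.0, 0.3, -0.1], [0.0, 0.2, -0.1, 0.1, -0.3])

x_enr, y_enr = volcano_point(*enriched_in_c0)
x_unc, y_unc = volcano_point(*unchanged)
print(f"enriched in C0: x={x_enr:.2f}, y={y_enr:.2f}")
print(f"unchanged     : x={x_unc:.2f}, y={y_unc:.2f}")
```

With 5 replicates per group there are only 252 possible relabelings, so the smallest two-sided p-value this test can return is 2/252. A t-test does not have this floor, which is one reason the y-values in the script's plot will differ from this sketch.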
Analysis Questions:
- Notice the cluster of Red dots on the far right (Positive Log Fold Change). These are bacteria that are highly abundant in C0 but low/absent in C1. Based on the Ground Truth table, what genus of bacteria are these likely to be?
- Notice the Blue dots on the left. These are enriched in C1. Looking at the Ground Truth table, which organisms have higher relative abundance in Condition_1?
## 6. Conclusion
By comparing our Python analysis with the known biological inputs (the table), we can validate our pipeline:
- Alpha Diversity: Correctly identified C0 as richer (due to the added Rhizobium).
- Beta Diversity: Correctly showed massive separation (the community structure is totally changed by the inoculation).
- Differential Abundance: The Volcano plot correctly identified the Rhizobium consortium as the driver of the difference (the red dots on the right).
Final Thought: If this were real experimental data (where we don't have a cheat sheet), this pipeline would have revealed that the inoculation worked and that Rhizobium colonized the rhizosphere in Group 1.