This is the repo for the homeworks of the BI-2024 Practicum Course
Supplementary materials for project:
by Dmitriy Matach and Kirill Petrikov
Additional files for 16S rRNA amplicon metagenome sequencing:
teeth.Rmd - R Markdown script for DADA2 analysis
reads_quality_control.png - DADA2 quality control results for each sample
reads_quality_control_aggregate.png - aggregated DADA2 quality control results
asv_table.csv - final ASV table from DADA2 analysis
tax_table.csv - ASV taxonomy assignment for MicrobiomeAnalyst
metadata.csv - samples metadata for MicrobiomeAnalyst
norm_libsizes_0.png - MicrobiomeAnalyst library size plot for input samples
Additional R-script for visualization of ASVs corresponding to "red complex" bacteria
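library(dplyr)
library(reshape2)
library(ggplot2)  # needed for ggplot(), geom_point() and guide_axis()

# `ps` is assumed to be the phyloseq object built in the DADA2 analysis,
# with samples in rows and ASVs in columns.
# Select the ASV columns corresponding to "red complex" bacteria.
barplotdata <- ps@otu_table[, c(150, 135, 243, 129, 177, 327, 397)]

# Long format: Var1 = sample, Var2 = ASV, value = abundance
bpdata <- melt(as(barplotdata, "matrix"))

# Label each sample by condition (`.default` requires dplyr >= 1.1.0)
bpdata <- bpdata %>%
  mutate(affliction = case_when(
    Var1 == 'SRR986773.fastq' ~ "Periodontitis and the red complex",
    Var1 == 'SRR986774.fastq' ~ "Periodontitis and the red complex",
    Var1 == 'SRR986778.fastq' ~ "Periodontitis",
    Var1 == 'SRR986779.fastq' ~ "Periodontitis",
    Var1 == 'SRR986782.fastq' ~ "Periodontitis",
    .default = 'Unafflicted'
  ))

# Bubble plot: point size is the ASV abundance, colour is the sample condition
p <- ggplot(bpdata, aes(x = Var1, y = Var2)) +
  geom_point(aes(size = value, col = affliction)) +
  scale_x_discrete(guide = guide_axis(angle = 90)) +
  xlab('read') +
  ylab('ASV')
p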
Additional files for metagenome-assembled genomes (MAGs) analysis:
sankey_plot.html - Sankey diagram by Pavian for MAGs taxonomy assignment results
ref_unique_genes.gff - features selected after intersection as unique to the modern Tannerella forsythia reference genome
KofamKOALA_result.txt - results of KofamKOALA search on selected proteins unique to the reference genome
Create an index of the reference genome and align the MAG contigs to it
bwa index NC_016610.fasta && \
bwa mem -t 8 NC_016610.fasta G12_assembly.fna | \
samtools view -b --threads 8 | \
samtools sort --threads 8 > alignment.sorted.bam
Get alignment statistics
samtools flagstat alignment.sorted.bam
Create a BED file from the BAM file
bedtools bamtobed -i alignment.sorted.bam > alignment.bed
Intersect the reference genome annotation with the MAG contig alignments, keeping only features unique to the reference genome
bedtools intersect -v -a NC_016610.gff3 -b alignment.bed > intersect.gff
Select only CDS features, filter out «hypothetical» and «pseudo» proteins, then split the rest into transposases and all but transposases
awk -F'\t' '$3 == "CDS"' intersect.gff | grep -v 'product=hypothetical' | grep -v 'pseudo=true' | grep 'transposase' > transp.gff
awk -F'\t' '$3 == "CDS"' intersect.gff | grep -v 'product=hypothetical' | grep -v 'pseudo=true' | grep -v 'transposase' > cds_inters.gff
Select protein IDs for further NCBI Entrez batch download and KofamKOALA analysis
cut -f 9 cds_inters.gff | cut -f 1 -d ";" | cut -f 2 -d "=" | grep -o -P "WP_\d+\.\d+" > prot_idxs.txt
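The CDS filtering and protein ID extraction steps above can be tried on a small made-up GFF before running them on intersect.gff. The following is a minimal sketch, not part of the original analysis: every record, coordinate and accession in the toy file is a placeholder, the file names (toy.gff, toy_transp.gff, toy_cds.gff, toy_ids.txt) and the helper `cds_filter` are hypothetical, and `grep -oE` with a POSIX character class is used here as a portable stand-in for `grep -o -P`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so nothing from the real analysis is touched.
tmp=$(mktemp -d)

# Print one tab-separated GFF3 record from nine arguments.
row() { printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@"; }

# Toy annotation: all values below are made-up placeholders.
{
  row chr1 toy gene 1   90  . + . 'ID=gene-A'
  row chr1 toy CDS  1   90  . + 0 'ID=cds-WP_000000001.1;product=hypothetical protein'
  row chr1 toy CDS  100 190 . + 0 'ID=cds-WP_000000002.1;product=IS4 family transposase'
  row chr1 toy CDS  200 290 . + 0 'ID=cds-WP_000000003.12;product=ABC transporter permease'
  row chr1 toy CDS  300 390 . + 0 'ID=cds-B;product=kinase;pseudo=true'
} > "$tmp/toy.gff"

# Keep CDS records only, then drop hypothetical and pseudo entries.
# -F'\t' sets the field separator before the first record is read.
cds_filter() {
  awk -F'\t' '$3 == "CDS"' "$1" \
    | grep -v 'product=hypothetical' \
    | grep -v 'pseudo=true'
}

# Split the remaining CDS records into transposases and everything else.
cds_filter "$tmp/toy.gff" | grep 'transposase'    > "$tmp/toy_transp.gff"
cds_filter "$tmp/toy.gff" | grep -v 'transposase' > "$tmp/toy_cds.gff"

# Take the ID attribute (first key=value pair of column 9) and keep only
# the WP_ accession with its full version suffix.
cut -f 9 "$tmp/toy_cds.gff" \
  | cut -f 1 -d ';' \
  | cut -f 2 -d '=' \
  | grep -oE 'WP_[0-9]+\.[0-9]+' > "$tmp/toy_ids.txt"

cat "$tmp/toy_ids.txt"
```

Only the non-transposase, non-hypothetical, non-pseudo CDS record should survive to toy_ids.txt, including its two-digit version suffix. The scratch directory is left in place so the intermediate files can be inspected.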