Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .github/workflows/main.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@ jobs:
steps:
- name: Checkout with submodules
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Formatting
uses: github/super-linter@v7
env:
Expand Down
3 changes: 2 additions & 1 deletion .test/config/samples.tsv
Original file line number Diff line number Diff line change
Expand Up @@ -4,4 +4,5 @@ hist2 Hjelmseryd historical low
#hist3 Hjelmseryd historical low
mod1 Gotafors modern high
mod2 Gotafors modern high
#mod3 Gotafors modern high
#mod3 Gotafors modern high
#another commented line
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@
This pipeline is aimed at processing sequencing data and calculating population
genomic statistics within a genotype likelihood framework using a common
workflow based on ANGSD and related softwares. As a primary use case of genotype
likelihood based methods is analysis of samples sequenced to low depth or from
degraded samples, processing can optionally be adapted to account for DNA
likelihood based methods is analysis of samples sequenced to low depth or
from degraded samples, processing can optionally be adapted to account for DNA
damage. It is aimed at single and multi-population analyses of samples mapped to
the same reference, so is appropriate for datasets containing individuals from a
single or multiple populations and allows for examining population structure,
Expand Down
49 changes: 19 additions & 30 deletions docs/config.md
Original file line number Diff line number Diff line change
Expand Up @@ -280,38 +280,27 @@ settings for each analysis are set in the next section.
- `admix_ngsadmix:` Perform admixture analysis with NGSadmix (`true`/`false`)
- `relatedness:` Can be performed multiple ways, set any combination of the
three options below. Note, that I've mostly incorporated these with the
R0/R1/KING kinship methods in Waples et al. 2019, _Mol. Ecol._ in mind.
These methods differ slightly from how they implement this method, and will
give slightly more/less accurate estimates of kinship depending on your
reference's relationship to your samples. `ibsrelate_ibs` uses the
probabilities of all possible genotypes, so should be the most accurate
regardless, but can use a lot of memory and take a long time with many
samples. `ibsrelate_sfs` is a bit more efficient, as it does things in a
pairwise fashion in parallel, but may be biased if the segregating alleles
in your populations are not represented in the reference. `ngsrelate` uses
several methods, one of which is similar to `ibsrelate_sfs`, but may be
less accurate due to incorporating in less data. In my experience,
NGSrelate is suitable to identify up to third degree relatives in the
dataset, but only if the exact relationship can be somewhat uncertain (i.e.
you don't need to know the difference between, say, parent/offspring and
full sibling pairs, or between second degree and third degree relatives).
IBSrelate_sfs can get you greater accuracy, but may erroneously inflate
kinship if your datset has many alleles not represented in your reference.
If you notice, for instance, a large number of third degree relatives
(KING ~0.03 - 0.07) in your dataset that is not expected, it may be worth
trying the IBS based method (`ibsrelate_ibs`).
- `ngsrelate:` Co-estimate inbreeding and pairwise relatedness with
NGSrelate (`true`/`false`)
R0/R1/KING kinship methods in Waples et al. 2019, *Mol. Ecol.* in mind, and
all three methods focus on that. `ibsrelate_ibs` uses the probabilities of
all possible genotypes, so should be the most accurate regardless, but can
use a lot of memory and take a long time with many samples. `ibsrelate_sfs`
is a bit more efficient, as it does things in a pairwise fashion in
parallel, but may be biased if the segregating alleles in your populations
are not represented in the reference. `ngsrelate` uses NGSrelate, but I have
**not** implemented the allele frequency based approaches. So, this should
be considered as a SNP based version of the IBSrelate method, which is
implemented in NGSrelate.
- `ngsrelate:` Estimate relatedness from SNPs using the IBSrelate method.
Allele frequency based co-inference of inbreeding and relatedness (the
main analyses involved in NGSrelate) may be added down the line, using the
dataset or defined populations in `samples.tsv` as the units to calculate
frequencies for. (`true`/`false`)
- `ibsrelate_ibs:` Estimate pairwise relatedness with the IBS based method
from Waples et al. 2019, _Mol. Ecol._. This can use a lot of memory, as
it has genotype likelihoods for all sites from all samples loaded into
memory, so it is done per 'chunk', which still takes a lot of time and
memory. (`true`/`false`)
for IBSrelate. Enabling this can greatly increase the time needed to build
the workflow DAG if you have many samples. (`true`/`false`)
- `ibsrelate_sfs:` Estimate pairwise relatedness with the SFS based method
from Waples et al. 2019, _Mol. Ecol._. Enabling this can greatly increase
the time needed to build the workflow DAG if you have many samples. As a
form of this method is implemented in NGSrelate, it may be more
efficient to only enable that. (`true`/`false`)
for IBSrelate. Enabling this can greatly increase the time needed to build
the workflow DAG if you have many samples. (`true`/`false`)
- `1dsfs:` Generates a one dimensional site frequency spectrum for all
populations in the sample list. Automatically enabled if `thetas_angsd` is
enabled. (`true`/`false`)
Expand Down
57 changes: 40 additions & 17 deletions workflow/rules/5.0_relatedness.smk
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,8 @@ rule compile_kinship_stats_sfs:
Compiles kinship stats for all pairs into a single table.
"""
input:
get_kinship,
get_kinship_sfs,
inds="results/datasets/{dataset}/poplists/{dataset}_all{dp}.indiv.list",
output:
"results/datasets/{dataset}/analyses/kinship/ibsrelate_sfs/{dataset}.{ref}_all{dp}_{sites}-filts.kinship",
log:
Expand Down Expand Up @@ -86,28 +87,53 @@ rule doGlf1_ibsrelate:
"""


rule ibsrelate:
rule ibsrelate_chunks:
input:
"results/datasets/{dataset}/glfs/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.glf.gz",
unpack(filt_depth),
bams=[
"results/datasets/{dataset}/bams/{ind1}.{ref}{dp}.bam",
"results/datasets/{dataset}/bams/{ind2}.{ref}{dp}.bam",
],
bais=[
"results/datasets/{dataset}/bams/{ind1}.{ref}{dp}.bam.bai",
"results/datasets/{dataset}/bams/{ind2}.{ref}{dp}.bam.bai",
],
ref="results/ref/{ref}/{ref}.fa",
output:
"results/datasets/{dataset}/analyses/kinship/ibsrelate_ibs/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.ibspair",
glf=temp(
"results/datasets/{dataset}/analyses/kinship/ibsrelate_ibs/{dataset}.{ref}_{ind1}-{ind2}{dp}_{sites}-filts.glf.gz"
),
ibspair="results/datasets/{dataset}/analyses/kinship/ibsrelate_ibs/{dataset}.{ref}_{ind1}-{ind2}{dp}_{sites}-filts.ibspair",
log:
"logs/{dataset}/angsd/ibsrelate_ibs/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.log",
"logs/{dataset}/angsd/ibsrelate_ibs/{dataset}.{ref}_{ind1}-{ind2}{dp}_{sites}-filts.log",
wildcard_constraints:
ind1="|".join([i for i in samples.index.tolist()]),
ind2="|".join([i for i in samples.index.tolist()]),
benchmark:
"benchmarks/{dataset}/angsd/ibsrelate_ibs/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.log"
"benchmarks/{dataset}/angsd/ibsrelate_ibs/{dataset}.{ref}_{ind1}-{ind2}{dp}_{sites}-filts.log"
container:
angsd_container
params:
nind=get_nind,
gl_model=config["params"]["angsd"]["gl_model"],
extra=config["params"]["angsd"]["extra"],
mapQ=config["mapQ"],
baseQ=config["baseQ"],
maxsites=config["chunk_size"],
out=lambda w, output: os.path.splitext(output[0])[0],
out=lambda w, output: os.path.splitext(output.ibspair)[0],
resources:
runtime="7d",
threads: lambda wildcards, attempt: attempt * 10
runtime="2d",
threads: lambda wildcards, attempt: attempt * 2
shell:
"""
ibs -glf {input} -model 0 -nInd {params.nind} -allpairs 1 \
-m {params.maxsites} -outFileName {params.out}
r"""
angsd -bam <(readlink -f {input.bams} | perl -pe 'chomp if eof') \
-doGlf 1 -GL {params.gl_model} -ref {input.ref} \
-nThreads {threads} {params.extra} -minMapQ {params.mapQ} \
-minQ {params.baseQ} -sites {input.sites} -out {params.out}

ibs -glf {output.glf} -model 0 -nInd 2 -allpairs 1 \
-outFileName {params.out}

sed -i 's/^0\t1\t/{wildcards.ind1}\t{wildcards.ind2}\t/' {output.ibspair}
"""


Expand All @@ -117,10 +143,7 @@ rule est_kinship_stats_ibs:
robust kinship between all sample pairings.
"""
input:
ibs=expand(
"results/datasets/{{dataset}}/analyses/kinship/ibsrelate_ibs/{{dataset}}.{{ref}}_{{population}}{{dp}}_chunk{chunk}_{{sites}}-filts.ibspair",
chunk=chunklist,
),
ibs=get_kinship_ibs,
inds="results/datasets/{dataset}/poplists/{dataset}_{population}{dp}.indiv.list",
output:
"results/datasets/{dataset}/analyses/kinship/ibsrelate_ibs/{dataset}.{ref}_{population}{dp}_{sites}-filts.kinship",
Expand Down
24 changes: 22 additions & 2 deletions workflow/rules/common.smk
Original file line number Diff line number Diff line change
Expand Up @@ -754,8 +754,8 @@ def get_ngsrelate_input(wildcards):
}


## Get all possible kinship estimate pairings
def get_kinship(wildcards):
## Get all possible kinship estimate pairings (SFS)
def get_kinship_sfs(wildcards):
sam = get_samples_from_pop("all")
if wildcards.dp != "":
sam = [s for s in sam if s not in config["drop_samples"]]
Expand All @@ -774,6 +774,26 @@ def get_kinship(wildcards):
)


## Get all possible kinship estimate pairings (IBS)
def get_kinship_ibs(wildcards):
sam = get_samples_from_pop("all")
if wildcards.dp != "":
sam = [s for s in sam if s not in config["drop_samples"]]
combos = list(itertools.combinations(sam, 2))
# sort inds alphebetically, this ensures that should new inds be added
# after generating some SFS, the reordering of the combinations won't
# lead to generating identical SFS with the individuals swapped
combos = [sorted(pair) for pair in combos]
ind1 = [pair[0] for pair in combos]
ind2 = [pair[1] for pair in combos]
return expand(
"results/datasets/{{dataset}}/analyses/kinship/ibsrelate_ibs/{{dataset}}.{{ref}}_{ind1}-{ind2}{{dp}}_{{sites}}-filts.ibspair",
zip,
ind1=ind1,
ind2=ind2,
)


# Fst


Expand Down
9 changes: 2 additions & 7 deletions workflow/scripts/kinship_ibs.R
Original file line number Diff line number Diff line change
Expand Up @@ -35,15 +35,10 @@ df['Kin'] = (df['HETHET'] - 2*(df['IBS0'])) / (df['IBS1'] + 2*df['HETHET'])

df[c('ind1','ind2')] <- str_split_fixed(df$pair, "_", 2)

df$ind1 <- as.numeric(df$ind1) + 1
df$ind2 <- as.numeric(df$ind2) + 1

df <- merge(df, inds, by.x = "ind1", by.y = 0)
df$ind1 <- df$sample
df <- merge(df, inds, by.x = "ind1", by.y = "sample")
df$pop1 <- df$population
df <- df[,c("ind1","pop1","ind2","R0","R1","Kin")]
df <- merge(df, inds, by.x = "ind2", by.y = 0)
df$ind2 <- df$sample
df <- merge(df, inds, by.x = "ind2", by.y = "sample")
df$pop2 <- df$population
df <- df[,c("ind1","ind2","pop1","pop2","R0","R1","Kin")]
df <- df[rev(order(df$Kin)),]
Expand Down