zjnolen · zjnolen · Jun 28, 2024 · Sep 26, 2024 · Sep 26, 2024 · Sep 26, 2024
diff --git a/.github/workflows/main.yaml b/.github/workflows/main.yaml
@@ -21,6 +21,8 @@ jobs:
     steps:
     - name: Checkout with submodules
       uses: actions/checkout@v4
+      with:
+        fetch-depth: 0
     - name: Formatting
       uses: github/super-linter@v7
       env:

diff --git a/.test/config/samples.tsv b/.test/config/samples.tsv
@@ -4,4 +4,5 @@ hist2	Hjelmseryd	historical	low
 #hist3	Hjelmseryd	historical	low
 mod1	Gotafors	modern	high
 mod2	Gotafors	modern	high
-#mod3	Gotafors	modern	high
+#mod3	Gotafors	modern	high
+#another commented line
diff --git a/README.md b/README.md
@@ -12,8 +12,8 @@
 This pipeline is aimed at processing sequencing data and calculating population
 genomic statistics within a genotype likelihood framework using a common
 workflow based on ANGSD and related softwares. As a primary use case of genotype
-likelihood based methods is analysis of samples sequenced to low depth or from
-degraded samples, processing can optionally be adapted to account for DNA
+likelihood based methods is analysis of samples sequenced to low depth or
+from degraded samples, processing can optionally be adapted to account for DNA
 damage. It is aimed at single and multi-population analyses of samples mapped to
 the same reference, so is appropriate for datasets containing individuals from a
 single or multiple populations and allows for examining population structure,

diff --git a/docs/config.md b/docs/config.md
@@ -280,38 +280,27 @@ settings for each analysis are set in the next section.
   - `admix_ngsadmix:` Perform admixture analysis with NGSadmix (`true`/`false`)
   - `relatedness:` Can be performed multiple ways, set any combination of the
     three options below. Note, that I've mostly incorporated these with the
-    R0/R1/KING kinship methods in Waples et al. 2019, _Mol. Ecol._ in mind.
-    These methods differ slightly from how they implement this method, and will
-    give slightly more/less accurate estimates of kinship depending on your
-    reference's relationship to your samples. `ibsrelate_ibs` uses the
-    probabilities of all possible genotypes, so should be the most accurate
-    regardless, but can use a lot of memory and take a long time with many
-    samples. `ibsrelate_sfs` is a bit more efficient, as it does things in a
-    pairwise fashion in parallel, but may be biased if the segregating alleles
-    in your populations are not represented in the reference. `ngsrelate` uses
-    several methods, one of which is similar to `ibsrelate_sfs`, but may be
-    less accurate due to incorporating in less data. In my experience,
-    NGSrelate is suitable to identify up to third degree relatives in the
-    dataset, but only if the exact relationship can be somewhat uncertain (i.e.
-    you don't need to know the difference between, say, parent/offspring and
-    full sibling pairs, or between second degree and third degree relatives).
-    IBSrelate_sfs can get you greater accuracy, but may erroneously inflate
-    kinship if your datset has many alleles not represented in your reference.
-    If you notice, for instance, a large number of third degree relatives
-    (KING ~0.03 - 0.07) in your dataset that is not expected, it may be worth
-    trying the IBS based method (`ibsrelate_ibs`).
-    - `ngsrelate:` Co-estimate inbreeding and pairwise relatedness with
-      NGSrelate (`true`/`false`)
+    R0/R1/KING kinship methods in Waples et al. 2019, *Mol. Ecol.* in mind, and
+    all three methods focus on that. `ibsrelate_ibs` uses the probabilities of
+    all possible genotypes, so should be the most accurate regardless, but can
+    use a lot of memory and take a long time with many samples. `ibsrelate_sfs`
+    is a bit more efficient, as it does things in a pairwise fashion in
+    parallel, but may be biased if the segregating alleles in your populations
+    are not represented in the reference. `ngsrelate` uses NGSrelate, but I have
+    **not** implemented the allele frequency based approaches. So, this should
+    be considered as a SNP based version of the IBSrelate method, which is
+    implemented in NGSrelate.
+    - `ngsrelate:` Estimate relatedness from SNPs using the IBSrelate method.
+      Allele frequency based co-inference of inbreeding and relatedness (the
+      main analyses involved in NGSrelate) may be added down the line, using the
+      dataset or defined populations in `samples.tsv` as the units to calculate
+      frequencies for. (`true`/`false`)
     - `ibsrelate_ibs:` Estimate pairwise relatedness with the IBS based method
-      from Waples et al. 2019, _Mol. Ecol._. This can use a lot of memory, as
-      it has genotype likelihoods for all sites from all samples loaded into
-      memory, so it is done per 'chunk', which still takes a lot of time and
-      memory. (`true`/`false`)
+      for IBSrelate. Enabling this can greatly increase the time needed to build
+      the workflow DAG if you have many samples. (`true`/`false`)
     - `ibsrelate_sfs:` Estimate pairwise relatedness with the SFS based method
-      from Waples et al. 2019, _Mol. Ecol._. Enabling this can greatly increase
-      the time needed to build the workflow DAG if you have many samples. As a
-      form of this method is implemented in NGSrelate, it may be more
-      efficient to only enable that. (`true`/`false`)
+      for IBSrelate. Enabling this can greatly increase the time needed to build
+      the workflow DAG if you have many samples. (`true`/`false`)
   - `1dsfs:` Generates a one dimensional site frequency spectrum for all
     populations in the sample list. Automatically enabled if `thetas_angsd` is
     enabled. (`true`/`false`)

diff --git a/workflow/rules/5.0_relatedness.smk b/workflow/rules/5.0_relatedness.smk
@@ -31,7 +31,8 @@ rule compile_kinship_stats_sfs:
     Compiles kinship stats for all pairs into a single table.
     """
     input:
-        get_kinship,
+        get_kinship_sfs,
+        inds="results/datasets/{dataset}/poplists/{dataset}_all{dp}.indiv.list",
     output:
         "results/datasets/{dataset}/analyses/kinship/ibsrelate_sfs/{dataset}.{ref}_all{dp}_{sites}-filts.kinship",
     log:
@@ -86,28 +87,53 @@ rule doGlf1_ibsrelate:
         """
 
 
-rule ibsrelate:
+rule ibsrelate_chunks:
     input:
-        "results/datasets/{dataset}/glfs/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.glf.gz",
+        unpack(filt_depth),
+        bams=[
+            "results/datasets/{dataset}/bams/{ind1}.{ref}{dp}.bam",
+            "results/datasets/{dataset}/bams/{ind2}.{ref}{dp}.bam",
+        ],
+        bais=[
+            "results/datasets/{dataset}/bams/{ind1}.{ref}{dp}.bam.bai",
+            "results/datasets/{dataset}/bams/{ind2}.{ref}{dp}.bam.bai",
+        ],
+        ref="results/ref/{ref}/{ref}.fa",
     output:
-        "results/datasets/{dataset}/analyses/kinship/ibsrelate_ibs/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.ibspair",
+        glf=temp(
+            "results/datasets/{dataset}/analyses/kinship/ibsrelate_ibs/{dataset}.{ref}_{ind1}-{ind2}{dp}_{sites}-filts.glf.gz"
+        ),
+        ibspair="results/datasets/{dataset}/analyses/kinship/ibsrelate_ibs/{dataset}.{ref}_{ind1}-{ind2}{dp}_{sites}-filts.ibspair",
     log:
-        "logs/{dataset}/angsd/ibsrelate_ibs/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.log",
+        "logs/{dataset}/angsd/ibsrelate_ibs/{dataset}.{ref}_{ind1}-{ind2}{dp}_{sites}-filts.log",
+    wildcard_constraints:
+        ind1="|".join([i for i in samples.index.tolist()]),
+        ind2="|".join([i for i in samples.index.tolist()]),
     benchmark:
-        "benchmarks/{dataset}/angsd/ibsrelate_ibs/{dataset}.{ref}_{population}{dp}_chunk{chunk}_{sites}-filts.log"
+        "benchmarks/{dataset}/angsd/ibsrelate_ibs/{dataset}.{ref}_{ind1}-{ind2}{dp}_{sites}-filts.log"
     container:
         angsd_container
     params:
-        nind=get_nind,
+        gl_model=config["params"]["angsd"]["gl_model"],
+        extra=config["params"]["angsd"]["extra"],
+        mapQ=config["mapQ"],
+        baseQ=config["baseQ"],
         maxsites=config["chunk_size"],
-        out=lambda w, output: os.path.splitext(output[0])[0],
+        out=lambda w, output: os.path.splitext(output.ibspair)[0],
     resources:
-        runtime="7d",
-    threads: lambda wildcards, attempt: attempt * 10
+        runtime="2d",
+    threads: lambda wildcards, attempt: attempt * 2
     shell:
-        """
-        ibs -glf {input} -model 0 -nInd {params.nind} -allpairs 1 \
-            -m {params.maxsites} -outFileName {params.out}
+        r"""
+        angsd -bam <(readlink -f {input.bams} | perl -pe 'chomp if eof') \
+            -doGlf 1 -GL {params.gl_model} -ref {input.ref} \
+            -nThreads {threads} {params.extra} -minMapQ {params.mapQ} \
+            -minQ {params.baseQ} -sites {input.sites} -out {params.out}
+
+        ibs -glf {output.glf} -model 0 -nInd 2 -allpairs 1 \
+            -outFileName {params.out}
+
+        sed -i 's/^0\t1\t/{wildcards.ind1}\t{wildcards.ind2}\t/' {output.ibspair}
         """
 
 
@@ -117,10 +143,7 @@ rule est_kinship_stats_ibs:
     robust kinship between all sample pairings.
     """
     input:
-        ibs=expand(
-            "results/datasets/{{dataset}}/analyses/kinship/ibsrelate_ibs/{{dataset}}.{{ref}}_{{population}}{{dp}}_chunk{chunk}_{{sites}}-filts.ibspair",
-            chunk=chunklist,
-        ),
+        ibs=get_kinship_ibs,
         inds="results/datasets/{dataset}/poplists/{dataset}_{population}{dp}.indiv.list",
     output:
         "results/datasets/{dataset}/analyses/kinship/ibsrelate_ibs/{dataset}.{ref}_{population}{dp}_{sites}-filts.kinship",

diff --git a/workflow/rules/common.smk b/workflow/rules/common.smk
@@ -754,8 +754,8 @@ def get_ngsrelate_input(wildcards):
     }
 
 
-## Get all possible kinship estimate pairings
-def get_kinship(wildcards):
+## Get all possible kinship estimate pairings (SFS)
+def get_kinship_sfs(wildcards):
     sam = get_samples_from_pop("all")
     if wildcards.dp != "":
         sam = [s for s in sam if s not in config["drop_samples"]]
@@ -774,6 +774,26 @@ def get_kinship(wildcards):
     )
 
 
+## Get all possible kinship estimate pairings (IBS)
+def get_kinship_ibs(wildcards):
+    sam = get_samples_from_pop("all")
+    if wildcards.dp != "":
+        sam = [s for s in sam if s not in config["drop_samples"]]
+    combos = list(itertools.combinations(sam, 2))
+    # sort inds alphebetically, this ensures that should new inds be added
+    # after generating some SFS, the reordering of the combinations won't
+    # lead to generating identical SFS with the individuals swapped
+    combos = [sorted(pair) for pair in combos]
+    ind1 = [pair[0] for pair in combos]
+    ind2 = [pair[1] for pair in combos]
+    return expand(
+        "results/datasets/{{dataset}}/analyses/kinship/ibsrelate_ibs/{{dataset}}.{{ref}}_{ind1}-{ind2}{{dp}}_{{sites}}-filts.ibspair",
+        zip,
+        ind1=ind1,
+        ind2=ind2,
+    )
+
+
 # Fst
 
 

diff --git a/workflow/scripts/kinship_ibs.R b/workflow/scripts/kinship_ibs.R
@@ -35,15 +35,10 @@ df['Kin'] = (df['HETHET'] - 2*(df['IBS0'])) / (df['IBS1'] + 2*df['HETHET'])
 
 df[c('ind1','ind2')] <- str_split_fixed(df$pair, "_", 2)
 
-df$ind1 <- as.numeric(df$ind1) + 1
-df$ind2 <- as.numeric(df$ind2) + 1
-
-df <- merge(df, inds, by.x = "ind1", by.y = 0)
-df$ind1 <- df$sample
+df <- merge(df, inds, by.x = "ind1", by.y = "sample")
 df$pop1 <- df$population
 df <- df[,c("ind1","pop1","ind2","R0","R1","Kin")]
-df <- merge(df, inds, by.x = "ind2", by.y = 0)
-df$ind2 <- df$sample
+df <- merge(df, inds, by.x = "ind2", by.y = "sample")
 df$pop2 <- df$population
 df <- df[,c("ind1","ind2","pop1","pop2","R0","R1","Kin")]
 df <- df[rev(order(df$Kin)),]