Commit 2fdb2d0

Aleksandr Popov authored
Merge pull request #247 from immunomind/dev
2 parents a6a5059 + 2b6d707 commit 2fdb2d0

63 files changed: +335 -340 lines changed

DESCRIPTION

+2 -2
@@ -1,7 +1,7 @@
 Package: immunarch
 Type: Package
 Title: Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires
-Version: 0.6.8
+Version: 0.6.9
 Authors@R: c(
 person("Vadim I.", "Nazarov", , "[email protected]", c("aut", "cre")),
 person("Vasily O.", "Tsvetkov", , role = "aut"),
@@ -83,6 +83,6 @@ Suggests:
 rmarkdown
 VignetteBuilder: knitr
 Encoding: UTF-8
-RoxygenNote: 7.1.2
+RoxygenNote: 7.2.0
 LazyData: true
 LazyDataCompression: xz

R/align_lineage.R

+5 -5
@@ -1,6 +1,6 @@
-#' This function aligns all sequences incliding germline that belong to one clonal lineage and one cluster.
-#' After clustering, building clonal lineage and germline, the next step is to analyze the degree of mutation
-#' and maturity of each clonal lineage. This allows you to find high mature cells and cells with a large
+#' This function aligns all sequences (incliding germline) that belong to one clonal lineage and one cluster.
+#' After clustering and building the clonal lineage and germline, the next step is to analyze the degree of mutation
+#' and maturity of each clonal lineage. This allows for finding high mature cells and cells with a large
 #' number of offspring. The phylogenetic analysis will find mutations that increase the affinity of BCR.
 #' Making alignment of the sequence is the first step towards sequence analysis including BCR.
 #'
@@ -34,7 +34,7 @@
 #' (will be saved in output table only if .verbose_output parameter is set to TRUE).
 #'
 #' @param .prepare_threads Number of threads to prepare results table.
-#' High number can cause heavy memory usage!
+#' Please note that high number can cause heavy memory usage!
 #'
 #' @param .align_threads Number of threads for lineage alignment.
 #'
@@ -47,7 +47,7 @@
 #' increases memory usage. If FALSE, only aligned clusters and columns required for repClonalFamily() calculation
 #' will be included in the output.
 #'
-#' @param .nofail Return NA instead of stopping if Clustal W is not installed.
+#' @param .nofail Will return NA instead of stopping if Clustal W is not installed.
 #' Used to avoid raising errors in examples on computers where Clustal W is not installed.
 #'
 #' @return

R/clonality.R

+2 -2
@@ -34,10 +34,10 @@
 #' such as 10, 100 and so on.
 #'
 #' Set \code{"rare"} to estimate relative abundance for the groups of rare clonotypes
-#' with low counts. Use \code{".bound"} to define the boundaries of clonotype groups.
+#' with low counts. Use \code{".bound"} to define the threshold of clonotype groups.
 #'
 #' @param .perc A single numerical value ranging from 0 to 100.
-#' @param .clone.types A named numerical vector with the boundaries of the half-closed
+#' @param .clone.types A named numerical vector with the threshold of the half-closed
 #' intervals that mark off clonal groups.
 #' @param .head A numerical vector with ranges of the top clonotypes.
 #' @param .bound A numerical vector with ranges of abundance for the rare clonotypes in

R/clustering.R

+9 -9
@@ -8,7 +8,7 @@
 #' @importFrom factoextra hcut fviz_nbclust
 #' @importFrom stats kmeans as.dist cmdscale dist
 #'
-#' @description Cluster the data with one of the following methods:
+#' @description Clusters the data with one of the following methods:
 #'
 #' - \code{immunr_hclust} clusters the data using the hierarchical clustering from \link[factoextra]{hcut};
 #'
@@ -26,7 +26,7 @@
 #'
 #' @param .data Matrix or data frame with features, distance matrix or output from \link{repOverlapAnalysis} or \link{geneUsageAnalysis} functions.
 #'
-#' @param .k The number of clusters to create, passed as \code{k} to \link[factoextra]{hcut} or as \code{centers} to \link{kmeans}.
+#' @param .k The number of clusters to create, defined as \code{k} to \link[factoextra]{hcut} or as \code{centers} to \link{kmeans}.
 #'
 #' @param .k.max Limits the maximum number of clusters. It is passed as \code{k.max} to \link{fviz_nbclust} for \code{immunr_hclust} and \code{immunr_kmeans}.
 #'
@@ -41,15 +41,15 @@
 #' @param .dist If TRUE then ".data" is expected to be a distance matrix. If FALSE then the euclidean distance is computed for the input objects.
 #'
 #' @return
-#' \code{immunr_hclust} - list with two elements. First element is an output from \link{hcut}.
-#' Second element is an output from \link{fviz_nbclust}
+#' \code{immunr_hclust} - list with two elements. The first element is an output from \link{hcut}.
+#' The second element is an output from \link{fviz_nbclust}
 #'
-#' \code{immunr_kmeans} - list with three elements. First element is an output from \link{kmeans}.
-#' Second element is an output from \link{fviz_nbclust}.
-#' Third element is the input dataset \code{.data}.
+#' \code{immunr_kmeans} - list with three elements. The first element is an output from \link{kmeans}.
+#' The second element is an output from \link{fviz_nbclust}.
+#' The third element is the input dataset \code{.data}.
 #'
-#' \code{immunr_dbscan} - list with two elements. First element is an output from \link{dbscan}.
-#' Second element is the input dataset \code{.data}.
+#' \code{immunr_dbscan} - list with two elements. The first element is an output from \link{dbscan}.
+#' The second element is the input dataset \code{.data}.
 #'
 #' @examples
 #' data(immdata)

R/data_docs.R

+5 -5
@@ -68,8 +68,8 @@ AA_TABLE_REVERSED <- AA_TABLE_REVERSED[order(names(AA_TABLE_REVERSED))]
 #'
 #' @description A dataset with single chain TCR data for testing and examplatory purposes.
 #'
-#' @format A list of two elements. First element ("data") is a list with data frames with clonotype tables.
-#' Second element ("meta") is a metadata table.
+#' @format A list of two elements. The first element ("data") is a list with data frames with clonotype tables.
+#' The second element ("meta") is a metadata table.
 #' \describe{
 #' \item{data}{List of immune repertoire data frames.}
 #' \item{meta}{Metadata}
@@ -84,9 +84,9 @@ AA_TABLE_REVERSED <- AA_TABLE_REVERSED[order(names(AA_TABLE_REVERSED))]
 #'
 #' @description A dataset with BCR data for testing and examplatory purposes.
 #'
-#' @format A list of two elements. First element ("data") is a list of 1 element named "full_clones"
+#' @format A list of two elements. The first element ("data") is a list of 1 element named "full_clones"
 #' that contains immune repertoire data frame.
-#' Second element ("meta") is empty metadata table.
+#' The second element ("meta") is empty metadata table.
 #' \describe{
 #' \item{data}{List of immune repertoire data frames.}
 #' \item{meta}{Metadata}
@@ -101,7 +101,7 @@ AA_TABLE_REVERSED <- AA_TABLE_REVERSED[order(names(AA_TABLE_REVERSED))]
 #'
 #' @description A dataset with paired chain IG data for testing and examplatory purposes.
 #'
-#' @format A list of four elements.
+#' @format A list of four elements:
 #' "data" is a list with data frames with clonotype tables.
 #' "meta" is a metadata table.
 #' "bc_patients" is a list of barcodes corresponding to specific patients.

R/dimensions.R

+4 -4
@@ -17,7 +17,7 @@ default_scale_fun <- function(x) {
 #'
 #' @aliases immunr_pca immunr_mds immunr_tsne
 #'
-#' @description Collect a set of principal variables, reducing the number of not important variables
+#' @description Collects a set of principal variables, reducing the number of not important variables
 #' to analyse. Dimensionality reduction makes data analysis algorithms work faster and
 #' sometimes more accurate, since it also reduces noise in the data. Currently available
 #' methods are:
@@ -44,13 +44,13 @@ default_scale_fun <- function(x) {
 #' @param .perp The perplexity parameter for \link[Rtsne]{Rtsne}. Sepcifies the number
 #' of neighbours each data point must have in the resulting plot.
 #'
-#' @param .raw If TRUE then return non-processed output from dimensionality reduction
+#' @param .raw If TRUE then returns the non-processed output from dimensionality reduction
 #' algorithms. Pass FALSE if you want to visualise results.
 #'
-#' @param .orig If TRUE then return the original result from algorithms. Pass FALSE
+#' @param .orig If TRUE then returns the original result from algorithms. Pass FALSE
 #' if you want to visualise results.
 #'
-#' @param .dist If TRUE then assume ".data" is a distance matrix.
+#' @param .dist If TRUE then assumes that ".data" is a distance matrix.
 #'
 #' @param ... Other parameters passed to \link[Rtsne]{Rtsne}.
 #'

R/distance.R

+4 -4
@@ -18,17 +18,17 @@
 #'
 #' Every object must have columns in the immunarch compatible format \link{immunarch_data_format}
 #'
-#' @param .col A string that specifies the column name to be processed. Default value is 'CDR3.nt'.
+#' @param .col A string that specifies the column name to be processed. The default value is 'CDR3.nt'.
 #'
 #' @param .method Character value or user-defined function.
 #'
-#' @param .group_by Character vector of column names to group sequence by. Default value is c("V.first", "J.first"). Columns "V.first" and "J.first" containing first genes without allele suffixes are calculated automatically from "V.name" and "J.name" if absent in the data. Pass NA for no grouping options.
+#' @param .group_by Character vector of column names to group sequence by. The default value is c("V.first", "J.first"). Columns "V.first" and "J.first" containing first genes without allele suffixes are calculated automatically from "V.name" and "J.name" if absent in the data. Pass NA for no grouping options.
 #'
-#' @param .group_by_seqLength If TRUE - add grouping by sequence length of .col argument
+#' @param .group_by_seqLength If TRUE - adds grouping by sequence length of .col argument
 #'
 #' @param ... Extra arguments for user-defined function.
 #'
-#' Default value is \code{'hamming'} for Hamming distance which counts the number of character substitutions that turns b into a.
+#' The default value is \code{'hamming'} for Hamming distance which counts the number of character substitutions that turns b into a.
 #' If a and b have different number of characters the distance is Inf.
 #'
 #' Other possible values are:

R/diversity.R

+18 -16
@@ -3,7 +3,7 @@ if (getRversion() >= "2.15.1") {
 }
 
 
-#' Main function for immune repertoire diversity estimation
+#' The main function for immune repertoire diversity estimation
 #'
 #' @concept diversity
 #'
@@ -16,8 +16,7 @@ if (getRversion() >= "2.15.1") {
 #' @importFrom rlang sym
 #'
 #' @description
-#' This is a utility function to estimate the diversity of species or objects
-#' in the given distribution.
+#' This is a utility function to estimate the diversity of species or objects in the given distribution.
 #'
 #' Note: functions will check if .data is a distribution of a random variable (sum == 1) or not.
 #' To force normalisation and / or to prevent this, set .do.norm to TRUE (do normalisation)
@@ -35,7 +34,7 @@ if (getRversion() >= "2.15.1") {
 #'
 #' Note: each connection must represent a separate repertoire.
 #'
-#' @param .method Pick a method used for estimation out of a following list: chao1,
+#' @param .method Picks a method used for estimation out of a following list: chao1,
 #' hill, div, gini.simp, inv.simp, gini, raref, d50, dxx.
 #' @param .col A string that specifies the column(s) to be processed. Pass one of the
 #' following strings, separated by the plus sign: "nt" for nucleotide sequences,
@@ -51,11 +50,11 @@ if (getRversion() >= "2.15.1") {
 #' @param .extrapolation An integer. An upper limit for the number of clones to extrapolate to.
 #' Pass 0 (zero) to turn extrapolation subroutines off.
 #' @param .perc Set the percent to dXX index measurement.
-#' @param .norm Normalise rarefaction curves.
-#' @param .verbose If TRUE then output progress.
-#' @param .do.norm One of the three values - NA, TRUE or FALSE. If NA then check for distrubution (sum(.data) == 1)
-#' and normalise if needed with the given laplace correction value. if TRUE then do normalisation and laplace
-#' correction. If FALSE then don't do normalisaton and laplace correction.
+#' @param .norm Normalises rarefaction curves.
+#' @param .verbose If TRUE then outputs progress.
+#' @param .do.norm One of the three values - NA, TRUE or FALSE. If NA then checks for distrubution (sum(.data) == 1)
+#' and normalises if needed with the given laplace correction value. if TRUE then does normalisation and laplace
+#' correction. If FALSE then doesn't do neither normalisaton nor laplace correction.
 #' @param .laplace A numeric value, which is used as a pseudocount for Laplace
 #' smoothing.
 #'
@@ -362,7 +361,7 @@ rarefaction <- function(.data, .step = NA, .quantile = c(.025, .975),
 }
 
 if (.verbose) {
-pb <- set_pb(sum(sapply(1:length(.data), function(i) {
+pb <- set_pb(sum(sapply(seq_along(.data), function(i) {
 bc.vec <- .data[[i]]
 bc.sum <- sum(.data[[i]])
 sizes <- seq(.step, bc.sum, .step)
@@ -373,10 +372,13 @@ rarefaction <- function(.data, .step = NA, .quantile = c(.025, .975),
 })))
 }
 
-muc.list <- lapply(1:length(.data), function(i) {
+muc.list <- lapply(seq_along(.data), function(i) {
 Sobs <- length(.data[[i]])
 bc.vec <- .data[[i]]
-Sest <- chao1(bc.vec)
+Sest <- chao1(bc.vec)[1]
+if (is.na(Sest)) {
+Sest <- Sobs
+}
 n <- sum(bc.vec)
 sizes <- seq(.step, n, .step)
 # if (sizes[length(sizes)] != n) {
@@ -389,11 +391,11 @@ rarefaction <- function(.data, .step = NA, .quantile = c(.025, .975),
 alphas <- sapply(freqs, function(k) .alpha(n, k, sz))
 
 # poisson
-Sind <- sum(sapply(1:length(freqs), function(k) (1 - alphas[k]) * counts[k]))
-if (Sest[1] == Sobs) {
+Sind <- sum(sapply(seq_along(freqs), function(k) (1 - alphas[k]) * counts[k]))
+if (Sest == Sobs) {
 SD <- 0
 } else {
-SD <- sqrt(sum(sapply(1:length(freqs), function(k) (1 - alphas[k])^2 * counts[k])) - Sind^2 / Sest[1])
+SD <- sqrt(sum(sapply(seq_along(freqs), function(k) (1 - alphas[k])^2 * counts[k])) - Sind^2 / Sest[1])
 }
 t <- Sind - Sobs
 if (t != 0) {
@@ -419,7 +421,7 @@ rarefaction <- function(.data, .step = NA, .quantile = c(.025, .975),
 )
 if (length(sizes) != 1) {
 ex.res <- t(sapply(sizes, function(sz) {
-f0 <- Sest[1] - Sobs
+f0 <- Sest - Sobs
 f1 <- counts["1"]
 if (is.na(f1) || f0 == 0) {
 Sind <- Sobs
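
Note on the R/diversity.R change: replacing `1:length(x)` with `seq_along(x)` matters mostly for empty input, and the new `is.na(Sest)` check falls back to the observed count when the estimate is missing. The sketch below illustrates both behaviours in base R only. The names `empty`, `estimate` and `observed` are hypothetical stand-ins and are not taken from the package; in particular, `estimate` is not the real output of `chao1()`.

```r
# Illustrative only: `empty` stands in for a zero-length input list.
empty <- list()

# 1:length(x) counts down from 1 to 0 when x is empty, so a loop over it
# would run twice with invalid indices.
bad_idx <- 1:length(empty)

# seq_along(x) yields a zero-length integer vector, so nothing is iterated.
good_idx <- seq_along(empty)

print(bad_idx)   # [1] 1 0
print(good_idx)  # integer(0)

# Fallback pattern: take the first element of an estimator result and use
# the observed count when that element is NA.
# `estimate` is a hypothetical vector, not the package's chao1() output.
estimate <- c(NA_real_, 0, 0, 0)
observed <- 5L
s_est <- estimate[1]
if (is.na(s_est)) {
  s_est <- observed
}
print(s_est)  # [1] 5
```

Reducing the estimate to a scalar up front also keeps later comparisons such as `Sest == Sobs` length-one, which is what `if` expects.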

R/dynamics.R

+5 -5
@@ -8,7 +8,7 @@
 #' @aliases trackClonotypes
 #'
 #' @description
-#' Track the temporal dynamics of clonotypes in repertoires. For example, tracking across multiple
+#' Tracks the temporal dynamics of clonotypes in repertoires. For example, tracking across multiple
 #' time points after vaccination.
 #'
 #' @param .data The data to process. It can be a \link{data.frame}, a
@@ -25,12 +25,12 @@
 #'
 #' @param .which An argument that regulates which clonotypes to choose for tracking. There are three options for this argument:
 #'
-#' 1) pass a list with two elements \code{list(X, Y)}, where \code{X} is the name or the index of a target repertoire from ".data", and
+#' 1) passes a list with two elements \code{list(X, Y)}, where \code{X} is the name or the index of a target repertoire from ".data", and
 #' \code{Y} is the number of the most abundant clonotypes to take from \code{X}.
 #'
-#' 2) pass a character vector of sequences to take from all data frames;
+#' 2) passes a character vector of sequences to take from all data frames;
 #'
-#' 3) pass a data frame (data table, database) with one or more columns - first for sequences, and other for gene segments (if applicable).
+#' 3) passes a data frame (data table, database) with one or more columns - first for sequences, and other for gene segments (if applicable).
 #'
 #' See the "Examples" below with examples for each option.
 #'
@@ -40,7 +40,7 @@
 #' sequences with Joining genes, or any combination of the above.
 #' Used only if ".which" has option 1) or option 2).
 #'
-#' @param .norm Logical. If TRUE then use Proportion instead of the number of Clones per clonotype to store
+#' @param .norm Logical. If TRUE then uses Proportion instead of the number of Clones per clonotype to store
 #' in the function output.
 #'
 #' @description

R/filters.R

+20 -14
@@ -8,37 +8,37 @@
 #' @importFrom tidyselect starts_with
 #'
 #' @param .data The data to be processed. Must be the list of 2 elements:
-#' data table and metadata table.
+#' a data table and a metadata table.
 #' @param .method Method of filtering. Implemented methods:
 #' by.meta, by.repertoire (by.rep), by.clonotype (by.cl)
 #' Default value: 'by.clonotype'.
 #' @param .query Filtering query. It's a named list of filters that will be applied
 #' to data.
 #' Possible values for names in this list are dependent on filter methods:
-#' - by.meta: filter by metadata. Names in the named list are metadata column headers.
-#' - by.repertoire: filter by number of clonotypes or total number of clones in sample.
+#' - by.meta: filters by metadata. Names in the named list are metadata column headers.
+#' - by.repertoire: filters by the number of clonotypes or total number of clones in sample.
 #' Possible names in the named list are "n_clonotypes" and "n_clones".
-#' - by.clonotype: filter by data in all samples. Names in the named list are
+#' - by.clonotype: filters by data in all samples. Names in the named list are
 #' data column headers.
 #' Elements of the named list for each of the filters are filtering options.
 #' Possible values for filtering options:
-#' - include("STR1", "STR2", ...): keep only rows with matching values.
+#' - include("STR1", "STR2", ...): keeps only rows with matching values.
 #' Available for methods: "by.meta", "by.clonotype".
-#' - exclude("STR1", "STR2", ...): remove rows with matching values.
+#' - exclude("STR1", "STR2", ...): removes rows with matching values.
 #' Available for methods: "by.meta", "by.clonotype".
-#' - lessthan(value): keep rows/samples with numeric values less than specified.
+#' - lessthan(value): keeps rows/samples with numeric values less than specified.
 #' Available for methods: "by.meta", "by.repertoire", "by.clonotype".
-#' - morethan(value): keep rows/samples with numeric values more than specified.
+#' - morethan(value): keeps rows/samples with numeric values more than specified.
 #' Available for methods: "by.meta", "by.repertoire", "by.clonotype".
-#' - interval(from, to): keep rows/samples with numeric values that fits in this interval.
+#' - interval(from, to): keeps rows/samples with numeric values that fits in this interval.
 #' from is inclusive, to is exclusive.
 #' Available for methods: "by.meta", "by.repertoire", "by.clonotype".
 #' Default value: 'list(CDR3.aa = exclude("partial", "out_of_frame"))'.
 #' @param .match Matching method for "include" and "exclude" options in query.
 #' Possible values:
-#' - exact: match only the exact specified string;
-#' - startswith: match all strings starting with the specified substring;
-#' - substring: match all strings containing the specified substring.
+#' - exact: matches only the exact specified string;
+#' - startswith: matches all strings starting with the specified substring;
+#' - substring: matches all strings containing the specified substring.
 #' Default value: 'exact'.
 #'
 #' @examples
@@ -227,9 +227,15 @@ filter_table <- function(.table, .column_name, .query_type, .query_args, .match)
 if (.match == "exact") {
 .table %<>% subset(!get(.column_name) %in% .query_args)
 } else if (.match == "startswith") {
-.table <- .table[-startswith_rows(.table, .column_name, .query_args), ]
+matching_rows <- startswith_rows(.table, .column_name, .query_args)
+if (length(matching_rows) > 0) {
+.table <- .table[-matching_rows, ]
+}
 } else if (.match == "substring") {
-.table <- .table[-substring_rows(.table, .column_name, .query_args), ]
+matching_rows <- substring_rows(.table, .column_name, .query_args)
+if (length(matching_rows) > 0) {
+.table <- .table[-matching_rows, ]
+}
 }
 } else if (.query_type == "lessthan") {
 .table %<>% subset(get(.column_name) < as_numeric_or_fail(.query_args))
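
Note on the R/filters.R change: the new `length(matching_rows) > 0` guard appears to address a common R pitfall, where negating an empty index vector drops every row instead of keeping them all. The sketch below reproduces the pitfall in base R. The data frame `tbl`, its column names, and the use of `startsWith()` are assumptions made for illustration; the package's own `startswith_rows()` and `substring_rows()` helpers are not shown in this diff and may behave differently.

```r
# Illustrative data frame; column names are made up for this sketch.
tbl <- data.frame(
  V.name = c("TRBV1", "TRBV2", "TRBV3"),
  Clones = c(10, 5, 1)
)

# Suppose no row matches the exclusion prefix, so the match is empty.
matching_rows <- which(startsWith(tbl$V.name, "IGH"))

# -integer(0) is still integer(0), so this selects zero rows rather than
# "all rows except none".
unguarded <- tbl[-matching_rows, ]

# Guarded version, following the pattern in the hunk above.
guarded <- tbl
if (length(matching_rows) > 0) {
  guarded <- guarded[-matching_rows, ]
}

print(nrow(unguarded))  # [1] 0
print(nrow(guarded))    # [1] 3
```

An alternative that avoids the guard is logical indexing, for example `tbl[!startsWith(tbl$V.name, "IGH"), ]`, which keeps all rows when nothing matches.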
