Skip to content

bansell/tidyrstats

Repository files navigation

Description

This package contains helper functions for implementing common statistical functions in R, within tidy workflows.

Install

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("bansell/tidyrstats")
library(tidyrstats)

neg_log()

This function computes the negative logarithm of a numeric input using base 10 by default. This function returns the negative logarithm of x. By default, it uses base 10, but you can specify a different base using the base argument. Designed for quickly transforming p values for statistical analysis.

Say we have summary statistics results for an RNAseq differential expression experiment

head(toptags)
#> # A tibble: 6 × 7
#>   geneID logFC AveExpr     t  P.Value    adj.P.Val     B
#>   <chr>  <dbl>   <dbl> <dbl>    <dbl>        <dbl> <dbl>
#> 1 gene_1 -4.10    7.07 -35.5 3.38e-12 0.0000000466  18.4
#> 2 gene_2 -4.35    5.66 -32.4 8.66e-12 0.0000000466  17.4
#> 3 gene_3 -4.72    6.46 -30.8 1.45e-11 0.0000000466  17.1
#> 4 gene_4 -3.77    6.29 -30.0 1.93e-11 0.0000000466  16.9
#> 5 gene_5 -5.22    4.93 -30.9 1.42e-11 0.0000000466  16.8
#> 6 gene_6 -5.31    8.86 -29.6 2.22e-11 0.0000000466  16.7

We can -log10 transform the P.Value column

toptags |> 
  # -log10 transform the P.Value column
  mutate(neg_logP = neg_log(P.Value)) 
#> # A tibble: 16,624 × 8
#>    geneID  logFC AveExpr     t  P.Value    adj.P.Val     B neg_logP
#>    <chr>   <dbl>   <dbl> <dbl>    <dbl>        <dbl> <dbl>    <dbl>
#>  1 gene_1  -4.10    7.07 -35.5 3.38e-12 0.0000000466  18.4     11.5
#>  2 gene_2  -4.35    5.66 -32.4 8.66e-12 0.0000000466  17.4     11.1
#>  3 gene_3  -4.72    6.46 -30.8 1.45e-11 0.0000000466  17.1     10.8
#>  4 gene_4  -3.77    6.29 -30.0 1.93e-11 0.0000000466  16.9     10.7
#>  5 gene_5  -5.22    4.93 -30.9 1.42e-11 0.0000000466  16.8     10.8
#>  6 gene_6  -5.31    8.86 -29.6 2.22e-11 0.0000000466  16.7     10.7
#>  7 gene_7  -3.82    6.70 -29.4 2.34e-11 0.0000000466  16.7     10.6
#>  8 gene_8   5.19    3.75  29.0 2.72e-11 0.0000000466  16.2     10.6
#>  9 gene_9  -5.66    4.42 -29.4 2.36e-11 0.0000000466  16.1     10.6
#> 10 gene_10 -5.38    6.30 -28.0 3.90e-11 0.0000000526  16.1     10.4
#> # ℹ 16,614 more rows

…then create a volcano plot:

toptags |> 
  # -log10 transform the P.Value column
  mutate(neg_logP = neg_log(P.Value)) |> 
  # volcano plot
  ggplot(aes(x=logFC,y=neg_logP)) + 
  geom_point(size=0.5)

scale_this()

This function scales and centres a numeric vector by subtracting the mean and dividing by the standard deviation.

Unlike base::scale(), it returns a numeric vector, not a matrix, making it play well with tidy data structures and dplyr chained code. Note this function does not allow control over centering or scaling at present.

scale_this` will drop NA values before scaling, by default (as is the case for base::scale()). However the function will include original NA values in the output, and generate a warning message.

iris_dat <- c(head(iris$Sepal.Length),NA)

iris_dat |>  as_tibble() |> mutate(scaled = scale_this(value))
#> Input contains NA values. These are removed before scaling.
#> # A tibble: 7 × 2
#>   value scaled
#>   <dbl>  <dbl>
#> 1   5.1  0.521
#> 2   4.9 -0.174
#> 3   4.7 -0.868
#> 4   4.6 -1.21 
#> 5   5    0.174
#> 6   5.4  1.56 
#> 7  NA   NA

scale_this() will also work with a simple numeric vector:

scale_this(iris_dat)
#> Input contains NA values. These are removed before scaling.
#> [1]  0.5206576 -0.1735525 -0.8677627 -1.2148677  0.1735525  1.5619728         NA

lm_test()

This function provides a wrapper around do(broom::tidy(lm(... , data = .)) which is very difficult to teach in a tidyverse-focused workshop; difficult for students to understand, as well as annoying to code!

lm_test() is compatible with grouped or nested input, and compatible with the native R pipe |> or magrittr pipe %>%. It accepts a quoted or unquoted model formula.

lm_test() returns a tibble containing the tidied output of lm(), sorted by p-value, which includes the following columns:

  • term: The name of the model term (such as the intercept, predictors, or interaction terms).

  • estimate: The estimated coefficient (also known as the beta value).

  • std.error: The standard error associated with the estimate.

  • statistic: The t-statistic used in hypothesis testing.

  • p.value: The p-value indicating the significance of the term.

If the input data is grouped or nested, the output will retain the group identifiers. In both cases, the grouping/nesting identifiers will appear in the left-most columns of the tibble.

For ungrouped input:

mpg %>% lm_test( cty ~ hwy )
#> Results for linear model: cty ~ hwy
#> # A tibble: 2 × 5
#>   term      estimate std.error statistic   p.value
#>   <chr>        <dbl>     <dbl>     <dbl>     <dbl>
#> 1 hwy          0.683    0.0138     49.6  1.87e-125
#> 2 intercept    0.844    0.333       2.53 1.19e-  2

For grouped input:

mpg %>% group_by(class) %>%  lm_test(cty ~ hwy * cyl + 0)
#> Results for linear model: cty ~ hwy * cyl + 0
#> # A tibble: 21 × 6
#>    class      term    estimate std.error statistic  p.value
#>    <chr>      <chr>      <dbl>     <dbl>     <dbl>    <dbl>
#>  1 suv        hwy       0.878     0.0329     26.7  1.26e-34
#>  2 compact    hwy       0.818     0.0396     20.6  3.03e-24
#>  3 midsize    hwy       0.799     0.0423     18.9  6.64e-21
#>  4 pickup     hwy       0.863     0.0399     21.6  7.46e-20
#>  5 subcompact hwy       0.817     0.0548     14.9  5.87e-16
#>  6 minivan    hwy       0.826     0.0510     16.2  2.13e- 7
#>  7 suv        hwy:cyl  -0.0437    0.0120     -3.65 5.59e- 4
#>  8 suv        cyl       0.424     0.150       2.83 6.37e- 3
#>  9 midsize    hwy:cyl  -0.0579    0.0295     -1.96 5.72e- 2
#> 10 2seater    hwy       0.353     0.132       2.68 7.48e- 2
#> # ℹ 11 more rows

For nested input with a quoted model formula:

mpg |>  nest_by(class) |>  lm_test("cty ~ hwy * cyl + 0")
#> Results for linear model: cty ~ hwy * cyl + 0
#> # A tibble: 21 × 6
#>    class      term    estimate std.error statistic  p.value
#>    <chr>      <chr>      <dbl>     <dbl>     <dbl>    <dbl>
#>  1 suv        hwy       0.878     0.0329     26.7  1.26e-34
#>  2 compact    hwy       0.818     0.0396     20.6  3.03e-24
#>  3 midsize    hwy       0.799     0.0423     18.9  6.64e-21
#>  4 pickup     hwy       0.863     0.0399     21.6  7.46e-20
#>  5 subcompact hwy       0.817     0.0548     14.9  5.87e-16
#>  6 minivan    hwy       0.826     0.0510     16.2  2.13e- 7
#>  7 suv        hwy:cyl  -0.0437    0.0120     -3.65 5.59e- 4
#>  8 suv        cyl       0.424     0.150       2.83 6.37e- 3
#>  9 midsize    hwy:cyl  -0.0579    0.0295     -1.96 5.72e- 2
#> 10 2seater    hwy       0.353     0.132       2.68 7.48e- 2
#> # ℹ 11 more rows

About

Tidy helpers for common statistical functions in R

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages