Data Provision
Data should be provided as a flat file in CSV format (i.e., comma delimited), stored locally.
At minimum, the data should contain:
- an outcome variable (binary or continuous)
- a treatment variable (binary, coded as 0 for 'control' and 1 for 'treated')
- a variable containing each observation's subgroup assignment (integer-valued)
- up to 48 other covariates (working on extending this; document issue #).
The user will be given an opportunity to provide the names of the variables that correspond to the (binary) treatment, the outcome, and the subgroup specification.
- After providing the data file, the user will be given an opportunity to preview the data. The user can use this preview to identify and provide the names of the treatment, outcome, and subgroup variables.
- If the user does not provide estimates of each observation's individual treatment effect (ITE), the subgroup-specific treatment effect is estimated as the difference between the average outcomes in the two treatment groups.
- Missing values should be left blank (i.e., they will appear as two consecutive commas in the data file).
- The first line of the data file should contain the unquoted names of the variables. There should be no row names.
- The application will attempt to generate plots for all variables provided within the dataset, so any extraneous variables should be omitted from the dataset ahead of time.
- All covariates should be numeric. String-valued covariates are ignored.
- Because the fields are comma-delimited, the decimal mark (the character that separates the integer part of a number from its fractional part) must be a period, not a comma.
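A minimal sketch of a conforming file and the default effect estimate described above. The variable names (`y`, `treat`, `subgroup`, `x1`, `x2`) are hypothetical, and base R's `read.csv()` stands in for the application's import routine:

```r
# Expected layout: unquoted header, no row names, numeric fields,
# missing values left blank (two consecutive commas).
csv <- "y,treat,subgroup,x1,x2
2.1,1,1,0.5,3
1.4,0,1,,2
3.0,1,2,1.2,1
0.9,0,2,0.8,"
dat <- read.csv(text = csv)

# Blank fields are imported as NA.
sum(is.na(dat))  # 2

# With no user-supplied ITEs, the subgroup-specific treatment effect is
# the difference between the average outcomes of treated and control units.
effects <- sapply(split(dat, dat$subgroup), function(g) {
  mean(g$y[g$treat == 1]) - mean(g$y[g$treat == 0])
})
effects
```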
[FUTURE FEATURE]
- Any string-valued covariate will be treated as a factor if it has 15 or fewer levels. Otherwise, the covariate will be omitted from plotting.
- Any numeric covariate will be treated as categorical if it has 15 or fewer levels. Otherwise, the covariate will be treated as continuous. This has implications for
  - plotting: bar chart versus histogram, respectively
  - calculating sample means: for categorical covariates, the mean will be the average of the numeric codings
- At the moment, the software can only handle binary covariates. This issue has been documented, and support for polytomous covariates will be added in the near future.
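The planned classification rule could be sketched as a small helper (the function name and the 15-level cutoff argument are hypothetical; this is not part of the current application):

```r
# Classify a covariate for plotting: string-valued covariates become
# factors (or are omitted if they have too many levels); numeric
# covariates with 15 or fewer distinct values are treated as categorical,
# otherwise as continuous.
covariate_type <- function(x, max_levels = 15) {
  n_levels <- length(unique(x))
  if (is.character(x)) {
    if (n_levels <= max_levels) "factor" else "omit"
  } else {
    if (n_levels <= max_levels) "categorical" else "continuous"
  }
}

covariate_type(c(0, 1, 1, 0))   # "categorical" -> bar chart
covariate_type(rnorm(100))      # "continuous"  -> histogram
covariate_type(letters[1:20])   # "omit" (more than 15 string levels)
```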
Many data formats exist, and unfortunately the syntax for importing data into R depends heavily on the exact format (tutorial 1, tutorial 2). CSV was chosen as the primary data format for the following reasons:
- ubiquity
- the comma delimiter is relatively standard across operating systems (as compared to the tab)
- there exist several functions for importing this format efficiently
The decision on the data import backend was between two popular packages: `data.table` (CRAN, GitHub), and `readr` (CRAN, GitHub) to import data + `dplyr` (CRAN, GitHub) to wrangle them.
There is ongoing discussion (e.g., on SO and Quora) on the pros and cons of each, including contributions from the co-developers of each package. To briefly summarize, `data.table` is faster (particularly for very large [e.g., 100GB in RAM] datasets), but `readr`/`dplyr` syntax is more readable, and the packages are better integrated within the ecosystem of R packages. According to the `readr` documentation:

> `data.table` has a function similar to `read_csv()` called `fread()`. Compared to `fread()`, `readr` functions:
> - Are slower (currently ~1.2-2x slower). If you want absolutely the best performance, use `data.table::fread()`.
> - Use a slightly more sophisticated parser, recognising both doubled (`""""`) and backslash escapes (`"\""`), and can produce factors and date/times directly.
> - Forces you to supply all parameters, where `fread()` saves you work by automatically guessing the delimiter, whether or not the file has a header, and how many lines to skip.
> - Are built on a different underlying infrastructure. `readr` functions are designed to be quite general, which makes it easier to add support for new rectangular data formats. `fread()` is designed to be as fast as possible.
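The doubled-quote convention mentioned in the quote can be seen even with base R's parser (a toy one-row file; `readr::read_csv()` accepts the same input):

```r
# A quoted field containing doubled quotes: "say ""hi""" encodes the
# string: say "hi"
txt <- 'id,msg\n1,"say ""hi"""\n'
df  <- read.csv(text = txt, stringsAsFactors = FALSE)
df$msg  # the embedded quotes are unescaped on import
```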
In light of the above considerations, and because the application's graphics are generated by `ggplot2` (with re-rendering by `plotly::ggplotly()`) and `ggvis` -- part of the tidyverse ecosystem -- it made the most sense for data wrangling to be facilitated by demonstrably compatible sibling packages within the tidyverse. However, as the application develops, we are open to transitioning to `data.table`.