Data Provision
Data should be provided as a flat file in CSV format (i.e., comma delimited), stored locally.
At minimum, the data should contain:
- an outcome variable (binary or continuous)
- a treatment variable (binary, coded as 0 for 'control' and 1 for 'treated')
- a variable containing each observation's subgroup assignment (integer-valued)
- up to 48 other covariates (working on extending this; document issue #).
The user will be given an opportunity to provide the names of the variables that correspond to the (binary) treatment, the outcome, and the subgroup specification.
- After providing the data file, the user will be given an opportunity to preview the data. The user can use this preview to identify and provide the names of the treatment, outcome, and subgroup variables.
- If the user does not provide estimates of each observation's individual treatment effect (ITE), the subgroup-specific treatment effect is estimated as the difference between the average outcomes in the two treatment groups.
- Missing values should be left blank (i.e., they will appear as two consecutive commas in the data file).
- The first line of the data file should contain the unquoted names of the variables. There should be no row names.
- The application will attempt to generate plots for all variables provided within the dataset, so any extraneous variables should be omitted from the dataset ahead of time.
- All covariates should be numeric. String-valued covariates are ignored.
- Because the fields are comma-delimited, the decimal mark (the character that separates the integer part of a number from its fractional part) must be a period, not a comma.
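A minimal sketch of a conforming file and the default effect estimate described above. The variable names (`y`, `treat`, `subgroup`, `x1`, `x2`) are hypothetical, and base R's `read.csv()` stands in for the application's import routine:

```r
# Expected layout: unquoted header, no row names, numeric fields,
# missing values left blank (two consecutive commas).
csv <- "y,treat,subgroup,x1,x2
2.1,1,1,0.5,3
1.4,0,1,,2
3.0,1,2,1.2,1
0.9,0,2,0.8,"
dat <- read.csv(text = csv)

# Blank fields are imported as NA.
sum(is.na(dat))  # 2

# With no user-supplied ITEs, the subgroup-specific treatment effect is
# the difference between the average outcomes of treated and control units.
effects <- sapply(split(dat, dat$subgroup), function(g) {
  mean(g$y[g$treat == 1]) - mean(g$y[g$treat == 0])
})
effects
```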
[FUTURE FEATURE]
- Any string-valued covariate will be treated as a factor if it has 15 or fewer levels. Otherwise, the covariate will be omitted from plotting.
- Any numeric covariate will be treated as categorical if it has 15 or fewer levels. Otherwise, the covariate will be treated as continuous. This has implications for
  - plotting: bar chart versus histogram, respectively
  - calculating sample means: for categorical covariates, the mean will be the average of the numeric codings
- At the moment, the software can only handle binary covariates. This issue has been documented, and support for polytomous covariates will be added in the near future.
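The planned classification rule could be sketched as a small helper (the function name and the 15-level cutoff argument are hypothetical; this is not part of the current application):

```r
# Classify a covariate for plotting: string-valued covariates become
# factors (or are omitted if they have too many levels); numeric
# covariates with 15 or fewer distinct values are treated as categorical,
# otherwise as continuous.
covariate_type <- function(x, max_levels = 15) {
  n_levels <- length(unique(x))
  if (is.character(x)) {
    if (n_levels <= max_levels) "factor" else "omit"
  } else {
    if (n_levels <= max_levels) "categorical" else "continuous"
  }
}

covariate_type(c(0, 1, 1, 0))   # "categorical" -> bar chart
covariate_type(rnorm(100))      # "continuous"  -> histogram
covariate_type(letters[1:20])   # "omit" (more than 15 string levels)
```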
Many data formats exist, and unfortunately the syntax for importing data into R depends heavily on the exact format (tutorial 1, tutorial 2). CSV was chosen as the primary data format for the following reasons:
- ubiquity
- the comma delimiter is relatively standard across operating systems (as compared to the tab)
- there exist several functions for importing this format efficiently
The decision on the data import backend was between two popular packages: `data.table` (CRAN, GitHub), and `readr` (CRAN, GitHub) to import data + `dplyr` (CRAN, GitHub) to wrangle them.
There is ongoing discussion (e.g., on SO and Quora) on the pros and cons of each, including contributions from the co-developers of each package. To briefly summarize, `data.table` is faster (particularly for very large [e.g., 100GB in RAM] datasets), but `readr`/`dplyr` syntax is more readable, and the packages are better integrated within the ecosystem of R packages. According to the `readr` documentation:

> `data.table` has a function similar to `read_csv()` called `fread()`. Compared to `fread()`, `readr` functions:
> - Are slower (currently ~1.2-2x slower). If you want absolutely the best performance, use `data.table::fread()`.
> - Use a slightly more sophisticated parser, recognising both doubled (`""""`) and backslash escapes (`"\""`), and can produce factors and date/times directly.
> - Forces you to supply all parameters, where `fread()` saves you work by automatically guessing the delimiter, whether or not the file has a header, and how many lines to skip.
> - Are built on a different underlying infrastructure. `readr` functions are designed to be quite general, which makes it easier to add support for new rectangular data formats. `fread()` is designed to be as fast as possible.
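The doubled-quote convention mentioned in the quote can be seen even with base R's parser (a toy one-row file; `readr::read_csv()` accepts the same input):

```r
# A quoted field containing doubled quotes: "say ""hi""" encodes the
# string: say "hi"
txt <- 'id,msg\n1,"say ""hi"""\n'
df  <- read.csv(text = txt, stringsAsFactors = FALSE)
df$msg  # the embedded quotes are unescaped on import
```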
In light of the above considerations, and because the application's graphics are generated by `ggplot2` (with re-rendering by `plotly::ggplotly()`) and `ggvis` -- part of the tidyverse ecosystem -- it made the most sense for data wrangling to be facilitated by demonstrably compatible sibling packages within the tidyverse. However, as the application develops, we are open to transitioning to `data.table`.