content/post/smallset_timelines.md
+30 -4
Original file line number
Diff line number
Diff line change
@@ -1,6 +1,6 @@
1
1
---
2
2
title: "Smallset Timelines for Communicating Data Preprocessing Decisions"
3
-
description: "Data preprocessing is messy and nuanced but full of consequential decisions, a cartoon strip can be generated for your preprocessing to help understanding and reproduction."
3
+
description: "Data preprocessing is messy and nuanced but full of consequential decisions; a `preprocessing cartoon strip` can be generated to help illustrate these decisions."
4
4
date: "2024-11-18"
5
5
draft: false
6
6
categories:
@@ -14,11 +14,12 @@ tags:
14
14
##### Posted by _Lexing Xie_ and _Lydia Lucchesi_.
Smallset Timelines, and the associated [R package](https://cloud.r-project.org/web/packages/smallsets/index.html) [smallsets](https://lydialucchesi.github.io/smallsets/), facilitate visual documentation of data preprocessing.
20
+
Smallset Timelines, and the associated <a href="https://cloud.r-project.org/web/packages/smallsets/index.html">R package</a> <a href="https://lydialucchesi.github.io/smallsets/">smallsets</a>, facilitate visual documentation of data preprocessing.
21
21
22
+
<p>
22
23
<!--more-->
23
24
24
25
<br/>
@@ -47,8 +48,28 @@ We will conclude this overview with <a href="#notebook">an example notebook</a>
47
48
48
49
#### **Example 1: eBird Data in Citizen Science**
49
50
51
+
We examine the eBird database, a citizen science program with millions of bird
52
+
sightings from across the globe [Sullivan et al., 2009]. Citizen scientists upload their bird
53
+
sightings by completing an eBird checklist form. The form collects information about every bird observed during an observation period. As noted on the eBird website, to date the
54
+
eBird data has been used in over 930 publications.
55
+
56
+
Johnston et al. [2021] recommend a series of best practices for using citizen science data.
57
+
These recommendations are based on an eBird case study that explored the effects of different
58
+
data preparations on statistical inference. The authors found that the combination of using
59
+
complete checklists only, spatial subsampling, effort filters, and effort covariates produced
60
+
the strongest modelling result. As a supplement to the study, Strimas-Mackey et al. [2023]
61
+
produced the guide “Best Practices for Using eBird Data,” which provides a step-by-step
62
+
implementation of the study’s recommendations in the R programming language.
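
To make the preparation steps above concrete, here is a minimal Python sketch of three of them: keeping complete checklists only, applying effort filters, and spatial subsampling. It is an illustration only, not the R code from the guide. The field names (`complete`, `duration_min`, `distance_km`, `observers`, `cell`), the toy records, and the threshold values are all assumptions made for this example.

```python
import random

# Toy checklist records. Field names and values are hypothetical,
# chosen only to illustrate the filtering logic.
checklists = [
    {"id": "c1", "complete": True, "duration_min": 60, "distance_km": 1.2, "observers": 2, "cell": "A"},
    {"id": "c2", "complete": False, "duration_min": 30, "distance_km": 0.5, "observers": 1, "cell": "A"},
    {"id": "c3", "complete": True, "duration_min": 400, "distance_km": 2.0, "observers": 3, "cell": "B"},
    {"id": "c4", "complete": True, "duration_min": 90, "distance_km": 8.0, "observers": 1, "cell": "B"},
    {"id": "c5", "complete": True, "duration_min": 45, "distance_km": 0.0, "observers": 12, "cell": "C"},
    {"id": "c6", "complete": True, "duration_min": 120, "distance_km": 3.0, "observers": 4, "cell": "A"},
    {"id": "c7", "complete": True, "duration_min": 20, "distance_km": 0.3, "observers": 1, "cell": "C"},
]


def apply_effort_filters(rows, max_duration_min=300, max_distance_km=5.0, max_observers=10):
    """Keep complete checklists whose effort is within the given limits.

    The default limits are illustrative assumptions, not values taken
    from the post.
    """
    return [
        r for r in rows
        if r["complete"]
        and r["duration_min"] <= max_duration_min
        and r["distance_km"] <= max_distance_km
        and r["observers"] <= max_observers
    ]


def spatial_subsample(rows, seed=0):
    """Keep one randomly chosen checklist per grid cell."""
    rng = random.Random(seed)
    by_cell = {}
    for r in rows:
        by_cell.setdefault(r["cell"], []).append(r)
    # Iterate cells in sorted order so the result is reproducible.
    return [rng.choice(group) for _, group in sorted(by_cell.items())]


filtered = apply_effort_filters(checklists)
sampled = spatial_subsample(filtered)
print([r["id"] for r in filtered])
print([r["cell"] for r in sampled])
```

The effort fields (`duration_min`, `distance_km`, `observers`) stay on each retained record, so they remain available as covariates in a later modelling step. Each of these filters changes which rows survive, which is the kind of decision a Smallset Timeline is meant to make visible.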
Smallset Timeline for the eBird preprocessing steps recommended in
69
+
Strimas-Mackey et al. [2023] (see Section 6.2.1). Smallset selected with random sampling. Data
70
+
are not printed in snapshots, as per the eBird terms of use. The preprocessing script and
71
+
smallsets code for this figure are in <a href="#thesis">Lydia's Thesis</a> Appendix B.3.
72
+
</figcaption>
52
73
</figure>
53
74
54
75
@@ -93,6 +114,8 @@ a) shows dataset imbalance by gender. Plots b) and c) show group fairness measur
93
114
94
115
#### **Example 4: A widely-used dataset of software defects**
95
116
117
+
In the early 2000s, the NASA Metrics Data Program (MDP) released 13 datasets for software defect detection, which involves developing algorithms to predict bugs in source code.
@@ -114,6 +137,7 @@ a) shows dataset imbalance by gender. Plots b) and c) show group fairness measur
114
137
115
138
#### **FAQ** (detailed answers coming soon, new questions most welcome)
116
139
140
+
* _Is smallsets customizable?_ Yes, please see this detailed [user guide](https://lydialucchesi.github.io/smallsets/articles/smallsets.html).
117
141
* _Will smallsets automate data preprocessing?_ In short, no.
118
142
* _Is Python code supported?_ Yes, in IPython notebooks.
119
143
* _Will smallsets support preprocessing code across different scripts?_ Not yet.
@@ -124,5 +148,7 @@ a) shows dataset imbalance by gender. Plots b) and c) show group fairness measur
124
148
#### **Resources**
125
149
126
150
* [Smallset Timelines: A Visual Representation of Data Preprocessing Decisions](https://arxiv.org/abs/2206.04875), Lydia R. Lucchesi, Petra M. Kuhnert, Jenny L. Davis, and Lexing Xie, Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022
151
+
127
152
<h5 id="thesis"></h5>
153
+
128
154
* [Visualisation and Software to Communicate Data Preprocessing Decisions](https://lydialucchesi.github.io/thesis/thesis_LydiaLucchesi.pdf), Lydia R. Lucchesi, PhD Thesis, The Australian National University, 2024