
Commit faa2533

author
lexing xie
committed
mods to the smallset post
1 parent a6c853d commit faa2533


2 files changed

+30
-4
lines changed

content/post/smallset_timelines.md

+30-4
@@ -1,6 +1,6 @@
11
---
22
title: "Smallset Timelines for Communicating Data Preprocessing Decisions"
3-
description: "Data preprocessing is messy and nuanced but full of consequential decisions, a cartoon strip can be generated for your preprocessing to help understanding and reproduction."
3+
description: "Data preprocessing is messy and nuanced but full of consequential decisions; a `preprocessing cartoon strip` can be generated to help illustrate these decisions."
44
date: "2024-11-18"
55
draft: false
66
categories:
@@ -14,11 +14,12 @@ tags:
1414
##### Posted by _Lexing Xie_ and _Lydia Lucchesi_.
1515

1616
<br/>
17-
<figure class="asn-fig asn-left" style="max-width: 200px;">
18-
<img src="https://github.com/lydialucchesi/smallsets/blob/main/man/figures/hex_sticker.png">
17+
<figure class="asn-fig asn-left" style="max-width: 165px;">
18+
<img src="/img/smallset/hex_sticker.png">
1919
</figure>
20-
Smallset Timelines, and the associated [R package](https://cloud.r-project.org/web/packages/smallsets/index.html) [smallsets](https://lydialucchesi.github.io/smallsets/), facilitate visual documentation of data preprocessing.
20+
Smallset Timelines, and the associated <a href="https://cloud.r-project.org/web/packages/smallsets/index.html">R package</a> <a href="https://lydialucchesi.github.io/smallsets/">smallsets</a>, facilitate visual documentation of data preprocessing.
2121

22+
<p>
2223
<!--more-->
2324

2425
<br/>
@@ -47,8 +48,28 @@ We will conclude this overview with <a href="#notebook">an example notebook</a>
4748

4849
#### **Example 1: eBird Data in Citizen Science**
4950

51+
We examine the eBird database, a citizen science program with millions of bird
52+
sightings from across the globe [Sullivan et al., 2009]. Citizen scientists upload their bird
53+
sightings by completing an eBird checklist form. The form collects information about every bird observed during an observation period. As noted on the eBird website, to date the
54+
eBird data has been used in over 930 publications.
55+
56+
Johnston et al. [2021] recommend a series of best practices for using citizen science data.
57+
These recommendations are based on an eBird case study that explored the effects of different
58+
data preparations on statistical inference. The authors found that the combination of using
59+
complete checklists only, spatial subsampling, effort filters, and effort covariates produced
60+
the strongest modelling result. As a supplement to the study, Strimas-Mackey et al. [2023]
61+
produced the guide “Best Practices for Using eBird Data,” which provides a step-by-step
62+
implementation of the study’s recommendations in the R programming language.
63+
64+
5065
<figure class="asn-fig asn-left" style="max-width: 750px;">
5166
<img src="/img/smallset/ebird.png">
67+
<figcaption>
68+
Smallset Timeline for the eBird preprocessing steps recommended in Strimas-Mackey
69+
et al. [2023] (see Section 6.2.1). Smallset selected with random sampling. Data
70+
are not printed in snapshots, as per the eBird terms of use. The preprocessing script and
71+
smallsets code for this figure are in <a href="#thesis">Lydia's Thesis</a> Appendix B.3.
72+
</figcaption>
5273
</figure>
5374

5475

@@ -93,6 +114,8 @@ a) shows dataset imbalance by gender. Plots b) and c) show group fairness measur
93114

94115
#### **Example 4: A widely-used dataset of software defects**
95116

117+
In the early 2000s, the NASA Metrics Data Program (MDP) released 13 datasets for software defect detection, which involves developing algorithms to predict bugs in source code.
118+
96119
<figure class="asn-fig asn-left" style="max-width: 750px;">
97120
<img src="/img/smallset/gray_general.png">
98121
</figure>
@@ -114,6 +137,7 @@ a) shows dataset imbalance by gender. Plots b) and c) show group fairness measur
114137

115138
#### **FAQ** (detailed answers coming soon, new questions most welcome)
116139

140+
* _Is smallsets customizable?_ Yes, please see this detailed [user guide](https://lydialucchesi.github.io/smallsets/articles/smallsets.html).
117141
* _Will smallsets automate data-preprocessing?_ In short, no.
118142
* _Is Python code supported?_ Yes, in ipython notebooks.
119143
* _Will smallsets support preprocessing code across different scripts?_ Not yet.
@@ -124,5 +148,7 @@ a) shows dataset imbalance by gender. Plots b) and c) show group fairness measur
124148
#### **Resources**
125149

126150
* [Smallset Timelines: A Visual Representation of Data Preprocessing Decisions](https://arxiv.org/abs/2206.04875), Lydia R. Lucchesi, Petra M. Kuhnert, Jenny L. Davis, and Lexing Xie, Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022
151+
127152
<h5 id="thesis"></h5>
153+
128154
* [Visualisation and Software to Communicate Data Preprocessing Decisions](https://lydialucchesi.github.io/thesis/thesis_LydiaLucchesi.pdf), Lydia R. Lucchesi, PhD Thesis, The Australian National University, 2024

static/img/smallset/hex_sticker.png

97.7 KB