Commit a6c853d

Lexing Xie authored and committed
initial version of smallset post
1 parent 543e0c2 commit a6c853d

File tree

9 files changed

+128
-0
lines changed

content/post/smallset_timelines.md

+128
@@ -0,0 +1,128 @@
1+
---
2+
title: "Smallset Timelines for Communicating Data Preprocessing Decisions"
3+
description: "Data preprocessing is messy and nuanced but full of consequential decisions; a cartoon strip can be generated from your preprocessing code to help with understanding and reproduction."
4+
date: "2024-11-18"
5+
draft: false
6+
categories:
7+
- "research"
8+
tags:
9+
- "visualisation"
10+
- "data"
11+
- "integratedAI"
12+
---
13+
14+
##### Posted by _Lexing Xie_ and _Lydia Lucchesi_.
15+
16+
<br/>
17+
<figure class="asn-fig asn-left" style="max-width: 200px;">
18+
<img src="https://github.com/lydialucchesi/smallsets/blob/main/man/figures/hex_sticker.png?raw=true">
19+
</figure>
20+
Smallset Timelines, and the associated [R package](https://cloud.r-project.org/web/packages/smallsets/index.html) [smallsets](https://lydialucchesi.github.io/smallsets/), facilitate visual documentation of data preprocessing.
21+
22+
<!--more-->
23+
24+
<br/>
25+
26+
Data preprocessing is a crucial intermediate stage in quantitative data analysis. During this stage, data practitioners decide how to resolve dataset issues and transform, clean, and format the dataset(s). It
27+
can be a challenging stage, full of decisions that have the potential to influence analytical outcomes. Yet, data preprocessing is often treated as behind-the-scenes work and overlooked in research dissemination. This discrepancy between the practice and the presentation of data analytics
28+
limits how well research outputs can be replicated, interpreted, and used.
29+
30+
<br/>
31+
32+
The two central contributions in [Lydia's 2024 PhD Thesis](https://lydialucchesi.github.io/thesis/thesis_LydiaLucchesi.pdf) are Smallset Timelines and smallsets. The Smallset Timeline is a static
33+
and compact visualisation, documenting the sequence of decisions in a preprocessing pipeline;
34+
it is composed of small data snapshots of different preprocessing steps. The smallsets software builds a Smallset Timeline from a user’s data preprocessing script, which contains structured
35+
comments with snapshot instructions. Together, Smallset Timelines and smallsets are designed to support the production of accessible data preprocessing documentation.
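
To make the idea of structured comments concrete, here is a minimal Python sketch of how snapshot instructions could be pulled out of a preprocessing script. It is not the smallsets implementation. The comment pattern (`# smallsets start|snap|end <data name> caption[...]caption`) is an assumption modelled on the package documentation as we understand it, and `find_snapshots`, `Snapshot`, and `EXAMPLE_SCRIPT` are hypothetical names introduced only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed directive pattern (not confirmed by this post):
#   "# smallsets <start|snap|end> <data name> caption[ ... ]caption"
DIRECTIVE = re.compile(
    r"#\s*smallsets\s+(start|snap|end)\s+(\w+)\s+caption\[(.*?)\]caption",
    re.DOTALL,  # captions may continue over several comment lines
)


@dataclass
class Snapshot:
    kind: str       # "start", "snap", or "end"
    data_name: str  # name of the data object to snapshot
    caption: str    # text to display under the snapshot


def find_snapshots(script: str) -> List[Snapshot]:
    """Return the snapshot instructions found in a script, in order."""
    snapshots = []
    for kind, name, caption in DIRECTIVE.findall(script):
        # Join continuation lines: drop the newline and leading "#".
        caption = re.sub(r"\s*\n\s*#\s*", " ", caption).strip()
        snapshots.append(Snapshot(kind, name, caption))
    return snapshots


# Hypothetical preprocessing script, used only to exercise the parser.
EXAMPLE_SCRIPT = """\
# smallsets start df caption[Drop rows with a missing label.]caption
df = df.dropna(subset=["label"])

# smallsets snap df caption[Fill missing ages
# with the median age.]caption
df["age"] = df["age"].fillna(df["age"].median())

# smallsets end df caption[Add an age-in-months column.]caption
df["age_months"] = df["age"] * 12
"""

if __name__ == "__main__":
    for s in find_snapshots(EXAMPLE_SCRIPT):
        print(f"{s.kind:>5} | {s.data_name} | {s.caption}")
```

The preprocessing code itself stays untouched; the only additions are comments, which is what lets the documentation live alongside an existing script.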
36+
37+
This post illustrates these contributions with four examples, followed by an example notebook for one of them.
38+
39+
1. <a href="#EX1">eBird data in citizen science</a>
40+
1. <a href="#EX2">HMDA home loan data, reflecting nuances in defining and reporting on race</a>
41+
1. <a href="#EX3">Examining fairness in income classification from the American Community Survey</a>
42+
1. <a href="#EX4">NASA software defect data</a>
43+
44+
We will conclude this overview with <a href="#notebook">an example notebook</a> to illustrate the ease of using smallsets in existing data-preprocessing code, along with an <a href="#faq">FAQ</a>.
45+
46+
<h5 id="EX1"></h5>
47+
48+
#### **Example 1: eBird Data in Citizen Science**
49+
50+
<figure class="asn-fig asn-left" style="max-width: 750px;">
51+
<img src="/img/smallset/ebird.png">
52+
</figure>
53+
54+
55+
<br/>
56+
57+
<h5 id="EX2"></h5>
58+
59+
#### **Example 2: HMDA Home Loan Data - Nuances in Defining and Processing Race**
60+
61+
<figure class="asn-fig asn-left" style="max-width: 750px;">
62+
<img src="/img/smallset/hmda_A.png">
63+
</figure>
64+
65+
<figure class="asn-fig asn-left" style="max-width: 750px;">
66+
<img src="/img/smallset/hmda_B.png">
67+
</figure>
68+
69+
70+
<h5 id="EX3"></h5>
71+
72+
#### **Example 3: Examining Fairness in Income Classification**
73+
74+
75+
<figure class="asn-fig asn-left" style="max-width: 550px;">
76+
<img src="/img/smallset/acs.png">
77+
<figcaption>
78+
Smallset Timeline of ACS California data preprocessed with the validity-median
79+
setting. Smallset selected with random sampling. The preprocessing script and smallsets
80+
code for this figure are in the code section <a href="#notebook">below</a>.
81+
</figcaption>
82+
</figure>
83+
84+
<figure class="asn-fig asn-left" style="max-width: 550px;">
85+
<img src="/img/smallset/fairness.png">
86+
<figcaption>
87+
The effect of four different preprocessing settings on data and prediction. Plot
88+
a) shows dataset imbalance by gender. Plots b) and c) show group fairness measures in predictions from a logistic regression model. Error bars refer to 95% Newcombe intervals.
89+
</figcaption>
90+
</figure>
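
The caption above mentions 95% Newcombe intervals. As a rough illustration, the sketch below computes a Newcombe hybrid-score interval for the difference between two group rates (for example, a positive-prediction rate in each of two groups). Which Newcombe method the figure uses is not stated in this post, so the choice of the hybrid-score method for a difference of proportions is an assumption, and the function names are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a single proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def newcombe_interval(
    x1: int, n1: int, x2: int, n2: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Difference p1 - p2 with a Newcombe hybrid-score interval.

    Assumption: the interval is built by combining the Wilson limits
    of the two groups; z = 1.96 gives an approximate 95% interval.
    """
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, z)
    l2, u2 = wilson_interval(x2, n2, z)
    diff = p1 - p2
    lower = diff - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return diff, lower, upper


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    diff, lower, upper = newcombe_interval(x1=120, n1=400, x2=90, n2=380)
    print(f"difference = {diff:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```

Unlike a simple Wald interval, this construction never extends beyond the range from -1 to 1, which is one reason it is often preferred for rate differences.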
91+
92+
<h5 id="EX4"></h5>
93+
94+
#### **Example 4: A Widely Used Dataset of Software Defects**
95+
96+
<figure class="asn-fig asn-left" style="max-width: 750px;">
97+
<img src="/img/smallset/gray_general.png">
98+
</figure>
99+
100+
101+
<h5 id="notebook"></h5>
102+
103+
#### **Example notebook for the fairness example**
104+
105+
<figure class="asn-fig asn-left" style="max-width: 550px;">
106+
<img src="/img/smallset/notebook1.png">
107+
</figure>
108+
109+
<figure class="asn-fig asn-left" style="max-width: 550px;">
110+
<img src="/img/smallset/notebook2.png">
111+
</figure>
112+
113+
<h5 id="faq"></h5>
114+
115+
#### **FAQ** (detailed answers coming soon, new questions most welcome)
116+
117+
* _Will smallsets automate data-preprocessing?_ In short, no.
118+
* _Is Python code supported?_ Yes, in IPython notebooks.
119+
* _Will smallsets support preprocessing code across different scripts?_ Not yet.
120+
* _Will smallsets support word embeddings, large language models and the like?_ Not yet; let us know what you think is important to support.
121+
122+
<br/>
123+
124+
#### **Resources**
125+
126+
* [Smallset Timelines: A Visual Representation of Data Preprocessing Decisions](https://arxiv.org/abs/2206.04875), Lydia R. Lucchesi, Petra M. Kuhnert, Jenny L. Davis, and Lexing Xie, Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022
127+
<h5 id="thesis"></h5>
128+
* [Visualisation and Software to Communicate Data Preprocessing Decisions](https://lydialucchesi.github.io/thesis/thesis_LydiaLucchesi.pdf), Lydia R. Lucchesi, PhD Thesis, The Australian National University, 2024

static/img/smallset/acs.png (126 KB)
static/img/smallset/ebird.png (198 KB)
static/img/smallset/fairness.png (87.5 KB)
static/img/smallset/gray_general.pdf (36.5 KB, binary file not shown)
static/img/smallset/hmda_A.png (130 KB)
static/img/smallset/hmda_B.png (148 KB)
static/img/smallset/notebook1.png (56.9 KB)
static/img/smallset/notebook2.png (79.8 KB)
