Commit 6de3b2d

add new example: CSV Dataset Summarizer
1 parent 8e4bc53 commit 6de3b2d

7 files changed

Lines changed: 290 additions & 0 deletions

docs/astro.config.mjs

Lines changed: 1 addition & 0 deletions
@@ -110,6 +110,7 @@ export default defineConfig({
  { label: 'Examples', slug: 'examples' },
  { label: 'Ollama', autogenerate: { directory: 'examples/ollama' } },
  { label: 'Chat', autogenerate: { directory: 'examples/chat' } },
+ { label: 'Utility Apps', autogenerate: { directory: 'examples/utility-apps' } },
  ],
  },
  {
Binary image files not shown (sizes: 46.5 KB, 11.9 KB, 18.2 KB, 6.38 KB, 75.8 KB)
Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
1+
---
2+
title: CSV Dataset Summarizer
3+
description: Learn how to build a web app that accepts a CSV upload and instantly shows key dataset stats — row count, missing values, duplicates, column types, and an interactive data preview.
4+
---
5+
6+
import { Code } from '@astrojs/starlight/components';
7+
import { Aside } from '@astrojs/starlight/components';
8+
import { Image } from 'astro:assets';
9+
10+
import indicatorsImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/indicators-view.webp"
11+
import columnTypeSummaryImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/column-type-summary.webp"
12+
import skrubDataPreviewImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/skrub-data-preview.webp"
13+
import appWithoutDatasetImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/app-without-dataset.webp"
14+
import appOverviewImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/app-overview.webp"
15+
16+
17+
In this tutorial we will build a **CSV dataset summarizer** — a web app where you upload any CSV file and immediately get a structured overview of what's inside.
18+
19+
<Image
20+
src={appOverviewImg}
21+
alt="CSV Dataset Summarizer app in Mercury — indicators, column type table, and interactive data preview visible after uploading a CSV file"
22+
class="docs-image-100"
23+
/>
24+
25+
No code, no terminal, no pandas knowledge required from the user. Just upload and read.
26+
27+
We will use:
28+
29+
- **pandas** for reading and analysing the CSV
30+
- **skrub** for the interactive data preview table
31+
- **Mercury** to turn the notebook into a web app
32+
33+
The full notebook code is available in our [GitHub repository](https://github.com/mljar/mercury-notebook-apps/tree/main/summarize-dataset).
34+
35+
You can also try the [live demo](https://utility-apps.ismvp.org/mercury/summarize-dataset).
36+
37+
## What the app shows
38+
39+
After the user uploads a CSV, the app displays three sections:
40+
41+
1. **Key indicators** — rows, columns, duplicate rows (with %), missing values (with %)
42+
2. **Column type summary** — a table breaking down how many columns are numeric, categorical, text, datetime, boolean, or constant
43+
3. **Data preview** — an interactive table with the first 15 rows, powered by skrub's `TableReport`
44+
45+
## 1. Install packages
46+
47+
```bash
48+
pip install mercury pandas skrub
49+
```
50+
51+
## 2. Import libraries
52+
53+
```python
54+
import mercury as mr
55+
from IPython.display import clear_output
56+
```
57+
58+
<Aside>
59+
`clear_output` from IPython is used to hide empty cells when no file has been uploaded yet, keeping the app clean on first load.
60+
</Aside>
61+
62+
## 3. Add a welcome message
63+
64+
```python
65+
welcome_md = mr.Markdown("# Upload the dataset and check the summary")
66+
```
67+
68+
This message is shown before the user uploads anything. Once a file is uploaded, it gets replaced by the actual summary title.
69+
70+
<Image
71+
src={appWithoutDatasetImg}
72+
alt="CSV Dataset Summarizer app before upload — welcome heading and file upload button visible in the sidebar"
73+
class="docs-image-100"
74+
/>
75+
76+
## 4. Add the file upload widget
77+
78+
```python
79+
input_file = mr.UploadFile(label="Upload your Dataset", accept='.csv', max_file_size='1GB')
80+
```
81+
82+
[`UploadFile`](/docs/widgets/uploadfile/) renders a file picker in the app sidebar.
83+
We restrict it to `.csv` files and allow up to 1 GB.
84+
85+
When the user picks a file:
86+
- `input_file.name` becomes the filename (non-empty string)
87+
- `input_file.value` contains the raw file bytes
88+
89+
Mercury automatically re-runs the notebook when the upload changes, so all cells below react immediately.
90+
91+
## 5. Read the CSV
92+
93+
```python
if input_file.name is not None:
    import pandas as pd
    from io import BytesIO
    from skrub import TableReport

    # Wrap the uploaded bytes in a file-like object and parse them as CSV.
    data = BytesIO(input_file.value)
    df = pd.read_csv(data)
```
101+
102+
We wrap the raw bytes in `BytesIO` so pandas can read them directly, without saving the file to disk first.
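
To see the same idea outside Mercury, here is a minimal sketch that builds the bytes by hand instead of taking them from `input_file.value`. The `raw_bytes` contents and the `sample_df` name are made up for illustration.

```python
from io import BytesIO

import pandas as pd

# Stand-in for input_file.value: the raw bytes of an uploaded CSV (made-up data).
raw_bytes = b"city,population\nKrakow,800000\nGdansk,470000\n"

# BytesIO gives pandas a file-like object, so nothing is written to disk.
sample_df = pd.read_csv(BytesIO(raw_bytes))

print(sample_df.shape)  # (2, 2)
```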
103+
104+
## 6. Compute the summary statistics
105+
106+
```python
if input_file.name is not None:
    # shape
    row_count = df.shape[0]
    col_count = df.shape[1]

    # missing values: total count and share of all cells
    missing_count = df.isna().sum().sum()
    missing_procent = df.isna().mean().mean() * 100

    # duplicates: fall back to 0.0 when the file has no rows
    duplicates_count = df.duplicated().sum()
    duplicates_procent = (duplicates_count / row_count) * 100 if row_count else 0.0
```
120+
121+
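`df.isna().mean().mean()` averages the per-column missing rates. Because every column has the same number of rows, that equals the share of missing cells in the whole table. We keep both the raw counts and percentages so we can show both in the indicators.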
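
As a quick sanity check of these formulas, the sketch below runs them on a tiny hand-made DataFrame. The `demo` data and the `demo_*` variable names are invented for this example and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up data: 4 rows, 2 columns, one repeated row and two missing cells.
demo = pd.DataFrame({
    "a": [1.0, 1.0, np.nan, 4.0],
    "b": ["x", "x", "y", None],
})

demo_missing_count = int(demo.isna().sum().sum())       # 2 missing cells
demo_missing_pct = demo.isna().mean().mean() * 100      # 2 of 8 cells
demo_duplicates = int(demo.duplicated().sum())          # second row repeats the first
demo_duplicates_pct = demo_duplicates / len(demo) * 100

print(demo_missing_count, demo_missing_pct, demo_duplicates, demo_duplicates_pct)
# 2 25.0 1 25.0
```

Note that `duplicated()` marks only the repeats, not the first occurrence, so two identical rows count as one duplicate.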
122+
123+
## 7. Detect column types
124+
125+
This is the most interesting part of the app. We go beyond pandas' built-in dtypes to catch edge cases.
126+
127+
```python
128+
# basic pandas types
129+
numeric_columns = df.select_dtypes(include=["number"]).columns.tolist()
130+
boolean_columns = df.select_dtypes(include=["bool"]).columns.tolist()
131+
datetime_columns = df.select_dtypes(include=["datetime", "datetimetz"]).columns.tolist()
132+
object_columns = df.select_dtypes(include=["object", "category"]).columns.tolist()
133+
```
134+
135+
### Detecting dates stored as text
136+
137+
A very common real-world problem: date columns that look like `"2024-01-15"` but are stored as plain strings. Pandas reads them as `object` dtype, so `select_dtypes` does not report them as datetime. We catch them manually:
138+
139+
```python
detected_datetime_columns = []
for col in object_columns:
    # Values that cannot be parsed become NaT instead of raising an error.
    converted = pd.to_datetime(df[col], errors="coerce")
    valid_ratio = converted.notna().mean()
    if valid_ratio > 0.8:
        detected_datetime_columns.append(col)

# Merge with the columns pandas already typed as datetime, without duplicates.
datetime_columns = list(set(datetime_columns + detected_datetime_columns))
```
149+
150+
If more than 80% of values in a column parse as a valid date, we treat it as datetime. Missing values count as unparsed, so a date column that is 20% or more empty will not be detected.
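
The threshold check can be tried in isolation. The sketch below wraps the same logic in a helper; the name `looks_like_datetime` and the sample Series are hypothetical and only serve to illustrate the 80% rule. It also silences the warning pandas may emit when it cannot infer one date format for a column.

```python
import warnings

import pandas as pd


def looks_like_datetime(values: pd.Series, threshold: float = 0.8) -> bool:
    """Return True if more than `threshold` of the values parse as dates."""
    with warnings.catch_warnings():
        # pandas may warn when it cannot infer a single date format.
        warnings.simplefilter("ignore")
        converted = pd.to_datetime(values, errors="coerce")
    return bool(converted.notna().mean() > threshold)


dates = pd.Series(["2024-01-15"] * 9 + ["not a date"])  # 9 of 10 values parse
names = pd.Series(["Alice", "Bob", "Carol", "Dave"])     # nothing parses

print(looks_like_datetime(dates))  # True
print(looks_like_datetime(names))  # False
```

Passing a stricter threshold, for example `looks_like_datetime(dates, threshold=0.95)`, rejects the first column because only 90% of its values are valid dates.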
151+
152+
### Detecting free-text columns
153+
154+
Long string columns (descriptions, comments, addresses) shouldn't be counted as categorical. We separate them out by average string length:
155+
156+
```python
text_columns = []
for col in object_columns:
    if col not in datetime_columns:
        # Average length of the non-missing values, measured as strings.
        avg_text_length = df[col].dropna().astype(str).str.len().mean()
        if avg_text_length > 30:
            text_columns.append(col)
```
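
To get a feel for the length heuristic, here is a small sketch that applies it to two made-up columns. The helper name `is_free_text` and the sample values are assumptions for illustration only.

```python
import pandas as pd


def is_free_text(values: pd.Series, min_avg_length: float = 30) -> bool:
    """Return True if the average string length exceeds `min_avg_length`."""
    lengths = values.dropna().astype(str).str.len()
    return bool(lengths.mean() > min_avg_length)


countries = pd.Series(["Poland", "Germany", None, "France"])
comments = pd.Series([
    "The package arrived two days late and the box was damaged.",
    "Great value for the price, would definitely order again.",
])

print(is_free_text(countries))  # False
print(is_free_text(comments))   # True
```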
164+
165+
### Constant columns
166+
167+
Columns with only one unique value carry no information and are worth flagging:
168+
169+
```python
# dropna=False counts NaN as a value, so an all-empty column is constant too.
constant_columns = [
    col for col in df.columns
    if df[col].nunique(dropna=False) <= 1
]
```
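
The effect of `dropna=False` is easiest to see on a small example. The `demo` DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "country": ["PL", "PL", "PL"],        # one value: constant
    "empty": [np.nan, np.nan, np.nan],    # only NaN: also counts as constant
    "mostly_pl": ["PL", None, "PL"],      # "PL" plus a missing value: 2 distinct
    "id": [1, 2, 3],                      # all different
})

constant = [c for c in demo.columns if demo[c].nunique(dropna=False) <= 1]
print(constant)  # ['country', 'empty']
```

With the default `dropna=True`, the `mostly_pl` column would also be flagged, because its missing value would be ignored.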
175+
176+
### Categorical columns
177+
178+
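Everything that doesn't fall into numeric, boolean, datetime, or text is treated as categorical. Constant columns are not excluded here, so a constant column is also counted under its own type: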
179+
180+
```python
excluded_from_categorical = set(
    datetime_columns + text_columns + boolean_columns + numeric_columns
)
categorical_columns = [
    col for col in df.columns
    if col not in excluded_from_categorical
]
```
189+
190+
## 8. Build the output widgets
191+
192+
### Indicators
193+
194+
```python
ind_basic = mr.Indicator([
    mr.Indicator(value=row_count, label="Rows"),
    mr.Indicator(value=col_count, label="Columns"),
    mr.Indicator(value=duplicates_count, label="Duplicate Rows", delta=f"{duplicates_procent:.2f}%"),
    mr.Indicator(value=missing_count, label="Missing Values", delta=f"{missing_procent:.2f}%"),
])
```
202+
203+
[`Indicator`](/docs/widgets/indicator/) displays a big number with an optional delta label below it. Nesting multiple `Indicator` objects inside one renders them side by side.
204+
205+
<Image
206+
src={indicatorsImg}
207+
alt="Close-up of the four Indicator cards: Rows, Columns, Duplicate Rows (0, 0.00%), Missing Values (866, 8.07%)"
208+
class="docs-image-100"
209+
/>
210+
211+
### Column type table
212+
213+
```python
dane = {
    "Metric": ["Numeric Columns", "Categorical Columns", "Text Columns",
               "Datetime Columns", "Boolean Columns", "Constant Columns"],
    "Value": [len(numeric_columns), len(categorical_columns), len(text_columns),
              len(datetime_columns), len(boolean_columns), len(constant_columns)],
    "Description": [
        "Columns with numeric values",
        "Columns with categorical values",
        "Columns with longer text values",
        "Columns detected as dates or timestamps",
        "Columns with true/false values",
        "Columns with only one unique value",
    ]
}
tabelka = mr.Table(dane)
```
230+
231+
[`Table`](/docs/widgets/table/) renders a plain dict or DataFrame as a clean HTML table.
232+
233+
<Image
234+
src={columnTypeSummaryImg}
235+
alt="Column Type Summary table with three columns: Metric, Value, and Description — listing numeric, categorical, text, datetime, boolean, and constant column counts"
236+
class="docs-image-100"
237+
/>
238+
239+
### Data preview
240+
241+
```python
242+
report = TableReport(df, n_rows=15)
243+
display(report)
244+
```
245+
246+
`TableReport` from [skrub](https://skrub-data.org/) renders an interactive table with per-column statistics, sortable headers, and value distributions. It does a lot of work for one line of code.
247+
248+
<Image
249+
src={skrubDataPreviewImg}
250+
alt="skrub TableReport showing an interactive data preview — sortable columns, per-column value distributions, and the first 15 rows of the uploaded CSV"
251+
class="docs-image-100"
252+
/>
253+
254+
## 9. Conditional display
255+
256+
The last cells use a simple pattern to show the right content depending on whether a file has been uploaded:
257+
258+
```python
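# title_md is not defined in the snippets above; it is assumed to be
# an mr.Markdown heading created the same way as welcome_md.
if input_file.name is None:
    display(welcome_md)
else:
    display(title_md)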
```
264+
265+
```python
if input_file.name is not None:
    display(ind_basic)
else:
    clear_output(wait=False)
```
271+
272+
The same pattern repeats for the column type table and data preview. `clear_output()` hides the cell output entirely when there's nothing to show, so the app doesn't leave empty gaps on first load.
273+
274+
## 10. Run as a web app
275+
276+
Start the Mercury server from the folder containing the notebook:
277+
278+
```bash
279+
mercury
280+
```
281+
282+
Mercury will detect all `*.ipynb` files and serve them as web applications.
283+
284+
## Notes and tips
285+
286+
- The 80% threshold for datetime detection works well in practice, but you can tune it for stricter or looser detection.
287+
- The text column threshold of 30 characters is a heuristic. Short categoricals like country names average well under 30; free-text fields like comments average well above.
288+
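- `TableReport` can be slow on very large files. Consider sampling first, for example `df = df.sample(min(len(df), 10_000))`, if you expect datasets with hundreds of thousands of rows. The `min` guard matters because `df.sample` raises an error when asked for more rows than the DataFrame has.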
289+
- To support Excel files as well, set `accept='.csv,.xlsx'` on `UploadFile` and handle both formats with `pd.read_csv` / `pd.read_excel` based on the extension of `input_file.name`. Reading `.xlsx` files with pandas also requires the `openpyxl` package.
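
One possible way to handle both formats is a small dispatch helper like the sketch below. The function name `read_uploaded_table` is hypothetical, and the example call uses made-up bytes rather than a real upload.

```python
from io import BytesIO
from pathlib import Path

import pandas as pd


def read_uploaded_table(filename: str, raw_bytes: bytes) -> pd.DataFrame:
    """Hypothetical helper: pick a pandas reader from the file extension."""
    suffix = Path(filename).suffix.lower()
    buffer = BytesIO(raw_bytes)
    if suffix == ".csv":
        return pd.read_csv(buffer)
    if suffix == ".xlsx":
        # Reading .xlsx files requires the openpyxl package to be installed.
        return pd.read_excel(buffer)
    raise ValueError(f"Unsupported file type: {suffix or filename}")


table = read_uploaded_table("cities.CSV", b"city,population\nKrakow,800000\n")
print(table.shape)  # (1, 2)
```

In the notebook, the call would likely look like `df = read_uploaded_table(input_file.name, input_file.value)` in place of the `pd.read_csv` line from step 5.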
