|
| 1 | +--- |
| 2 | +title: CSV Dataset Summarizer |
| 3 | +description: Learn how to build a web app that accepts a CSV upload and instantly shows key dataset stats — row count, missing values, duplicates, column types, and an interactive data preview. |
| 4 | +--- |
| 5 | + |
| 6 | +import { Code } from '@astrojs/starlight/components'; |
| 7 | +import { Aside } from '@astrojs/starlight/components'; |
| 8 | +import { Image } from 'astro:assets'; |
| 9 | + |
| 10 | +import indicatorsImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/indicators-view.webp" |
| 11 | +import columnTypeSummaryImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/column-type-summary.webp" |
| 12 | +import skrubDataPreviewImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/skrub-data-preview.webp" |
| 13 | +import appWithoutDatasetImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/app-without-dataset.webp" |
| 14 | +import appOverviewImg from "../../../../assets/tutorials/utility-apps/summarize-dataset/app-overview.webp" |
| 15 | + |
| 16 | + |
| 17 | +In this tutorial we will build a **CSV dataset summarizer** — a web app where you upload any CSV file and immediately get a structured overview of what's inside. |
| 18 | + |
| 19 | +<Image |
| 20 | + src={appOverviewImg} |
| 21 | + alt="CSV Dataset Summarizer app in Mercury — indicators, column type table, and interactive data preview visible after uploading a CSV file" |
| 22 | + class="docs-image-100" |
| 23 | +/> |
| 24 | + |
| 25 | +No code, no terminal, no pandas knowledge required from the user. Just upload and read. |
| 26 | + |
| 27 | +We will use: |
| 28 | + |
| 29 | +- **pandas** for reading and analysing the CSV |
| 30 | +- **skrub** for the interactive data preview table |
| 31 | +- **Mercury** to turn the notebook into a web app |
| 32 | + |
| 33 | +The full notebook code is available in our [GitHub repository](https://github.com/mljar/mercury-notebook-apps/tree/main/summarize-dataset). |
| 34 | + |
| 35 | +You can also try the [live demo](https://utility-apps.ismvp.org/mercury/summarize-dataset). |
| 36 | + |
| 37 | +## What the app shows |
| 38 | + |
| 39 | +After the user uploads a CSV, the app displays three sections: |
| 40 | + |
| 41 | +1. **Key indicators** — rows, columns, duplicate rows (with %), missing values (with %) |
| 42 | +2. **Column type summary** — a table breaking down how many columns are numeric, categorical, text, datetime, boolean, or constant |
| 43 | +3. **Data preview** — an interactive table with the first 15 rows, powered by skrub's `TableReport` |
| 44 | + |
| 45 | +## 1. Install packages |
| 46 | + |
| 47 | +```bash |
| 48 | +pip install mercury pandas skrub |
| 49 | +``` |
| 50 | + |
| 51 | +## 2. Import libraries |
| 52 | + |
| 53 | +```python |
| 54 | +import mercury as mr |
| 55 | +from IPython.display import clear_output |
| 56 | +``` |
| 57 | + |
| 58 | +<Aside> |
| 59 | +`clear_output` from IPython is used to hide empty cells when no file has been uploaded yet, keeping the app clean on first load. |
| 60 | +</Aside> |
| 61 | + |
| 62 | +## 3. Add a welcome message |
| 63 | + |
| 64 | +```python |
| 65 | +welcome_md = mr.Markdown("# Upload the dataset and check the summary") |
| 66 | +``` |
| 67 | + |
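| 68 | +This message is shown before the user uploads anything. Once a file is uploaded, it is replaced by the summary title, a second `mr.Markdown` widget named `title_md` that is displayed in step 9 and can be defined the same way as `welcome_md`. |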
| 69 | + |
| 70 | +<Image |
| 71 | + src={appWithoutDatasetImg} |
| 72 | + alt="CSV Dataset Summarizer app before upload — welcome heading and file upload button visible in the sidebar" |
| 73 | + class="docs-image-100" |
| 74 | +/> |
| 75 | + |
| 76 | +## 4. Add the file upload widget |
| 77 | + |
| 78 | +```python |
| 79 | +input_file = mr.UploadFile(label="Upload your Dataset", accept='.csv', max_file_size='1GB') |
| 80 | +``` |
| 81 | + |
| 82 | +[`UploadFile`](/docs/widgets/uploadfile/) renders a file picker in the app sidebar. |
| 83 | +We restrict it to `.csv` files and allow up to 1 GB. |
| 84 | + |
| 85 | +When the user picks a file: |
| 86 | +- `input_file.name` becomes the filename (non-empty string) |
| 87 | +- `input_file.value` contains the raw file bytes |
| 88 | + |
| 89 | +Mercury automatically re-runs the notebook when the upload changes, so all cells below react immediately. |
| 90 | + |
| 91 | +## 5. Read the CSV |
| 92 | + |
| 93 | +```python |
| 94 | +if input_file.name is not None: |
| 95 | +    import pandas as pd |
| 96 | +    from io import BytesIO |
| 97 | +    from skrub import TableReport |
| 98 | +    data = BytesIO(input_file.value) |
| 99 | +    df = pd.read_csv(data) |
| 100 | +``` |
| 101 | + |
| 102 | +We wrap the raw bytes in `BytesIO` so pandas can read them directly, without saving the file to disk first. |
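The same in-memory pattern can be tried outside Mercury. The sketch below is only an illustration: `raw_bytes` is a made-up byte string standing in for `input_file.value`.

```python
from io import BytesIO

import pandas as pd

# Hypothetical stand-in for input_file.value: the raw bytes of a small CSV file.
raw_bytes = b"city,population\nOslo,709000\nBergen,289000\n"

# BytesIO gives pandas a file-like object, so nothing is written to disk.
df = pd.read_csv(BytesIO(raw_bytes))

print(df.shape)
print(df.columns.tolist())
```

Because `read_csv` only needs a file-like object, the upload never has to touch the server's file system.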
| 103 | + |
| 104 | +## 6. Compute the summary statistics |
| 105 | + |
| 106 | +```python |
| 107 | +if input_file.name is not None: |
| 108 | +    # shape |
| 109 | +    row_count = df.shape[0] |
| 110 | +    col_count = df.shape[1] |
| 111 | + |
| 112 | +    # missing values |
| 113 | +    missing_count = df.isna().sum().sum() |
| 114 | +    missing_procent = df.isna().mean().mean() * 100 |
| 115 | + |
| 116 | +    # duplicates |
| 117 | +    duplicates_count = df.duplicated().sum() |
| 118 | +    duplicates_procent = (duplicates_count / row_count) * 100 |
| 119 | +``` |
| 120 | + |
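| 121 | +Both missing-value figures count individual cells, not rows: `missing_count` is the number of empty cells and `missing_procent` is their share of all cells. Duplicates are counted per row, and the first occurrence of a repeated row is not counted. We keep both the raw counts and the percentages so we can show both in the indicators. |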
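A small worked example may help to check these formulas by hand. The table below is made up for illustration; it has eight cells, two of them missing, and one repeated row.

```python
import pandas as pd

# Made-up 4-row table: rows 0 and 1 are identical, and two cells are missing.
df = pd.DataFrame({
    "a": [1.0, 1.0, None, 4.0],
    "b": ["x", "x", "y", None],
})

missing_count = int(df.isna().sum().sum())
# Mean of the per-column missing ratios, as in the app.
missing_percent = df.isna().mean().mean() * 100
# Same figure computed directly: missing cells divided by all cells.
missing_percent_direct = missing_count / df.size * 100

duplicates_count = int(df.duplicated().sum())
duplicates_percent = duplicates_count / len(df) * 100

print(missing_count, missing_percent, missing_percent_direct)
print(duplicates_count, duplicates_percent)
```

The two missing-value percentages agree because every column has the same number of rows, so averaging the column ratios is the same as dividing by the total cell count.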
| 122 | + |
| 123 | +## 7. Detect column types |
| 124 | + |
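| 125 | +This is the most interesting part of the app. We go beyond pandas' built-in dtypes to catch edge cases. The snippets in this step and in step 8 continue inside the same `if input_file.name is not None:` block, which is why they are indented. |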
| 126 | + |
| 127 | +```python |
| 128 | +    # basic pandas types |
| 129 | +    numeric_columns = df.select_dtypes(include=["number"]).columns.tolist() |
| 130 | +    boolean_columns = df.select_dtypes(include=["bool"]).columns.tolist() |
| 131 | +    datetime_columns = df.select_dtypes(include=["datetime", "datetimetz"]).columns.tolist() |
| 132 | +    object_columns = df.select_dtypes(include=["object", "category"]).columns.tolist() |
| 133 | +``` |
| 134 | + |
| 135 | +### Detecting dates stored as text |
| 136 | + |
| 137 | +A very common real-world problem: date columns that look like `"2024-01-15"` but are stored as plain strings. pandas reads them as `object` dtype, so the `select_dtypes` call above does not count them as datetime. We catch them manually: |
| 138 | + |
| 139 | +```python |
| 140 | +    detected_datetime_columns = [] |
| 141 | +    for col in object_columns: |
| 142 | +        converted = pd.to_datetime(df[col], errors="coerce") |
| 143 | +        valid_ratio = converted.notna().mean() |
| 144 | +        if valid_ratio > 0.8: |
| 145 | +            detected_datetime_columns.append(col) |
| 146 | + |
| 147 | +    datetime_columns = list(set(datetime_columns + detected_datetime_columns)) |
| 148 | +``` |
| 149 | + |
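| 150 | +If more than 80% of the values in a column parse as valid dates, we treat it as datetime. Missing values count as failed parses, so a sparsely filled date column can fall below the threshold. |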
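The check can also be tried in isolation. In the sketch below, `datetime_ratio` is a hypothetical helper (it is not part of the notebook) that wraps the same two pandas calls, and both sample columns are made up.

```python
import warnings

import pandas as pd


def datetime_ratio(series: pd.Series) -> float:
    """Share of all values (missing ones included) that parse as dates."""
    with warnings.catch_warnings():
        # pandas may warn when it cannot infer one date format; ignore that here.
        warnings.simplefilter("ignore")
        converted = pd.to_datetime(series, errors="coerce")
    return float(converted.notna().mean())


dates = pd.Series(["2024-01-15", "2024-02-01", "2024-03-10",
                   "2024-04-05", "2024-05-20", "n/a"])
fruits = pd.Series(["apple", "banana", "cherry"])

print(datetime_ratio(dates) > 0.8)   # five of six values parse
print(datetime_ratio(fruits) > 0.8)  # no value parses
```

With `errors="coerce"`, anything pandas cannot parse becomes `NaT` instead of raising, which is what makes a simple ratio test possible.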
| 151 | + |
| 152 | +### Detecting free-text columns |
| 153 | + |
| 154 | +Long string columns (descriptions, comments, addresses) shouldn't be counted as categorical. We separate them out by average string length: |
| 155 | + |
| 156 | +```python |
| 157 | +    text_columns = [] |
| 158 | +    for col in object_columns: |
| 159 | +        if col not in datetime_columns: |
| 160 | +            avg_text_length = df[col].dropna().astype(str).str.len().mean() |
| 161 | +            if avg_text_length > 30: |
| 162 | +                text_columns.append(col) |
| 163 | +``` |
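To get a feel for the 30-character cut-off, the length calculation can be run on two made-up columns. `average_length` is a hypothetical helper that repeats the expression from the loop above.

```python
import pandas as pd


def average_length(series: pd.Series) -> float:
    """Mean string length of the non-missing values."""
    return float(series.dropna().astype(str).str.len().mean())


countries = pd.Series(["Norway", "Poland", None, "Japan"])
comments = pd.Series([
    "The package arrived two days late and the box was damaged.",
    "Great support, my issue was solved within a few minutes.",
])

print(average_length(countries))      # (6 + 6 + 5) / 3
print(average_length(comments) > 30)
```

Missing values are dropped before measuring, so a mostly empty column is judged only by the values it actually contains.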
| 164 | + |
| 165 | +### Constant columns |
| 166 | + |
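| 167 | +Columns with only one unique value carry no information and are worth flagging. They are flagged in addition to their regular type, so the counts in the summary table can add up to more than the number of columns: |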
| 168 | + |
| 169 | +```python |
| 170 | +    constant_columns = [ |
| 171 | +        col for col in df.columns |
| 172 | +        if df[col].nunique(dropna=False) <= 1 |
| 173 | +    ] |
| 174 | +``` |
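The `dropna=False` argument decides how missing values are treated, which is easy to overlook. The made-up table below sketches the three cases; the column names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "same_value": [7, 7, 7],            # one distinct value
    "all_missing": [None, None, None],  # only missing values
    "value_or_missing": [5, 5, None],   # one value plus a missing cell
})

constant_columns = [
    col for col in df.columns
    if df[col].nunique(dropna=False) <= 1
]
print(constant_columns)

# With the default dropna=True the missing cell would be ignored.
print(df["value_or_missing"].nunique())
```

A fully empty column is flagged as constant, while a column that mixes one value with missing cells is not, because the missing marker counts as a second distinct value.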
| 175 | + |
| 176 | +### Categorical columns |
| 177 | + |
| 178 | +Everything that doesn't fall into numeric, boolean, datetime, or text: |
| 179 | + |
| 180 | +```python |
| 181 | +    excluded_from_categorical = set( |
| 182 | +        datetime_columns + text_columns + boolean_columns + numeric_columns |
| 183 | +    ) |
| 184 | +    categorical_columns = [ |
| 185 | +        col for col in df.columns |
| 186 | +        if col not in excluded_from_categorical |
| 187 | +    ] |
| 188 | +``` |
| 189 | + |
| 190 | +## 8. Build the output widgets |
| 191 | + |
| 192 | +### Indicators |
| 193 | + |
| 194 | +```python |
| 195 | +    ind_basic = mr.Indicator([ |
| 196 | +        mr.Indicator(value=row_count, label="Rows"), |
| 197 | +        mr.Indicator(value=col_count, label="Columns"), |
| 198 | +        mr.Indicator(value=duplicates_count, label="Duplicate Rows", delta=f"{duplicates_procent:.2f}%"), |
| 199 | +        mr.Indicator(value=missing_count, label="Missing Values", delta=f"{missing_procent:.2f}%"), |
| 200 | +    ]) |
| 201 | +``` |
| 202 | + |
| 203 | +[`Indicator`](/docs/widgets/indicator/) displays a big number with an optional delta label below it. Nesting multiple `Indicator` objects inside one renders them side by side. |
| 204 | + |
| 205 | +<Image |
| 206 | + src={indicatorsImg} |
| 207 | + alt="Close-up of the four Indicator cards: Rows, Columns, Duplicate Rows (0, 0.00%), Missing Values (866, 8.07%)" |
| 208 | + class="docs-image-100" |
| 209 | +/> |
| 210 | + |
| 211 | +### Column type table |
| 212 | + |
| 213 | +```python |
| 214 | +    dane = { |
| 215 | +        "Metric": ["Numeric Columns", "Categorical Columns", "Text Columns", |
| 216 | +                   "Datetime Columns", "Boolean Columns", "Constant Columns"], |
| 217 | +        "Value": [len(numeric_columns), len(categorical_columns), len(text_columns), |
| 218 | +                  len(datetime_columns), len(boolean_columns), len(constant_columns)], |
| 219 | +        "Description": [ |
| 220 | +            "Columns with numeric values", |
| 221 | +            "Columns with categorical values", |
| 222 | +            "Columns with longer text values", |
| 223 | +            "Columns detected as dates or timestamps", |
| 224 | +            "Columns with true/false values", |
| 225 | +            "Columns with only one unique value", |
| 226 | +        ] |
| 227 | +    } |
| 228 | +    tabelka = mr.Table(dane) |
| 229 | +``` |
| 230 | + |
| 231 | +[`Table`](/docs/widgets/table/) renders a plain dict or DataFrame as a clean HTML table. |
| 232 | + |
| 233 | +<Image |
| 234 | + src={columnTypeSummaryImg} |
| 235 | + alt="Column Type Summary table with three columns: Metric, Value, and Description — listing numeric, categorical, text, datetime, boolean, and constant column counts" |
| 236 | + class="docs-image-100" |
| 237 | +/> |
| 238 | + |
| 239 | +### Data preview |
| 240 | + |
| 241 | +```python |
| 242 | +    report = TableReport(df, n_rows=15) |
| 243 | +    display(report) |
| 244 | +``` |
| 245 | + |
| 246 | +`TableReport` from [skrub](https://skrub-data.org/) renders an interactive table with per-column statistics, sortable headers, and value distributions. It does a lot of work for one line of code. |
| 247 | + |
| 248 | +<Image |
| 249 | + src={skrubDataPreviewImg} |
| 250 | + alt="skrub TableReport showing an interactive data preview — sortable columns, per-column value distributions, and the first 15 rows of the uploaded CSV" |
| 251 | + class="docs-image-100" |
| 252 | +/> |
| 253 | + |
| 254 | +## 9. Conditional display |
| 255 | + |
| 256 | +The last cells use a simple pattern to show the right content depending on whether a file has been uploaded: |
| 257 | + |
| 258 | +```python |
| 259 | +if input_file.name is None: |
| 260 | +    display(welcome_md) |
| 261 | +else: |
| 262 | +    display(title_md) |
| 263 | +``` |
| 264 | + |
| 265 | +```python |
| 266 | +if input_file.name is not None: |
| 267 | +    display(ind_basic) |
| 268 | +else: |
| 269 | +    clear_output(wait=False) |
| 270 | +``` |
| 271 | + |
| 272 | +The same pattern repeats for the column type table and data preview. `clear_output()` hides the cell output entirely when there's nothing to show, so the app doesn't leave empty gaps on first load. |
| 273 | + |
| 274 | +## 10. Run as a web app |
| 275 | + |
| 276 | +Start the Mercury server from the folder containing the notebook: |
| 277 | + |
| 278 | +```bash |
| 279 | +mercury |
| 280 | +``` |
| 281 | + |
| 282 | +Mercury will detect all `*.ipynb` files and serve them as web applications. |
| 283 | + |
| 284 | +## Notes and tips |
| 285 | + |
| 286 | +- The 80% threshold for datetime detection is a heuristic: raise it for stricter detection, or lower it to accept columns with more unparseable values. |
| 287 | +- The text column threshold of 30 characters is a heuristic. Short categoricals like country names average well under 30; free-text fields like comments average well above. |
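| 288 | +- `TableReport` can be slow on very large files. If you expect datasets with hundreds of thousands of rows, consider sampling first with `df = df.sample(min(len(df), 10_000))`; the `min` keeps small files from raising an error, since `sample` cannot draw more rows than the table has. |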
| 289 | +- To support Excel files as well, change `accept` to `'.csv,.xlsx'` in `UploadFile` and pick `pd.read_csv` or `pd.read_excel` based on the extension of `input_file.name`. Reading `.xlsx` files with pandas also requires an Excel engine such as `openpyxl`. |
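
The Excel tip could be implemented roughly as follows. This is a sketch under assumptions: `read_uploaded_table` is a hypothetical helper (not part of the notebook or of Mercury), and the sample bytes are made up.

```python
from io import BytesIO
from pathlib import Path

import pandas as pd


def read_uploaded_table(name: str, raw_bytes: bytes) -> pd.DataFrame:
    """Pick a pandas reader from the file extension (hypothetical helper)."""
    suffix = Path(name).suffix.lower()
    buffer = BytesIO(raw_bytes)
    if suffix == ".csv":
        return pd.read_csv(buffer)
    if suffix == ".xlsx":
        # Needs an Excel engine such as openpyxl to be installed.
        return pd.read_excel(buffer)
    raise ValueError(f"Unsupported file type: {suffix or 'no extension'}")


# The extension check is case-insensitive, so "cities.CSV" is read as CSV.
df = read_uploaded_table("cities.CSV", b"city,population\nOslo,709000\n")
print(df.shape)
```

In the app, the call would take `input_file.name` and `input_file.value` in place of the sample arguments.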