Commit d79e5e9

Shane Butler and claude committed
feat: add first-run hardening — download scripts, data tiers, UX detection
- Add scripts/download-data.sh (sample + full modes, SHA256 verification)
- Add scripts/build-duckdb.sh (Python or CLI, builds local .duckdb)
- Add scripts/setup.sh (venv creation, dependency install, verification)
- Add data/checksums.sha256 for download integrity verification
- Update data/novamart/README.md with tier documentation
- Add Tier 2 data detection to Knowledge Bootstrap skill
- Add MCP + settings detection to First-Run Welcome skill
- Create GitHub Release v1.0.0 with sample (~1MB) and full (~113MB) tarballs

Tests: 182 passed, 1 failed (sample-data referential integrity), 10 warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1519275 commit d79e5e9

7 files changed

Lines changed: 542 additions & 5 deletions

.claude/skills/first-run-welcome/skill.md

Lines changed: 17 additions & 1 deletion
@@ -13,7 +13,7 @@ onboarding flow based on what data is available.
 
 ### Step 1: Detect environment
 
-Check three things:
+Check five things:
 
 1. **User profile exists?** → `.knowledge/user/profile.md`
    - If YES: This is a returning user. Skip this skill entirely.
@@ -27,6 +27,22 @@ Check three things:
    - If YES and not "novamart": User has their own data connected.
    - If NO: No dataset configured yet.
 
+4. **MCP settings configured?** → `.claude/mcp.json`
+   - If MISSING: Show setup hint:
+     ```
+     MCP not configured yet. To connect to MotherDuck:
+     cp .claude/mcp.json.example .claude/mcp.json
+     Then edit .claude/mcp.json and add your MotherDuck token.
+     See setup/mcp-config.md for details.
+     ```
+
+5. **Claude settings configured?** → `.claude/settings.local.json`
+   - If MISSING: Show setup hint:
+     ```
+     Tip: Copy the example settings to allow Marp slide rendering:
+     cp .claude/settings.local.json.example .claude/settings.local.json
+     ```
+
 ### Step 2: Present welcome based on scenario
 
 #### Scenario A: NovaMart present, no user data
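
Note on the detection step: the file-based checks in this hunk (items 1, 4, and 5) can be sketched in Python as below. This is a minimal illustration, not code from the repository. Checks 2 and 3 are not visible in the diff and are left out. The names `CHECKS`, `detect_environment`, and `setup_hints` are hypothetical.

```python
from pathlib import Path

# Paths come from the skill text; the dictionary keys are illustrative labels.
CHECKS = {
    "user_profile": ".knowledge/user/profile.md",
    "mcp_settings": ".claude/mcp.json",
    "claude_settings": ".claude/settings.local.json",
}


def detect_environment(repo_root="."):
    """Return {label: bool} telling whether each checked file exists."""
    root = Path(repo_root)
    return {label: (root / rel).is_file() for label, rel in CHECKS.items()}


def setup_hints(status):
    """List the copy commands for any example-based config that is missing."""
    hints = []
    if not status["mcp_settings"]:
        hints.append("cp .claude/mcp.json.example .claude/mcp.json")
    if not status["claude_settings"]:
        hints.append(
            "cp .claude/settings.local.json.example .claude/settings.local.json"
        )
    return hints
```

Calling `setup_hints(detect_environment())` from the repo root would return an empty list once both config files are in place.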

.claude/skills/knowledge-bootstrap/skill.md

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,29 @@ If the file is missing or empty:
2323
- If NovaMart data files exist in `data/novamart/`, set `active_dataset: novamart`
2424
- Otherwise, prompt: "No active dataset configured. Use `/connect-data` to add one."
2525

26+
### Step 1b: Check for Tier 2 data files
27+
28+
Check if the 5 large Tier 2 CSV files are present in `data/novamart/`:
29+
- `users.csv`, `orders.csv`, `events.csv`, `sessions.csv`, `support_tickets.csv`
30+
31+
**If any are missing**, display this message before proceeding:
32+
33+
```
34+
Some data files are not downloaded yet. The repo ships with 8 small
35+
reference tables but the 5 large analysis tables need to be downloaded.
36+
37+
Run this from the repo root:
38+
bash scripts/download-data.sh # Sample data (~15MB, good for learning)
39+
bash scripts/download-data.sh --full # Full dataset (~200MB compressed)
40+
41+
Missing files: {list of missing files}
42+
43+
You can still query the Tier 1 tables (products, calendar, experiments,
44+
promotions, memberships, nps_responses, experiment_assignments, order_items).
45+
```
46+
47+
Continue with bootstrap — don't halt. The system works with partial data.
48+
2649
### Step 2: Validate dataset brain
2750

2851
Check that `.knowledge/datasets/{active}/` contains:
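
Note on Step 1b: the presence check could look roughly like the sketch below. The file list is taken from the skill text. The function names `missing_tier2` and `tier2_notice` are hypothetical, and the skill itself is prose instructions rather than executable code.

```python
from pathlib import Path

# The five Tier 2 files listed in Step 1b.
TIER2_FILES = [
    "users.csv",
    "orders.csv",
    "events.csv",
    "sessions.csv",
    "support_tickets.csv",
]


def missing_tier2(data_dir="data/novamart"):
    """Return the Tier 2 file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in TIER2_FILES if not (base / name).is_file()]


def tier2_notice(missing):
    """Return the 'Missing files' line, or None when nothing is missing."""
    if not missing:
        return None
    return "Missing files: " + ", ".join(missing)
```

Because the skill says to continue rather than halt, a caller would print the notice when it is not `None` and then move on to Step 2.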

data/checksums.sha256

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
# SHA256 checksums for NovaMart Tier 2 data files
2+
# Used by scripts/download-data.sh to verify downloads
3+
# Generated: 2026-02-19
4+
5+
# Individual CSV files (full dataset)
6+
2af4c0906a287f873f6edfbc6d89bf3be950150013d4842de920abb8c46dc2e5 events.csv
7+
3680f494b7b0a492264a2f61f0bdb75f24c0566024f6db5a20a889bc773973e9 sessions.csv
8+
ce358b367a0038e22c179b36733549f3ff2e9d66d9838ad2238a098eab1b0d36 orders.csv
9+
073dd4e53c18acf29b714bb4946ea6b40f573fb4238bc6575cb8ee04aaa3784e users.csv
10+
06cdf487d931ebec65da1a7462e1e0f7a6bbcb0a7a0afabb6dcd3c6b06fea91b support_tickets.csv
11+
12+
# Release tarballs
13+
db3a4a6eac849ef9cc67c2d5a80891a7f77b9ab0dbb91af07e9fc753428f33f6 novamart-full.tar.gz
14+
10448ddd47a2e3e70fa266f0f60f81c0e43a3ee7aa1b7ee4680981b9b9c39bf8 novamart-sample.tar.gz
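
Note on verification: `scripts/download-data.sh` is not shown in this view, so the sketch below is only one plausible way to check files against a checksum list in this format (a hex digest, whitespace, then a file name, with `#` comments). The function names are hypothetical, and it is an assumption that the files to verify sit in a single directory.

```python
import hashlib
from pathlib import Path


def parse_checksums(text):
    """Map file name -> expected hex digest, skipping blanks and # comments."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(None, 1)
        # A leading '*' marks binary mode in sha256sum output; drop it.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large CSVs are never fully loaded into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(data_dir, checksums_text):
    """Return {file name: 'ok' | 'mismatch' | 'missing'} for each listed file."""
    results = {}
    for name, digest in parse_checksums(checksums_text).items():
        path = Path(data_dir) / name
        if not path.is_file():
            results[name] = "missing"
        else:
            results[name] = "ok" if sha256_of(path) == digest else "mismatch"
    return results
```

Chunked hashing matters here because `events.csv` is listed at roughly 551M uncompressed.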

data/novamart/README.md

Lines changed: 57 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -381,9 +381,63 @@ ORDER BY 2 DESC;
381381

382382
---
383383

384+
## Data Tiers
385+
386+
The dataset is split into two tiers to keep the repo small while providing full data when needed.
387+
388+
### Tier 1: Shipped with the repo (~4MB)
389+
390+
These 8 small reference/dimension tables are included in git:
391+
392+
| File | Rows | Size | Description |
393+
|------|------|------|-------------|
394+
| `calendar.csv` | 366 | 13K | 2024 calendar with holidays |
395+
| `experiments.csv` | 2 | <1K | A/B test definitions |
396+
| `promotions.csv` | 5 | <1K | Promotion definitions |
397+
| `products.csv` | 500 | 29K | Product catalog |
398+
| `nps_responses.csv` | ~8K | 335K | NPS survey responses |
399+
| `memberships.csv` | ~12K | 392K | Plus membership state changes |
400+
| `experiment_assignments.csv` | ~20K | 861K | A/B test assignments |
401+
| `order_items.csv` | ~120K | 2.4M | Order line items |
402+
403+
### Tier 2: Downloaded separately (~690MB)
404+
405+
These 5 large tables must be downloaded after cloning:
406+
407+
| File | Rows | Size | Description |
408+
|------|------|------|-------------|
409+
| `users.csv` | ~50K | 3M | User dimension table |
410+
| `orders.csv` | ~50K | 4.6M | Order records |
411+
| `support_tickets.csv` | ~25K | 2.2M | Customer support tickets |
412+
| `sessions.csv` | ~1.4M | 130M | Session summaries |
413+
| `events.csv` | ~6.5M | 551M | Behavioral events (largest) |
414+
415+
**Download Tier 2 data:**
416+
417+
```bash
418+
# Sample data (10K-row subsets, ~15MB) — good for learning
419+
bash scripts/download-data.sh
420+
421+
# Full dataset (~200MB compressed, ~690MB uncompressed)
422+
bash scripts/download-data.sh --full
423+
```
424+
425+
### Tier 3: Generated locally
426+
427+
| File | Description |
428+
|------|-------------|
429+
| `novamart.duckdb` | Pre-built DuckDB database (generated by `scripts/build-duckdb.sh`) |
430+
431+
```bash
432+
# Build after downloading CSVs
433+
bash scripts/build-duckdb.sh
434+
```
435+
436+
---
437+
384438
## File Inventory
385439

386-
This directory contains 14 CSV files and 1 DuckDB database file:
440+
The complete dataset contains 13 CSV files:
387441

388442
| File | Description |
389443
|------|-------------|
@@ -400,11 +454,10 @@ This directory contains 14 CSV files and 1 DuckDB database file:
400454
| `nps_responses.csv` | NPS survey responses |
401455
| `experiment_assignments.csv` | A/B test user assignments |
402456
| `calendar.csv` | 2024 calendar with holidays and day-of-week attributes |
403-
| `novamart_practice.duckdb` | Pre-built DuckDB database with all tables loaded |
404457

405-
To query the DuckDB file directly:
458+
To query a DuckDB database (after building it):
406459
```python
407460
import duckdb
408-
con = duckdb.connect('data/novamart/novamart_practice.duckdb', read_only=True)
461+
con = duckdb.connect('data/novamart/novamart.duckdb', read_only=True)
409462
con.sql("SELECT COUNT(*) FROM users").show()
410463
```
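
Note on the tiers: since `novamart.duckdb` is now generated locally rather than shipped, a caller may want to fall back to the CSV when the database has not been built. The helper below is a hypothetical sketch of that choice and uses only the standard library. It returns the text to place after `FROM` and does not run any query. It also assumes that a built database contains every table, which holds only if the CSV was present at build time.

```python
from pathlib import Path


def query_source(table, data_dir="data/novamart", db_name="novamart.duckdb"):
    """Return (kind, from_clause): the built database if present, else the CSV."""
    base = Path(data_dir)
    if (base / db_name).is_file():
        # Inside the database the table is addressed by its plain name.
        return ("duckdb", table)
    csv_path = base / f"{table}.csv"
    if csv_path.is_file():
        # DuckDB can scan a CSV directly through read_csv_auto.
        return ("csv", f"read_csv_auto('{csv_path.as_posix()}')")
    return ("missing", None)
```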

scripts/build-duckdb.sh

Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
#!/usr/bin/env bash
2+
# build-duckdb.sh — Build a local DuckDB database from NovaMart CSV files
3+
#
4+
# Usage:
5+
# bash scripts/build-duckdb.sh # Build from CSVs in data/novamart/
6+
# bash scripts/build-duckdb.sh --help # Show this help
7+
#
8+
# Creates data/novamart/novamart.duckdb with all tables loaded.
9+
# This is optional — Claude Code can query CSVs directly via DuckDB's
10+
# read_csv() function, but a pre-built .duckdb file is faster for
11+
# repeated queries.
12+
13+
set -euo pipefail
14+
15+
DATA_DIR="data/novamart"
16+
DB_FILE="${DATA_DIR}/novamart.duckdb"
17+
18+
RED='\033[0;31m'
19+
GREEN='\033[0;32m'
20+
YELLOW='\033[1;33m'
21+
NC='\033[0m'
22+
23+
usage() {
24+
echo "Usage: bash scripts/build-duckdb.sh [--help]"
25+
echo ""
26+
echo "Builds ${DB_FILE} from CSV files in ${DATA_DIR}/"
27+
echo ""
28+
echo "Prerequisites:"
29+
echo " - Python 3.9+ with duckdb package: pip install duckdb"
30+
echo " - OR DuckDB CLI: brew install duckdb (macOS)"
31+
}
32+
33+
# --- Main ---
34+
35+
if [[ "${1:-}" == "--help" ]] || [[ "${1:-}" == "-h" ]]; then
36+
usage
37+
exit 0
38+
fi
39+
40+
# Ensure we're in the repo root
41+
if [ ! -f "CLAUDE.md" ]; then
42+
echo -e "${RED}Error: Run this script from the AI Analyst repo root.${NC}"
43+
echo " cd ~/Desktop/ai-analyst && bash scripts/build-duckdb.sh"
44+
exit 1
45+
fi
46+
47+
# Check for CSV files
48+
if [ ! -d "$DATA_DIR" ] || [ -z "$(ls "$DATA_DIR"/*.csv 2>/dev/null)" ]; then
49+
echo -e "${RED}Error: No CSV files found in ${DATA_DIR}/${NC}"
50+
echo ""
51+
echo "Run the download script first:"
52+
echo " bash scripts/download-data.sh"
53+
exit 1
54+
fi
55+
56+
# Remove existing DB if present
57+
if [ -f "$DB_FILE" ]; then
58+
echo -e "${YELLOW}Removing existing ${DB_FILE}${NC}"
59+
rm -f "$DB_FILE"
60+
fi
61+
62+
echo "Building DuckDB database from CSV files..."
63+
echo ""
64+
65+
# Try Python+duckdb first, fall back to DuckDB CLI
66+
if python3 -c "import duckdb" 2>/dev/null; then
67+
python3 << 'PYEOF'
68+
import duckdb
69+
import os
70+
import glob
71+
72+
data_dir = "data/novamart"
73+
db_file = os.path.join(data_dir, "novamart.duckdb")
74+
75+
con = duckdb.connect(db_file)
76+
77+
csv_files = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
78+
loaded = 0
79+
80+
for csv_path in csv_files:
81+
table_name = os.path.splitext(os.path.basename(csv_path))[0]
82+
print(f" Loading {table_name}...", end="", flush=True)
83+
con.execute(f"CREATE TABLE {table_name} AS SELECT * FROM read_csv_auto('{csv_path}')")
84+
row_count = con.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
85+
print(f" {row_count:,} rows")
86+
loaded += 1
87+
88+
con.close()
89+
print(f"\nLoaded {loaded} tables into {db_file}")
90+
PYEOF
91+
92+
elif command -v duckdb &> /dev/null; then
93+
for csv_file in "$DATA_DIR"/*.csv; do
94+
table_name=$(basename "$csv_file" .csv)
95+
echo " Loading ${table_name}..."
96+
duckdb "$DB_FILE" "CREATE TABLE ${table_name} AS SELECT * FROM read_csv_auto('${csv_file}');"
97+
done
98+
echo ""
99+
echo "Tables loaded into ${DB_FILE}"
100+
101+
else
102+
echo -e "${RED}Error: Neither Python duckdb package nor DuckDB CLI found.${NC}"
103+
echo ""
104+
echo "Install one of:"
105+
echo " pip install duckdb # Python package"
106+
echo " brew install duckdb # macOS CLI"
107+
echo " apt install duckdb # Linux CLI"
108+
exit 1
109+
fi
110+
111+
# Report file size
112+
if [ -f "$DB_FILE" ]; then
113+
size=$(ls -lh "$DB_FILE" | awk '{print $5}')
114+
echo ""
115+
echo -e "${GREEN}DuckDB database ready: ${DB_FILE} (${size})${NC}"
116+
echo ""
117+
echo "Claude Code will automatically use this database for faster queries."
118+
fi
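
Note on table naming: both branches of the script derive the table name from the CSV file name and interpolate it into SQL. That works for the thirteen known files, but a stray CSV with spaces or punctuation in its name would produce invalid SQL. The sketch below shows one possible guard. It is not part of the commit, and `table_name_for` and `create_table_sql` are hypothetical names.

```python
import re
from pathlib import Path

# Plain SQL identifier: a letter or underscore, then letters, digits, underscores.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def table_name_for(csv_path):
    """Derive a table name from a CSV path, rejecting unsafe identifiers."""
    name = Path(csv_path).stem
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe table name derived from {csv_path!r}: {name!r}")
    return name


def create_table_sql(csv_path):
    """Build the CREATE TABLE statement used by the loader loop."""
    table = table_name_for(csv_path)
    # Double any single quote so the path stays inside the SQL string literal.
    escaped = str(csv_path).replace("'", "''")
    return f"CREATE TABLE {table} AS SELECT * FROM read_csv_auto('{escaped}')"
```

In the Python branch, the loop could pass `create_table_sql(csv_path)` to `con.execute` and skip, with a warning, any file that raises `ValueError`.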
