Commit 0b6859a

chore: improve seeding
1 parent 8eadb70 commit 0b6859a

11 files changed: +4762 -244 lines changed
.mcp.json

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+{
+  "mcpServers": {
+    "ruflo": {
+      "type": "stdio",
+      "command": "npx",
+      "args": ["-y", "ruflo@latest", "mcp", "start"]
+    }
+  }
+}
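
Review note: the `.mcp.json` added here registers one stdio server. As a rough sanity check, the sketch below shows how such a file could be turned into the argument vector a client would spawn. The helper name `launch_commands` is hypothetical and the config text is copied from the diff; this is an illustration under those assumptions, not how any particular MCP client reads the file.

```python
import json

# Config text copied from the .mcp.json diff above.
MCP_CONFIG = """
{
  "mcpServers": {
    "ruflo": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "ruflo@latest", "mcp", "start"]
    }
  }
}
"""


def launch_commands(config_text):
    """Map each stdio server name to the argv list implied by its entry.

    Hypothetical helper for illustration only.
    """
    config = json.loads(config_text)
    commands = {}
    for name, server in config.get("mcpServers", {}).items():
        # Skip entries that are explicitly not stdio servers.
        if server.get("type", "stdio") != "stdio":
            continue
        commands[name] = [server["command"], *server.get("args", [])]
    return commands


if __name__ == "__main__":
    for name, argv in launch_commands(MCP_CONFIG).items():
        print(name, " ".join(argv))
```

Because the entry pins `ruflo@latest`, the resolved package version can change between runs; pinning an explicit version would make the setup reproducible.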

dataset/README.md

Lines changed: 128 additions & 51 deletions
@@ -1,51 +1,128 @@
-# Dataset Overview
-
-This dataset contains a collection of Computer Science education activities from various sources. The activities are categorized by age range, duration, topic, and required resources.
-
-## Dataset Analysis
-
-**Total Activities:** 37
-
-### Activity Sources
-- Code.org: 10
-- CS Unplugged: 10
-- TUM LearnLabs: 8
-- Barefoot Computing: 5
-- Micro:bit Educational Foundation: 4
-
-### Average Duration Distribution
-- 16-30 mins: 12
-- 31-45 mins: 7
-- 46-60 mins: 18
-
-### Age Suitability Distribution
-- Age 6: 6
-- Age 7: 13
-- Age 8: 24
-- Age 9: 26
-- Age 10: 32
-- Age 11: 27
-- Age 12: 20
-- Age 13: 14
-- Age 14: 12
-- Age 15: 6
-
-### Bloom's Taxonomy Levels
-- Understand: 2
-- Apply: 7
-- Analyze: 11
-- Evaluate: 2
-- Create: 15
-
-### Topics Covered
-- Algorithms: 32
-- Patterns: 22
-- Abstraction: 17
-- Decomposition: 14
-
-### Resources Needed
-- Stationery: 34
-- Handouts: 27
-- Computers: 16
-- Electronics: 4
-- Blocks: 2
+# Dataset
+
+This directory contains the activity dataset and tooling used to seed the LEARN-Hub database.
+
+## Contents
+
+| File / Directory | Description |
+|---|---|
+| `dataset.csv` | Activity metadata and markdown content (camelCase columns) |
+| `pdfs/` | Source PDF files referenced by `filename` column in the CSV |
+| `seed.py` | CLI tool for seeding activities and exporting markdowns |
+
+## CSV columns
+
+### Metadata
+
+| Column | Type | Description |
+|---|---|---|
+| `filename` | string | PDF filename in `pdfs/` |
+| `name` | string | Activity title |
+| `source` | string | Origin / publisher |
+| `ageMin` | int | Minimum age |
+| `ageMax` | int | Maximum age |
+| `format` | string | `digital`, `unplugged`, or `hybrid` |
+| `resourcesNeeded` | string | Pipe-separated list of materials |
+| `bloomLevel` | string | Bloom's taxonomy level |
+| `durationMinMinutes` | int | Minimum duration in minutes |
+| `durationMaxMinutes` | int | Maximum duration in minutes |
+| `mentalLoad` | string | `low`, `medium`, or `high` |
+| `physicalEnergy` | string | `low`, `medium`, or `high` |
+| `prepTimeMinutes` | int | Preparation time in minutes |
+| `cleanupTimeMinutes` | int | Cleanup time in minutes |
+| `topics` | string | Pipe-separated list of topics |
+| `description` | string | Activity description |
+
+### Markdown content
+
+These columns are populated via `seed.py export` after activities have been seeded and markdowns generated on the server.
+
+| Column | Description |
+|---|---|
+| `deckblattMarkdown` | Cover page markdown |
+| `artikulationsschemaMarkdown` | Lesson structure (AVIVA+) markdown |
+| `hintergrundwissenMarkdown` | Background knowledge markdown |
+
+## Dataset statistics
+
+- **Total activities:** 37
+- **Sources:** Code.org (10), CS Unplugged (10), TUM LearnLabs (8), Barefoot Computing (5), Micro:bit (4)
+- **Bloom levels:** Create (15), Analyze (11), Apply (7), Understand (2), Evaluate (2)
+- **Topics:** Algorithms (32), Patterns (22), Abstraction (17), Decomposition (14)
+
+## Prerequisites
+
+```bash
+pip install requests
+```
+
+The LEARN-Hub server must be running and accessible.
+
+## Usage
+
+The `seed.py` script has two subcommands: `seed` and `export`.
+
+### Seeding activities
+
+Uploads PDFs and creates activities from the CSV. By default, markdowns are read from the CSV columns. If the CSV has no markdown content (or you want fresh generation), use `--generate-markdown`.
+
+```bash
+# Seed all activities using markdowns from CSV (default, fast)
+python seed.py seed --password <admin-password>
+
+# Regenerate markdowns via LLM API (slower, requires API key on server)
+python seed.py seed --password <admin-password> --generate-markdown
+
+# Seed specific activities only
+python seed.py seed --password <admin-password> --only barefoot-pizzaparty.pdf
+
+# Skip activities that already exist on the server
+python seed.py seed --password <admin-password> --skip-existing
+
+# Use a different server
+python seed.py --base-url https://learnhub-test.aet.cit.tum.de seed --password <admin-password>
+```
+
+### Exporting markdowns
+
+Fetches all activities from the server API and writes the generated markdown content back into the CSV. This is useful for persisting LLM-generated markdowns so future seeds can reuse them without calling the LLM again.
+
+```bash
+# Export markdowns into dataset.csv (overwrites in-place)
+python seed.py export
+
+# Write to a separate file instead
+python seed.py export -o dataset_backup.csv
+
+# Export from a different server
+python seed.py --base-url https://learnhub-test.aet.cit.tum.de export
+```
+
+### Typical workflow
+
+(1. & 2. if you want to re-generate with your model; 3. if you just want data)
+
+1. **First-time seed with LLM generation:**
+```bash
+python seed.py seed --password secret --generate-markdown
+```
+
+2. **Export the generated markdowns into the CSV:**
+```bash
+python seed.py export
+```
+
+3. **Future seeds reuse CSV markdowns (no LLM calls):**
+```bash
+python seed.py seed --password secret --skip-existing
+```
+
+### Environment variables
+
+All CLI flags can also be set via environment variables:
+
+| Variable | Default | Description |
+|---|---|---|
+| `SEED_BASE_URL` | `http://localhost:5001` | Server base URL |
+| `SEED_ADMIN_EMAIL` | `admin@learnhub.com` | Admin email |
+| `SEED_ADMIN_PASSWORD` | _(none)_ | Admin password |
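
Review note: the new README describes a global `--base-url` option, two subcommands, and `SEED_*` environment variables as fallbacks. The actual `seed.py` is not visible in this excerpt, so the following is only a minimal sketch of how such a command line might be wired with `argparse`. The flag names `--email` and `--output` are assumptions (the README shows only the variable name and the short `-o` form), as is the `build_parser` helper itself.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults fall back to SEED_* variables.

    Hypothetical helper; `env` is injectable so the fallback can be tested.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="seed.py")
    # Global option, given before the subcommand as in the README examples.
    parser.add_argument(
        "--base-url",
        default=env.get("SEED_BASE_URL", "http://localhost:5001"),
    )
    sub = parser.add_subparsers(dest="command", required=True)

    seed = sub.add_parser("seed")
    # "--email" is an assumed flag name; the README only names the variable.
    seed.add_argument(
        "--email", default=env.get("SEED_ADMIN_EMAIL", "admin@learnhub.com")
    )
    seed.add_argument("--password", default=env.get("SEED_ADMIN_PASSWORD"))
    seed.add_argument("--generate-markdown", action="store_true")
    seed.add_argument("--skip-existing", action="store_true")
    seed.add_argument("--only", nargs="+")

    export = sub.add_parser("export")
    # "--output" is an assumed long form of the documented "-o".
    export.add_argument("-o", "--output", default="dataset.csv")
    return parser


if __name__ == "__main__":
    parser = build_parser({"SEED_ADMIN_PASSWORD": "from-env"})
    args = parser.parse_args(["seed", "--skip-existing"])
    # prints: http://localhost:5001 from-env True
    print(args.base_url, args.password, args.skip_existing)
```

In this arrangement an explicit flag always wins over the environment, and the environment wins over the built-in default, which matches the precedence the README table implies.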
