1 | | -# Dataset Overview |
2 | | - |
3 | | -This dataset contains a collection of Computer Science education activities from various sources. The activities are categorized by age range, duration, topic, and required resources. |
4 | | - |
5 | | -## Dataset Analysis |
6 | | - |
7 | | -**Total Activities:** 37 |
8 | | - |
9 | | -### Activity Sources |
10 | | -- Code.org: 10 |
11 | | -- CS Unplugged: 10 |
12 | | -- TUM LearnLabs: 8 |
13 | | -- Barefoot Computing: 5 |
14 | | -- Micro:bit Educational Foundation: 4 |
15 | | - |
16 | | -### Average Duration Distribution |
17 | | -- 16-30 mins: 12 |
18 | | -- 31-45 mins: 7 |
19 | | -- 46-60 mins: 18 |
20 | | - |
21 | | -### Age Suitability Distribution |
22 | | -- Age 6: 6 |
23 | | -- Age 7: 13 |
24 | | -- Age 8: 24 |
25 | | -- Age 9: 26 |
26 | | -- Age 10: 32 |
27 | | -- Age 11: 27 |
28 | | -- Age 12: 20 |
29 | | -- Age 13: 14 |
30 | | -- Age 14: 12 |
31 | | -- Age 15: 6 |
32 | | - |
33 | | -### Bloom's Taxonomy Levels |
34 | | -- Understand: 2 |
35 | | -- Apply: 7 |
36 | | -- Analyze: 11 |
37 | | -- Evaluate: 2 |
38 | | -- Create: 15 |
39 | | - |
40 | | -### Topics Covered |
41 | | -- Algorithms: 32 |
42 | | -- Patterns: 22 |
43 | | -- Abstraction: 17 |
44 | | -- Decomposition: 14 |
45 | | - |
46 | | -### Resources Needed |
47 | | -- Stationery: 34 |
48 | | -- Handouts: 27 |
49 | | -- Computers: 16 |
50 | | -- Electronics: 4 |
51 | | -- Blocks: 2 |
| 1 | +# Dataset |
| 2 | + |
| 3 | +This directory contains the activity dataset and tooling used to seed the LEARN-Hub database. |
| 4 | + |
| 5 | +## Contents |
| 6 | + |
| 7 | +| File / Directory | Description | |
| 8 | +|---|---| |
| 9 | +| `dataset.csv` | Activity metadata and markdown content (camelCase columns) | |
| 10 | +| `pdfs/` | Source PDF files referenced by the `filename` column in the CSV |
| 11 | +| `seed.py` | CLI tool for seeding activities and exporting markdowns | |
| 12 | + |
| 13 | +## CSV columns |
| 14 | + |
| 15 | +### Metadata |
| 16 | + |
| 17 | +| Column | Type | Description | |
| 18 | +|---|---|---| |
| 19 | +| `filename` | string | PDF filename in `pdfs/` | |
| 20 | +| `name` | string | Activity title | |
| 21 | +| `source` | string | Origin / publisher | |
| 22 | +| `ageMin` | int | Minimum age | |
| 23 | +| `ageMax` | int | Maximum age | |
| 24 | +| `format` | string | `digital`, `unplugged`, or `hybrid` | |
| 25 | +| `resourcesNeeded` | string | Pipe-separated list of materials | |
| 26 | +| `bloomLevel` | string | Bloom's taxonomy level | |
| 27 | +| `durationMinMinutes` | int | Minimum duration in minutes | |
| 28 | +| `durationMaxMinutes` | int | Maximum duration in minutes | |
| 29 | +| `mentalLoad` | string | `low`, `medium`, or `high` | |
| 30 | +| `physicalEnergy` | string | `low`, `medium`, or `high` | |
| 31 | +| `prepTimeMinutes` | int | Preparation time in minutes | |
| 32 | +| `cleanupTimeMinutes` | int | Cleanup time in minutes | |
| 33 | +| `topics` | string | Pipe-separated list of topics | |
| 34 | +| `description` | string | Activity description | |
| 35 | + |
| 36 | +### Markdown content |
| 37 | + |
| 38 | +These columns are populated via `seed.py export` after activities have been seeded and markdowns generated on the server. |
| 39 | + |
| 40 | +| Column | Description | |
| 41 | +|---|---| |
| 42 | +| `deckblattMarkdown` | Cover page markdown | |
| 43 | +| `artikulationsschemaMarkdown` | Lesson structure (AVIVA+) markdown | |
| 44 | +| `hintergrundwissenMarkdown` | Background knowledge markdown | |
| 45 | + |
| 46 | +## Dataset statistics |
| 47 | + |
| 48 | +- **Total activities:** 37 |
| 49 | +- **Sources:** Code.org (10), CS Unplugged (10), TUM LearnLabs (8), Barefoot Computing (5), Micro:bit (4) |
| 50 | +- **Bloom levels:** Create (15), Analyze (11), Apply (7), Understand (2), Evaluate (2) |
| 51 | +- **Topics:** Algorithms (32), Patterns (22), Abstraction (17), Decomposition (14) |
| 52 | + |
| 53 | +## Prerequisites |
| 54 | + |
| 55 | +```bash |
| 56 | +pip install requests |
| 57 | +``` |
| 58 | + |
| 59 | +The LEARN-Hub server must be running and accessible. |
| 60 | + |
| 61 | +## Usage |
| 62 | + |
| 63 | +The `seed.py` script has two subcommands: `seed` and `export`. |
| 64 | + |
| 65 | +### Seeding activities |
| 66 | + |
| 67 | +Uploads PDFs and creates activities from the CSV. By default, markdowns are read from the CSV columns. If the CSV has no markdown content (or you want fresh generation), use `--generate-markdown`. |
| 68 | + |
| 69 | +```bash |
| 70 | +# Seed all activities using markdowns from CSV (default, fast) |
| 71 | +python seed.py seed --password <admin-password> |
| 72 | + |
| 73 | +# Regenerate markdowns via LLM API (slower, requires API key on server) |
| 74 | +python seed.py seed --password <admin-password> --generate-markdown |
| 75 | + |
| 76 | +# Seed specific activities only |
| 77 | +python seed.py seed --password <admin-password> --only barefoot-pizzaparty.pdf |
| 78 | + |
| 79 | +# Skip activities that already exist on the server |
| 80 | +python seed.py seed --password <admin-password> --skip-existing |
| 81 | + |
| 82 | +# Use a different server |
| 83 | +python seed.py --base-url https://learnhub-test.aet.cit.tum.de seed --password <admin-password> |
| 84 | +``` |
| 85 | + |
| 86 | +### Exporting markdowns |
| 87 | + |
| 88 | +Fetches all activities from the server API and writes the generated markdown content back into the CSV. This is useful for persisting LLM-generated markdowns so future seeds can reuse them without calling the LLM again. |
| 89 | + |
| 90 | +```bash |
| 91 | +# Export markdowns into dataset.csv (overwrites in-place) |
| 92 | +python seed.py export |
| 93 | + |
| 94 | +# Write to a separate file instead |
| 95 | +python seed.py export -o dataset_backup.csv |
| 96 | + |
| 97 | +# Export from a different server |
| 98 | +python seed.py --base-url https://learnhub-test.aet.cit.tum.de export |
| 99 | +``` |
| 100 | + |
| 101 | +### Typical workflow |
| 102 | + |
| 103 | +Steps 1 and 2 are only needed if you want to regenerate the markdowns with your own model; if you just want the data, step 3 is enough.
| 104 | + |
| 105 | +1. **First-time seed with LLM generation:** |
| 106 | + ```bash |
| 107 | + python seed.py seed --password secret --generate-markdown |
| 108 | + ``` |
| 109 | + |
| 110 | +2. **Export the generated markdowns into the CSV:** |
| 111 | + ```bash |
| 112 | + python seed.py export |
| 113 | + ``` |
| 114 | + |
| 115 | +3. **Future seeds reuse CSV markdowns (no LLM calls):** |
| 116 | + ```bash |
| 117 | + python seed.py seed --password secret --skip-existing |
| 118 | + ``` |
| 119 | + |
| 120 | +### Environment variables |
| 121 | + |
| 122 | +The following options can also be set via environment variables:
| 123 | + |
| 124 | +| Variable | Default | Description | |
| 125 | +|---|---|---| |
| 126 | +| `SEED_BASE_URL` | `http://localhost:5001` | Server base URL | |
| 127 | +| `SEED_ADMIN_EMAIL` | `admin@learnhub.com` | Admin email | |
| 128 | +| `SEED_ADMIN_PASSWORD` | _(none)_ | Admin password | |
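
The "CSV columns" section of the new README describes `resourcesNeeded` and `topics` as pipe-separated lists and several columns as integers. As a rough illustration of what that implies for anyone reading `dataset.csv` directly, here is a minimal sketch using only Python's standard `csv` module. The helper names `parse_row` and `load_activities` are hypothetical and are not taken from `seed.py`; the sample row is made up and is not a real dataset entry.

```python
import csv
import io

# Columns the README documents as pipe-separated lists.
LIST_COLUMNS = ("resourcesNeeded", "topics")
# Columns the README documents as integers.
INT_COLUMNS = (
    "ageMin", "ageMax",
    "durationMinMinutes", "durationMaxMinutes",
    "prepTimeMinutes", "cleanupTimeMinutes",
)


def parse_row(row):
    """Convert one raw CSV row (all strings) into typed Python values.

    Hypothetical helper: missing or empty cells become [] or None.
    """
    parsed = dict(row)
    for col in LIST_COLUMNS:
        raw = row.get(col) or ""
        parsed[col] = [item.strip() for item in raw.split("|") if item.strip()]
    for col in INT_COLUMNS:
        raw = (row.get(col) or "").strip()
        parsed[col] = int(raw) if raw else None
    return parsed


def load_activities(csv_file):
    """Read all activities from an open text file object containing the CSV."""
    return [parse_row(row) for row in csv.DictReader(csv_file)]


# Usage with a tiny in-memory sample (invented values, not real dataset rows).
sample = io.StringIO(
    "filename,name,ageMin,ageMax,topics,resourcesNeeded\n"
    "example.pdf,Example Activity,8,10,algorithms|patterns,stationery|handouts\n"
)
activities = load_activities(sample)
print(activities[0]["topics"])
```

For the real file, something like `open("dataset.csv", newline="", encoding="utf-8")` could be passed to `load_activities`. The UTF-8 encoding is an assumption, since the README does not state it.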