8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,13 @@
# Changelog

## 2026-04-25
- [BW]: Schools that do not have a Dienststellennummer change their ID
from an unstable UUID to a more stable hash of name, address, zip and city
(that is, schools that were `BW-UUID-<something>` become `BW-FB-<something>`).
⚠️ This breaks existing IDs.
Schools that are currently in the dataset with UUID-based IDs will be dropped
from the data published through the API and the CSV export.
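
For consumers who have stored BW IDs, the three ID shapes named in this entry can be told apart by prefix. The following is only a sketch: the function name `classify_bw_id` and the returned labels are hypothetical and not part of this project.

```python
def classify_bw_id(school_id: str) -> str:
    """Classify a school ID by the prefixes named in the changelog entry.

    Hypothetical helper: the name and the returned labels are illustrative.
    """
    if school_id.startswith("BW-UUID-"):
        # Old fallback; these IDs are dropped from the API and CSV export.
        return "legacy-uuid"
    if school_id.startswith("BW-FB-"):
        # New fallback based on a hash of name, address, zip and city.
        return "address-hash"
    if school_id.startswith("BW-"):
        # Remaining BW IDs carry the government-issued number.
        return "disch"
    return "other"


stored_ids = [
    "BW-04154817",
    "BW-UUID-00000a15-a965-4999-b9ad-05895eb0fad2",
]
# IDs that will no longer resolve after this change
stale = [i for i in stored_ids if classify_bw_id(i) == "legacy-uuid"]
```

The order of the checks matters, because every BW ID starts with `BW-`; the two longer prefixes have to be tested first.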

## 2025-05-12
- [SL]: Switch to data from Geoportal instead of web scraping. We do not get
contact details such as email and phone anymore but we (might) get more
7 changes: 5 additions & 2 deletions README.md
@@ -22,11 +22,11 @@ In details, the IDs are sourced as follows:

|State| ID-Source | example-id |stable|
|-----|--------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------|------|
|BW| DISCH (Dienststellenschlüssel) extracted from email, fallback to WFS UUID when not available | `BW-04154817` or `BW-UUID-00000a15-a965-4999-b9ad-05895eb0fad2` |✅ likely (~80% with DISCH, ~20% UUID fallback)|
|BW| DISCH (Dienststellenschlüssel) extracted from email, fallback to address hash when not available (see below) | `BW-04154817` or `BW-FB-e5c29cbf7215726b4f3515cfad6bee63e2a0bb8ded432a34e9e51c4324ec52ea` |✅ likely (~80% with DISCH, ~20% fallback)|
|BY| id from the WFS service | `BY-SCHUL_SCHULSTANDORTEGRUNDSCHULEN_2acb7d31-915d-40a9-adcf-27b38251fa48` |❓ unlikely (although we reached out to ask for canonical IDs to be published)|
|BE| Field `bsn` (Berliner Schulnummer) from the WFS Service | `BE-02K10` |✅ likely|
|BB| Field `schul_nr` (Schulnummer) from the WFS Service | `BB-111430` |✅ likely|
|HB| Field `snr_txt` (Schulnummer) from the INSPIRE shapefile - official 3-digit ID used in Bremen materials | `HB-002` |✅ likely|
|HB| Field `snr_txt` (Schulnummer) from the INSPIRE shapefile - official 3-digit ID used in Bremen materials | `HB-002` |✅ likely|
|HH| Field `schul_id` From the WFS Service | `HH-7910-0` |✅ likely|
|HE| `school_no` URL query param of the school's details page (identical to the Dienststellennummer) | `HE-4024` |✅ likely|
|MV| Field `dstnr` from the WFS | `MV-75130302` |✅ likely|
@@ -38,6 +38,9 @@ In details, the IDs are sourced as follows:
|ST| `OBJECTID` from the ArcGIS FeatureServer API (prefixed with `ARC`) | `ST-ARC00001` |❓ unlikely (OBJECTID may change on data reimport)|
|TH| `Schulnummer` from the WFS service | `TH-10601` |✅ likely|

For Baden-Württemberg, not all schools have a Dienststellenschlüssel that we can extract. For those that don't,
we join name, address, zip and city with a single space (" ") between each part and generate the SHA-256 hash of the result.
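
The fallback described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `bw_fallback_id` and the sample school are made up, and treating missing fields as empty strings mirrors the `or ""` handling in the spider's `normalize` method.

```python
import hashlib


def bw_fallback_id(name, address, zip_code, city):
    """Build a BW-FB-<sha256> ID from name, address, zip and city.

    Hypothetical helper for illustration; missing (None) fields are
    treated as empty strings before joining.
    """
    parts = [name, address, zip_code, city]
    key = " ".join(part or "" for part in parts)
    # hashlib.sha256 works on bytes, so the key is encoded as UTF-8 first
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"BW-FB-{digest}"


# Made-up sample data, not a real school
example_id = bw_fallback_id("Beispielschule", "Musterstraße 1", "70173", "Stuttgart")
```

Because the hash covers the exact field values, any change in spelling, abbreviation or whitespace in the source data produces a different ID. The fallback is therefore only as stable as those four fields.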

## Geolocations
When available, we try to use the geolocations provided by the data publishers.

17 changes: 12 additions & 5 deletions jedeschule/spiders/baden_wuerttemberg.py
@@ -1,3 +1,5 @@
import hashlib

import re
import scrapy
from scrapy import Item
@@ -32,6 +34,8 @@ def extract_disch(email: str | None) -> str | None:
    match = DISCH_RE.search(email.strip())
    return match.group(1) if match else None

def create_address_based_fallback(address, city, zip):
    # hashlib.sha256 requires bytes, so the key is encoded before hashing
    return hashlib.sha256(f"{address} {zip} {city}".encode("utf-8")).hexdigest()

class BadenWuerttembergSpider(SchoolSpider):
    name = "baden-wuerttemberg"
@@ -136,13 +140,16 @@ def parse(self, response):

    @staticmethod
    def normalize(item: Item) -> School:
        # Prefer DISCH (stable government ID) over UUID when available
        disch = item.get("disch")
        uuid = item.get("uuid")
        school_id = f"BW-{disch}" if disch else f"BW-UUID-{uuid}"
        def id():
            # Prefer DISCH (stable government ID) when available
            if disch := item.get('disch'):
                return f'{disch}'
            key = " ".join([item.get(key) or "" for key in ['name', 'address', 'zip', 'city']])
            key_hash = hashlib.sha256(key.encode('utf-8')).hexdigest()
            return f'FB-{key_hash}'

        return School(
            id=school_id,
            id=f"BW-{id()}",
            name=item.get("name"),
            address=item.get("address"),
            zip=item.get("zip"),