Data quality assessment toolkit for STRADA (Swedish Traffic Accident Data Acquisition) datasets.
STRADA is a national information system for road traffic injuries managed by the Swedish Transport Agency (Transportstyrelsen). This toolbox automates data-quality checks for the two core STRADA tables — Olyckor (Crashes) and Personer (Persons) — and provides both a command-line interface and a web dashboard so that researchers with any level of coding experience can use it.
- Quick Start
- Installation
- Usage — Command-Line Interface (CLI)
- Usage — Web Dashboard
- Verification Checks Reference
- Micromobility Classification
- Report Formats
- Project Structure
- Configuration & Customisation
- Workflow Diagram
- Contributing
- License
# 1. Install
cd STRADA_toolbox
pip install .
# 2. Run all generic data-quality checks
strada verify \
--olyckor path/to/Olyckor.csv \
--personer path/to/Personer.csv
# 3. Include cycling-specific checks
strada verify \
--olyckor path/to/Olyckor.csv \
--personer path/to/Personer.csv \
--cycling
# 4. Or launch the web dashboard (no terminal needed after this)
strada web
- Python 3.9+
- The STRADA data files (.xlsx workbook or pre-exported .csv files)
# Clone / download this repository
cd STRADA_toolbox
# Option A: install in editable mode (recommended for development)
pip install -e .
# Option B: install normally
pip install .
The web dashboard uses Streamlit, which is included as an optional dependency:
pip install -e ".[web]"
python -m venv .venv
# Windows
.venv\Scripts\Activate.ps1
# macOS / Linux
source .venv/bin/activate
pip install -e ".[web]"
pip install -r requirements.txt
After installation, the strada command is available in your terminal.
Run strada --help to see all commands:
Usage: strada [OPTIONS] COMMAND [ARGS]...
STRADA Data Quality Assessment Toolkit
╭─ Commands ────────────────────────────────────────────────────╮
│ preprocess Convert a STRADA Excel workbook to CSV │
│ verify Run data-quality verification checks │
│ classify Classify micromobility types (cycling analysis) │
│ web Launch the web dashboard │
╰───────────────────────────────────────────────────────────────╯
Converts a STRADA Excel workbook (.xlsx) into two CSV files and optionally filters by year range.
strada preprocess \
--excel-file "Olyckor_Personer_2005-2024.xlsx" \
--output-dir ./data \
--start-year 2016 \
--end-year 2024

| Option | Description |
|---|---|
| --excel-file, -e | Path to the .xlsx workbook (required) |
| --output-dir, -o | Directory for output CSV files (required) |
| --start-year | Start of year filter (inclusive) |
| --end-year | End of year filter (inclusive) |
| --olyckor-sheet | Sheet name for crashes (default: Olyckor) |
| --personer-sheet | Sheet name for persons (default: Personer) |
What it does:
- Reads the Olyckor and Personer sheets from the Excel file
- Replaces in-cell line breaks (\n, \r) with spaces
- Saves Olyckor.csv and Personer.csv in the output directory
- If year range is given, also saves Olyckor-2016-2024.csv and Personer-2016-2024.csv
Runs data-quality verification checks on a pair of CSV files.
# Run all generic checks
strada verify \
--olyckor Olyckor.csv \
--personer Personer.csv
# Include cycling-specific checks
strada verify \
--olyckor Olyckor.csv \
--personer Personer.csv \
--cycling
# Run only specific checks
strada verify \
--olyckor Olyckor.csv \
--personer Personer.csv \
--checks G1 G4 G5
# Change output directory and format
strada verify \
--olyckor Olyckor.csv \
--personer Personer.csv \
--output-dir ./reports \
--format csv

| Option | Description |
|---|---|
| --olyckor | Path to crashes CSV (required) |
| --personer | Path to persons CSV (required) |
| --output-dir, -o | Directory for reports (default: .) |
| --cycling | Include cycling-specific checks C1–C3 |
| --checks | Space-separated check IDs to run (e.g. G1 G4 C2) |
| --format | Report format: txt, csv, or both (default: both) |
Output files:
- strada_quality_report.txt: Human-readable text report
- strada_quality_report.csv: Machine-readable CSV (one row per issue)
Classifies Cykel entries into micromobility types and adds a conflict-partner column.
strada classify \
--personer Personer-verified.csv \
--output-dir ./data \
--output-name Personer-analysis-ready.csv

| Option | Description |
|---|---|
| --personer | Path to persons CSV (required) |
| --output-dir, -o | Directory for output (default: .) |
| --output-name | Output file name (default: Personer-analysis-ready.csv) |
What it adds:
- Micromobility_type column: Conventional bicycle, E-bike, E-scooter, rullstol/permobil, other_micromobility, Unknown, or N/A (non-Cykel rows)
- Conflict_partner column: Other road-user types in the same crash (e.g. Personbil, Fotgängare), or Single for single-vehicle crashes
strada web # default port 8501
strada web --port 8080 # custom port
Opens a browser-based dashboard. See the Web Dashboard section for details.
The web dashboard provides the same functionality as the CLI but through a graphical interface. It is designed for users who are less comfortable with command-line tools.
strada web
This opens your browser at http://localhost:8501 with four tabs:
- Upload your Olyckor and Personer CSV files
- Select which checks to run (checkboxes for each G1–G6 and C1–C3)
- Click ▶ Run selected checks
- Browse results interactively in expandable tables
- Download text or CSV reports
- Upload your Personer CSV
- Click ▶ Run classification
- View the micromobility type distribution
- Download the classified dataset
- Upload a STRADA Excel workbook
- Optionally set a year range filter
- Click ▶ Convert
- Download the resulting CSV files
Documentation and links.
These checks apply to any STRADA analysis, regardless of road-user type.
Verifies that every Olycksnummer in the Olyckor dataset has at least one matching entry in the Personer dataset, and vice versa.
- Why it matters: Missing crash IDs indicate data extraction issues or incomplete joins.
- What is flagged: IDs that exist in one dataset but not the other.
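The two-way ID comparison above can be pictured as a pair of set differences. This is a minimal stdlib sketch, not the toolkit's own implementation; the function name find_unmatched_crash_ids is hypothetical:

```python
def find_unmatched_crash_ids(olyckor_ids, personer_ids):
    """Return crash IDs that appear in only one of the two datasets."""
    olyckor_set = set(olyckor_ids)
    personer_set = set(personer_ids)  # repeated IDs (several persons per crash) collapse
    return {
        "only_in_olyckor": sorted(olyckor_set - personer_set),
        "only_in_personer": sorted(personer_set - olyckor_set),
    }


# Crash 101 has no person rows; crash 104 has person rows but no crash row.
result = find_unmatched_crash_ids([101, 102, 103], [102, 103, 103, 104])
```

Both lists being empty corresponds to the check passing.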
Two sub-checks:
- G2.1: Checks for missing Olyckstyp values in both datasets.
- G2.2: For each crash ID present in both datasets, verifies that the Olyckstyp value matches.
- Why it matters: Inconsistent crash types between datasets may indicate data entry errors or misaligned records.
Four sub-checks on the Personer dataset:
- G3.1: At least one of the three category columns (Trafikantkategori (P) - Undergrupp, Trafikantkategori (S) - Undergrupp, Sammanvägd Trafikantkategori - Undergrupp) must be filled.
- G3.2: When both P and S are filled, they should match.
- G3.3: When P or S is filled, it should match Sammanvägd (allows prefix matching, e.g. "Lastbil (lätt)" matches "Lastbil").
- G3.4: When both P and S are filled, at least one should match Sammanvägd.
- Why it matters: The Sammanvägd (combined) category is derived from P (Police) and S (Hospital) reports. Discrepancies may indicate classification errors.
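The prefix matching in G3.3 can be sketched as below. The README does not say which of the two values may be the longer one, so this sketch assumes either direction counts as a match and that the comparison ignores case and surrounding whitespace; the function name is hypothetical:

```python
def matches_sammanvagd(source_value, sammanvagd_value):
    """True when one category label is a prefix of the other.

    Assumptions (not confirmed by the toolkit): matching is case-insensitive,
    ignores surrounding whitespace, and works in both directions.
    """
    source = source_value.strip().lower()
    combined = sammanvagd_value.strip().lower()
    if not source or not combined:
        return False  # an empty value cannot confirm a match
    return source.startswith(combined) or combined.startswith(source)


light_truck = matches_sammanvagd("Lastbil (lätt)", "Lastbil")
car_vs_truck = matches_sammanvagd("Personbil", "Lastbil")
```

With this rule, a more specific subgroup label is accepted as consistent with its broader category, while unrelated categories are still flagged.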
For each crash with multiple person entries, verifies that:
- The date (År, Månad, Dag) is the same across all entries.
- The time (Klockslag grupp (timme)) is the same across all entries.
Date mismatches are reported first, followed by time mismatches sorted by the magnitude of the time difference.
- Why it matters: All persons in the same crash should have the same date and time.
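The date part of this check amounts to grouping person rows by crash ID and counting distinct dates. A minimal sketch using plain dictionaries in place of the real Personer table; the function name is hypothetical, and the time comparison would follow the same pattern with Klockslag grupp (timme):

```python
from collections import defaultdict


def find_date_mismatches(person_rows):
    """Map each crash ID to its distinct (År, Månad, Dag) dates when there is more than one."""
    dates_by_crash = defaultdict(set)
    for row in person_rows:
        dates_by_crash[row["Olycksnummer"]].add((row["År"], row["Månad"], row["Dag"]))
    return {
        crash_id: sorted(dates)
        for crash_id, dates in dates_by_crash.items()
        if len(dates) > 1
    }


rows = [
    {"Olycksnummer": 1, "År": 2020, "Månad": 5, "Dag": 17},
    {"Olycksnummer": 1, "År": 2020, "Månad": 5, "Dag": 18},  # differs from the row above
    {"Olycksnummer": 2, "År": 2021, "Månad": 1, "Dag": 3},
    {"Olycksnummer": 2, "År": 2021, "Månad": 1, "Dag": 3},
]
mismatches = find_date_mismatches(rows)
```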
For each crash with multiple person entries, verifies that Län (county) and Kommun (municipality) are consistent.
- Why it matters: All persons in the same crash should be at the same location.
Identifies potential duplicate person entries across different crashes. Groups persons by:
- Age (Ålder), Gender (Kön)
- Date (År, Månad, Dag), Time (Klockslag grupp (timme))
- Location (Län, Kommun, Olycksväg/-gata)
- Road-user type (Sammanvägd Trafikantkategori - Huvudgrupp)
If the same combination of all these values appears in multiple different crash IDs, it is flagged as a potential duplicate. Rows with missing age or unknown gender are excluded.
- Why it matters: The same traffic incident may have been registered as multiple separate crashes. Including the road-user type ensures that different road users at the same time/place are not incorrectly flagged.
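The grouping logic can be sketched as follows, with plain dictionaries standing in for Personer rows. The column names come from the list above; the function name and the "Okänt" label for unknown gender are assumptions made for illustration only:

```python
from collections import defaultdict

KEY_COLUMNS = [
    "Ålder", "Kön",
    "År", "Månad", "Dag", "Klockslag grupp (timme)",
    "Län", "Kommun", "Olycksväg/-gata",
    "Sammanvägd Trafikantkategori - Huvudgrupp",
]
UNKNOWN_GENDER = "Okänt"  # assumed label; check the actual STRADA export


def find_potential_duplicates(person_rows):
    """Return lists of crash IDs that share an identical person profile."""
    crash_ids_by_key = defaultdict(set)
    for row in person_rows:
        # Rows with missing age or unknown gender are excluded from the comparison.
        if row.get("Ålder") in (None, "") or row.get("Kön") in (None, "", UNKNOWN_GENDER):
            continue
        key = tuple(row.get(column) for column in KEY_COLUMNS)
        crash_ids_by_key[key].add(row["Olycksnummer"])
    return [sorted(ids) for ids in crash_ids_by_key.values() if len(ids) > 1]


base = {
    "Ålder": 34, "Kön": "Man", "År": 2020, "Månad": 5, "Dag": 17,
    "Klockslag grupp (timme)": "08", "Län": "Skåne län", "Kommun": "Lund",
    "Olycksväg/-gata": "Stora Södergatan",
    "Sammanvägd Trafikantkategori - Huvudgrupp": "Cykel",
}
rows = [
    {**base, "Olycksnummer": 1},
    {**base, "Olycksnummer": 2},                     # same profile, different crash ID
    {**base, "Olycksnummer": 3, "Kommun": "Malmö"},  # different location
    {**base, "Olycksnummer": 4, "Ålder": ""},        # missing age, excluded
]
duplicates = find_potential_duplicates(rows)
```

A flagged group is only a candidate: two different people of the same age and gender can crash at the same place and hour, so each group still needs manual review.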
These checks are relevant when the dataset has been filtered to cycling / micromobility crashes. Enable them with --cycling.
For crashes typed G1 (cykel singel):
- There should be exactly one person entry.
- That entry should have Sammanvägd Trafikantkategori - Huvudgrupp == "Cykel".
- When multiple persons exist, the count of passengers (identified by "Passagerare" in role columns) is reported.
Verifies that every crash has at least one person with Huvudgrupp == "Cykel". Relevant only when the dataset was extracted as a cycling dataset.
Flags crashes where all Cykel entries are passengers (no driver/cyclist). This can indicate a data-entry issue where the cyclist is missing from the record.
The classify command / Classify tab is specific to cycling/micromobility analyses. It processes the free-text event descriptions (Händelseförlopp (P) and (S)) to determine whether each Cykel entry is:
| Type | Description |
|---|---|
| Conventional bicycle | Standard pedal-powered bicycle |
| E-bike | Electrically assisted bicycle |
| E-scooter | Electric kick-scooter (elsparkcykel) |
| rullstol/permobil | Wheelchair / powered wheelchair |
| other_micromobility | Skateboard, hoverboard, moped, etc. |
| Unknown | Both event description columns are empty |
| N/A | Not a Cykel entry |
- Priority column: Händelseförlopp (P) is checked first; (S) is used only if (P) is empty.
- Keyword matching: Case-insensitive search for Swedish keywords (e.g., "elcykel", "elsparkcykel", "voi"). Brand names like "voi", "lime", "bird" use whole-word matching to avoid false positives.
- Multi-match resolution: If multiple categories match, priority order is: E-scooter > E-bike > rullstol/permobil > other_micromobility > Conventional bicycle.
- Fallback: If no keywords match, the Sammanvägd Trafikantkategori - Undergrupp column is checked (Elcykel → E-bike, Eldrivet enpersonsfordon → E-scooter).
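The first three rules can be illustrated with a short sketch. The keyword lists here are a small illustrative subset, not the toolkit's real lists in strada/config/constants.py, and the function name is hypothetical. The Undergrupp fallback is left to the caller, signalled by a None return:

```python
import re

# Illustrative subset only; the real keyword lists live in strada/config/constants.py.
SUBSTRING_KEYWORDS = {
    "E-scooter": ["elsparkcykel", "elscooter"],
    "E-bike": ["elcykel"],
}
WHOLE_WORD_KEYWORDS = {"E-scooter": ["voi", "lime", "bird"]}
PRIORITY = [
    "E-scooter", "E-bike", "rullstol/permobil",
    "other_micromobility", "Conventional bicycle",
]


def classify_description(text_p, text_s=""):
    """Classify one Cykel entry from its event descriptions.

    Returns "Unknown" when both descriptions are empty, and None when no
    keyword matches (the caller would then apply the Undergrupp fallback).
    """
    # (P) takes priority; (S) is consulted only when (P) is empty.
    text = (text_p or "").strip() or (text_s or "").strip()
    if not text:
        return "Unknown"
    lowered = text.lower()
    matched = set()
    for category, keywords in SUBSTRING_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            matched.add(category)
    for category, brands in WHOLE_WORD_KEYWORDS.items():
        # \b word boundaries stop "voi" from matching inside longer words.
        if any(re.search(rf"\b{re.escape(brand)}\b", lowered) for brand in brands):
            matched.add(category)
    for category in PRIORITY:
        if category in matched:
            return category
    return None


both_match = classify_description("Cyklist på elcykel krockade med en Voi")
no_text = classify_description("", "")
embedded = classify_description("an avoidance manoeuvre")
```

In the first call both E-bike and E-scooter keywords match, and the priority order resolves the tie in favour of E-scooter. The third call shows why brand names need whole-word matching: "avoidance" contains "voi" but is not a match.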
The Conflict_partner column lists the road-user types of the other persons involved in the same crash. For single-vehicle crashes, the value is "Single".
Human-readable summary with:
- Overview table showing pass/fail status for each check
- Detailed sections listing every flagged record
- Suitable for quick review and documentation
Machine-readable table with columns:
| Column | Description |
|---|---|
| check_id | Check identifier (e.g. G1, G3.2) |
| check_name | Human-readable check name |
| crash_id | Affected Olycksnummer |
| issue | Summary of the issue |
| details | Semicolon-separated key=value pairs |
This format is ideal for:
- Opening in Excel for review
- Filtering and sorting issues
- Programmatic downstream processing
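For downstream processing, the details column can be unpacked into a dictionary. A minimal sketch, assuming keys and values contain no semicolons; the function name and the example keys are hypothetical, since the actual keys depend on the check that produced the row:

```python
def parse_details(details):
    """Split a 'key=value; key=value' string into a dictionary."""
    pairs = {}
    for part in details.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate trailing or doubled semicolons
        # partition splits on the first "=" only, so values may contain "=".
        key, _, value = part.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


parsed = parse_details("column=Kommun; values=Lund|Malmö")
```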
STRADA_toolbox/
├── pyproject.toml # Package build configuration
├── requirements.txt # Dependencies (alternative to pip install .)
├── README.md # This file
│
└── strada/ # Python package
├── __init__.py
├── cli.py # Typer CLI (entry point: strada)
├── app.py # Streamlit web dashboard
│
├── config/
│ ├── __init__.py # Re-exports from constants
│ └── constants.py # All column names, keywords, magic strings
│
├── core/
│ ├── __init__.py
│ ├── preprocess.py # Excel→CSV conversion, year filtering
│ ├── verify.py # All 9 verification checks (G1–G6, C1–C3)
│ └── classify.py # Micromobility classification
│
└── io/
├── __init__.py
├── readers.py # CSV / Excel loading with encoding handling
└── reporters.py # Text and CSV report generation
- Separation of concerns: Core logic (core/) is independent of the interface. Both cli.py and app.py call the same functions.
- Centralised constants: All column names, keywords, and magic strings are in config/constants.py. If the STRADA schema changes, only one file needs updating.
- Structured results: Every check returns a VerificationResult dataclass, making it easy to add new report formats or interfaces.
- No hardcoded paths: All file paths are passed as arguments.
To add or remove micromobility keywords, edit strada/config/constants.py:
MICROMOBILITY_KEYWORDS = {
"E-scooter": [
"elscooter", "elspark", ...
# Add your keywords here
],
...
}
- Create a new function in strada/core/verify.py following the pattern:
def check_g7_my_new_check(df_olyckor, df_personer) -> VerificationResult:
# ... your logic ...
return VerificationResult(
check_id="G7",
check_name="My new check",
status="pass" if no_issues else "warning",
summary="...",
issue_count=n,
details=df_details,
)
- Add it to the GENERIC_CHECKS or CYCLING_CHECKS list at the bottom of the file.
- The CLI and web dashboard will automatically pick it up.
All column names are defined as constants in strada/config/constants.py. If a STRADA export uses different column names, update the constants there.
┌────────────────────┐
│ STRADA Excel file │
│ (.xlsx workbook) │
└────────┬───────────┘
│
▼
┌────────────────────┐
│ strada preprocess │ ← Converts Excel → CSV, optional year filter
│ │
│ Output: │
│ • Olyckor.csv │
│ • Personer.csv │
└────────┬───────────┘
│
▼
┌────────────────────┐
│ strada verify │ ← Runs G1–G6 (generic) + C1–C3 (cycling, optional)
│ │
│ Output: │
│ • .txt report │
│ • .csv report │
└────────┬───────────┘
│
│ (User reviews report, decides which records
│ to exclude from analysis)
│
▼
┌────────────────────┐
│ strada classify │ ← Cycling-specific: E-scooter / E-bike / etc.
│ (optional) │
│ │
│ Output: │
│ • Personer- │
│ analysis- │
│ ready.csv │
└────────────────────┘
- Fork this repository
- Create a feature branch (git checkout -b feature/my-new-check)
- Make your changes and add tests
- Run pip install -e ".[dev]" and pytest
- Submit a pull request
MIT License. See LICENSE for details.
Developed for the Swedish STRADA research community.