Consultation procedures for the people

Demokratis.ch | Slack | team@demokratis.ch | 🤗 demokratis

Demokratis is released under the MIT License

🚀 What's Demokratis?

Demokratis.ch makes it easier to participate in Swiss consultation procedures in order to better influence the legislative process at the federal and cantonal level.

About Demokratis

The consultation procedure is a fundamental but lesser-known part of Swiss democracy. While in theory the consultation procedure is open to everyone, the barriers to participation are rather high. Demokratis.ch is an accessible and user-friendly web platform which makes it easy to explore, contribute to and monitor consultation procedures in Switzerland.

Demokratis is developed and run as a civil society initiative and you are most welcome to join us!

About machine learning at Demokratis

We use machine learning to process and understand the legal text drafts (Vorlagen) that are the subject of the consultation procedure, as well as to process related documents such as reports and letters accompanying the drafts.

The machine learning stack runs separately from the main Demokratis.ch website. The outputs of the models are always reviewed by a human before being displayed on the website.

How to contribute

As a community-driven project in its early stages, we welcome your feedback and contributions! We're excited to collaborate with the civic tech, open data, and data science communities to improve consultation processes for all.

Join us on Slack in the #ml channel to say hello, ask questions, and discuss our data and models.

The challenges of understanding legal text with machine learning are complex. If you have experience in NLP or ML, we’d love your input! We can’t do this alone and appreciate any help or insights you can offer.

Tooling and code quality

We use uv to manage dependencies. After cloning the repository, run uv sync --dev to install all dependencies.
To ensure code quality and enforce a common standard, we use ruff and pre-commit to format code and eliminate common issues. To make sure pre-commit runs all checks automatically when you commit, install the git hooks with uv run pre-commit install.
We've started out with a fairly strict ruff configuration. We expect to loosen some rules when they become too bothersome. A research project cannot be tied to the same rules as a big production app. Still, it's a lot easier to start with strict rules and gradually soften them than to go the other way around.
All code must be auto-formatted by ruff before being accepted into the repository. pre-commit hooks (or your code editor) will do that for you. To invoke the formatter manually, run uv run ruff format your_file.py. It works on Jupyter notebooks, too.

What data we use

We obtain information about federal and cantonal consultations through APIs and website scraping. For each consultation (Vernehmlassung) we typically collect a number of documents of various types:

The proposed law change (draft, "Vorlage", "Entwurf", ...)
A report explaining the proposed change ("Erläuternder Bericht")
Accompanying letters, questionnaires, synoptic tables, etc.

The documents are almost always just PDFs. We also get some metadata for the consultation itself, e.g. its title, starting and ending dates, and perhaps a short description.

See the Pandera schemata in demokratis_ml/data/schemata.py for a complete specification of the data we have on consultations and their documents.

Data acquisition and preprocessing

We use data from two main sources:

Fedlex for federal ("Bund") consultations.
Open Parl Data for cantonal consultations.

Document and consultation data is ingested from these sources into the Demokratis web platform running at Demokratis.ch. The web platform is our main source of truth. In addition to making the data available to end users, it also runs an admin interface that we use for manual review and correction of our database of consultations and their documents.

To transform the web platform data into a dataset for training models, we run a Prefect pipeline: demokratis_ml/pipelines/preprocess_consultation_documents.py. The result of this pipeline is a Parquet file conforming to the above-mentioned dataframe schema.

Our data is public

Our preprocessed dataset is automatically published to Hugging Face, and you can download it directly from 🤗 demokratis/consultation-documents. Don't hesitate to talk to us on Slack in #ml if you have any questions about the data!

Quickstart: this is how you can directly read our document dataframe and start filtering and exploring it straight away:

import pandas as pd

df = pd.read_parquet(
    "https://huggingface.co/datasets/demokratis/consultation-documents/resolve/main/consultation-documents-preprocessed.parquet",
)
df = df.loc[
    (df["document_language"] == "de")  # "fr" and "it" are also available
    & (df["political_body"] == "ch")  # filter for federal documents
    # & (df["political_body"] == "zh")  # (or filter for a particular canton)
    & (df["document_type"].isin(["DRAFT", "OPINION"]))  # filter for the text of the law and the statements
    & (df["consultation_start_date"].dt.year >= 2010)  # look at recent documents only
]
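
Once the dataframe is filtered, a quick summary helps to see what is in it. The following is a minimal sketch that runs offline on a tiny synthetic dataframe; the rows are made up for illustration, and only the column names are taken from the quickstart above.

```python
import pandas as pd

# Synthetic stand-in for the real dataset; only the column names come from the quickstart.
df = pd.DataFrame(
    {
        "document_language": ["de", "de", "fr", "de"],
        "political_body": ["ch", "zh", "ch", "ch"],
        "document_type": ["DRAFT", "REPORT", "DRAFT", "OPINION"],
        "consultation_start_date": pd.to_datetime(
            ["2012-03-01", "2015-06-15", "2009-01-10", "2021-11-30"]
        ),
    }
)

# Same style of filter as in the quickstart: German-language federal documents.
german_federal = df.loc[
    (df["document_language"] == "de") & (df["political_body"] == "ch")
]

# Count how many documents of each type remain after filtering.
counts = german_federal["document_type"].value_counts().to_dict()
print(counts)
```

On the real dataset, the same `value_counts()` call gives a first impression of how unevenly the document types are distributed.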

Our models and open ML problems

Current status

Problem	Public dataset?	Initial research	Proof of concept model	Deployed in production	Languages supported	Notes
I. Classifying consultation topics	✅	✅	✅	✅	de	Only 9 out of 26 topics supported.
II. Extracting structure from documents	✅(*)	✅	❌	❌
III. Classifying document types	✅	✅	✅	✅	de	10 out of 13 types supported; not enough samples to train for the remaining 3. Documents from cantons BL, GE, NE, SZ, VD, VS are not supported due to data quality issues.

*) We haven't published our copies of the source PDFs, but our public dataset does include links to the original files hosted by cantons and the federal government.

I. Classifying consultation topics

We need to classify each new consultation into one or more topics (such as agriculture, energy, health, ...) so that users can easily filter and browse consultations in their area of interest. We also support email notifications, where users can subscribe to receive new consultations on their selected topics by email.

Our dataset

Our dataset – consultations & topics in an M:N relationship – is labelled manually. We also experimented with weak pattern-matching rules and topics coming from Open Parl Data, but these label sources proved too inconsistent with our own labelling guidelines. You can see the full list of our topics in demokratis_ml/data/schemata.py:CONSULTATION_TOPICS.

Our model

For each consultation, we create a vector by concatenating the embedding of the consultation title, the embedding of the publishing organisation name, and the average of the embeddings of several documents pertaining to the consultation. We select these documents by type (see Problem III). We would ideally include an embedding of the consultation's description as well, but we're currently missing descriptions for a large number of consultations.
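
The construction of the input vector can be sketched as follows. This is an illustration only, not the project's code: the `fake_embedding` helper and the toy dimension are assumptions, and random vectors stand in for real embeddings.

```python
import numpy as np

EMBEDDING_DIM = 8  # toy size for illustration; real embedding vectors are much larger

rng = np.random.default_rng(seed=0)


def fake_embedding() -> np.ndarray:
    """Hypothetical stand-in for a call to a real text embedding model."""
    return rng.normal(size=EMBEDDING_DIM)


def build_consultation_vector(title_emb, organisation_emb, document_embs):
    """Concatenate the title, organisation and mean document embeddings."""
    mean_document_emb = np.mean(document_embs, axis=0)
    return np.concatenate([title_emb, organisation_emb, mean_document_emb])


title_emb = fake_embedding()
organisation_emb = fake_embedding()
document_embs = [fake_embedding() for _ in range(3)]

vector = build_consultation_vector(title_emb, organisation_emb, document_embs)
print(vector.shape)  # (24,)
```

Averaging the document embeddings keeps the vector length fixed regardless of how many documents a consultation has.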

We found that OpenAI embeddings work better than jina-embeddings-v2-base-de, which in turn works better than general-purpose sentence transformer models.

The model itself is a simple linear pipeline because our small training set (fewer than 2,000 labelled consultations) does not support more complex models.

graph LR

subgraph features
  i1[/consultation_title/] --> e1[text-embedding-3-large]
  i2[/organisation_name/] --> e2[text-embedding-3-large]
  i3[/document 1/] --> e3[text-embedding-3-large]
  i4[/document 2/] --> e4[text-embedding-3-large]
  i5[/document .../] --> e5[text-embedding-3-large]
  e3 --> mean
  e4 --> mean
  e5 --> mean
end

subgraph input matrix construction
  e1 ---> hstack["hstack<br>[title | org | documents]<br>[title | org | documents]<br>..."]
  e2 ---> hstack
  mean --> hstack
end

subgraph model
  hstack --> StandardScaler
  StandardScaler --> PCA
  PCA --> LogisticRegression

end
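
The model part of the diagram above could look roughly like this in scikit-learn. This is a sketch under stated assumptions, not the project's actual code: the data is synthetic, the number of PCA components is arbitrary, and the `OneVsRestClassifier` wrapper is our assumption for handling multiple topics per consultation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=0)

# Synthetic stand-ins: 200 consultations, 24-dimensional input vectors, 3 topics.
n_samples, n_features, n_topics = 200, 24, 3
X = rng.normal(size=(n_samples, n_features))
# Each toy topic depends on one input dimension so there is something to learn.
Y = (X[:, :n_topics] > 0).astype(int)

classifier = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),  # hypothetical value, chosen only for this toy example
    # Assumption: one binary logistic regression per topic (multi-label setup).
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
classifier.fit(X, Y)

predictions = classifier.predict(X[:5])
print(predictions.shape)  # (5, 3): one 0/1 indicator per consultation and topic
```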

Potential for improvement

We experimented with fine-tuning a domain-specific language model from the 🤗 joelniklaus/legallms collection, training it directly for multi-label classification. These pre-trained models were introduced in the paper MultiLegalPile: A 689GB Multilingual Legal Corpus. This approach showed some promise and we would like to try it again: see #22.

Current results

To get a model usable in production, we've restricted it to just 9 topics for which it performs well. We expect to increase topic coverage as we label more training data.

Label	Precision	Recall	F1-Score	Support
agriculture	1.00	0.82	0.90	11
education	1.00	0.91	0.95	11
energy	1.00	0.92	0.96	12
health	0.79	0.90	0.84	21
insurance	0.83	0.83	0.83	12
migration	1.00	0.70	0.82	10
political_system	1.00	0.60	0.75	5
sports	1.00	1.00	1.00	4
transportation	1.00	0.85	0.92	13

Micro Avg	0.92	0.85	0.88	99
Macro Avg	0.96	0.84	0.89	99
Weighted Avg	0.94	0.85	0.88	99
Samples Avg	0.94	0.86	0.86	99
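
To make the averages easier to interpret, the macro-averaged row can be recomputed from the per-label rows. The sketch below uses the rounded values from the table, so in general a recomputed figure may differ from the reported one in the last digit; the function names are ours.

```python
# (precision, recall, f1, support) per label, copied from the table above.
results = {
    "agriculture": (1.00, 0.82, 0.90, 11),
    "education": (1.00, 0.91, 0.95, 11),
    "energy": (1.00, 0.92, 0.96, 12),
    "health": (0.79, 0.90, 0.84, 21),
    "insurance": (0.83, 0.83, 0.83, 12),
    "migration": (1.00, 0.70, 0.82, 10),
    "political_system": (1.00, 0.60, 0.75, 5),
    "sports": (1.00, 1.00, 1.00, 4),
    "transportation": (1.00, 0.85, 0.92, 13),
}


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(values) -> float:
    """Unweighted mean over labels: every label counts equally."""
    values = list(values)
    return sum(values) / len(values)


macro_precision = macro_average(p for p, _, _, _ in results.values())
macro_recall = macro_average(r for _, r, _, _ in results.values())
macro_f1 = macro_average(f for _, _, f, _ in results.values())
total_support = sum(s for _, _, _, s in results.values())

print(round(macro_precision, 2), round(macro_recall, 2), round(macro_f1, 2), total_support)
# 0.96 0.84 0.89 99
```

Because the macro average weights every label equally, a small class such as political_system (support 5) influences it as much as health (support 21).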

Code

Model code: demokratis_ml/models/consultation_topics/
Research & training: research/consultation_topics/VM_consultation_topic_classifier.ipynb
Production deployment: demokratis_ml/pipelines/predict_consultation_topics.py

II. Extracting structure from documents

An important goal of Demokratis is to make it easy for people and organisations to provide feedback (statements, Stellungnahmen) on consultations. To facilitate writing comments or suggesting edits on long, complex legal documents, we need to break them apart into sections, paragraphs, lists, footnotes, etc. Since all the consultation documents we can currently access are PDFs, it is surprisingly hard to extract machine-readable structure from them!

We are still researching the possible solutions to this problem. PR!4 is trying to use LlamaParse to convert PDFs to Markdown. We are also testing the open-source projects surya and docling.

The services typically used for extracting content from PDFs (AWS Textract, Azure Document AI, Adobe Document Services) do not seem to detect PDF layouts reliably. In particular, they do not consistently differentiate between headers, paragraphs, lists, or even footnotes.

III. Classifying document types

Each consultation consists of several documents: usually around 5, but sometimes as many as 20 or more. For each document, we're interested in what role it plays in the consultation: is it the actual draft of the proposed change? Is it an accompanying letter or report? (You can see the full list of document types in demokratis_ml/data/schemata.py:DOCUMENT_TYPES.)

For federal consultations, we automatically get this label from the Fedlex API. However, cantonal documents do not have roles (types) assigned, so we need to train a model.

Our datasets

We labelled a part of the cantonal dataset manually and through weak rules on file names (e.g. label files called 'Adressatenliste.pdf' as RECIPIENT_LIST). We also used the entire federal dataset for training because it comes already labelled.
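
Weak labelling by file name can be sketched as a short list of regular expressions. Only the 'Adressatenliste' rule is taken from the text above; the other patterns and the `weak_label` function are hypothetical examples, not the project's actual rules.

```python
import re
from typing import Optional

# Ordered list of (pattern, label); the first matching rule wins.
FILENAME_RULES = [
    (re.compile(r"adressatenliste", re.IGNORECASE), "RECIPIENT_LIST"),  # from the text above
    (re.compile(r"begleitschreiben", re.IGNORECASE), "LETTER"),  # hypothetical rule
    (re.compile(r"synopse", re.IGNORECASE), "SYNOPTIC_TABLE"),  # hypothetical rule
]


def weak_label(filename: str) -> Optional[str]:
    """Return a weak label for a file name, or None if no rule matches."""
    for pattern, label in FILENAME_RULES:
        if pattern.search(filename):
            return label
    return None


print(weak_label("Adressatenliste.pdf"))  # RECIPIENT_LIST
print(weak_label("entwurf_v2.pdf"))  # None
```

Files without a matching rule stay unlabelled and would need manual labelling or a trained model.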

We merge some of the most underrepresented document types into VARIOUS_TEXT (the "everything else" class) before training and evaluation.
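
One simple way to implement such a merge is a frequency threshold. This is a sketch under assumptions: the threshold, the `merge_rare_types` function and the `RARE_TYPE_A` label are hypothetical, and the project may well select the merged types by hand rather than by count.

```python
from collections import Counter


def merge_rare_types(labels, min_samples, fallback="VARIOUS_TEXT"):
    """Replace every label seen fewer than min_samples times with the fallback class."""
    counts = Counter(labels)
    return [label if counts[label] >= min_samples else fallback for label in labels]


# Toy label list; RARE_TYPE_A is a made-up placeholder, not a real document type.
labels = ["DRAFT"] * 5 + ["REPORT"] * 4 + ["RARE_TYPE_A"] * 1 + ["VARIOUS_TEXT"] * 3
merged = merge_rare_types(labels, min_samples=3)

print(Counter(merged))
```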

Documents from cantons BL, GE, NE, SZ, VD, VS are not used for training and evaluation because we are experiencing many data quality issues, and consequently poor model performance, for these cantons.

We're only working with German-language documents in training, evaluation, and inference. This is a temporary limitation that we'd like to remove: see issue !26.

Our model

Our classifier uses three types of features:

Document texts embedded with OpenAI's text-embedding-3-large model, with dimensions reduced by PCA
Simple boolean flags extracted by regular expressions on document texts (e.g. "does the text contain a formal greeting like Sehr\s+geehrte[r]?\s+(?:Frau|Herr|Damen\s+und\s+Herren)?")
Some features extracted from the actual PDF documents, e.g. aspect ratio, number of tables, page count, etc.

We then classify these input vectors with a simple scikit-learn pipeline using StandardScaler and SVC.
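
The boolean regex flags from the feature list above can be sketched as follows. The greeting pattern is taken from the example in the list, assuming its trailing question mark is punctuation rather than part of the regular expression; the article-heading flag and the function name are hypothetical.

```python
import re

# Formal greeting pattern from the feature list above.
FORMAL_GREETING = re.compile(r"Sehr\s+geehrte[r]?\s+(?:Frau|Herr|Damen\s+und\s+Herren)")
# Hypothetical second flag: a line starting with an article heading such as "Art. 1".
ARTICLE_HEADING = re.compile(r"^\s*Art\.\s*\d+", re.MULTILINE)


def extract_flags(text: str) -> dict:
    """Return simple boolean features for one document text."""
    return {
        "has_formal_greeting": FORMAL_GREETING.search(text) is not None,
        "has_article_heading": ARTICLE_HEADING.search(text) is not None,
    }


letter = "Sehr geehrte Damen und Herren\nWir laden Sie zur Stellungnahme ein."
draft = "Art. 1 Zweck\nDieses Gesetz regelt ..."

print(extract_flags(letter))  # {'has_formal_greeting': True, 'has_article_heading': False}
print(extract_flags(draft))   # {'has_formal_greeting': False, 'has_article_heading': True}
```

Flags like these can be converted to 0/1 values and appended to the embedding-based features before scaling.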

Current results

Only manually labelled cantonal documents are used for this evaluation to ensure that we're benchmarking against the most relevant data. In production, the model is only ever used to classify cantonal documents.

Label	Precision	Recall	F1-Score	Support
DRAFT	0.88	0.90	0.89	58
FINAL_REPORT	0.89	0.53	0.67	15
LETTER	0.99	1.00	0.99	79
OPINION	0.57	1.00	0.73	4
RECIPIENT_LIST	1.00	1.00	1.00	37
REPORT	0.81	0.93	0.86	95
SURVEY	1.00	0.91	0.95	11
SYNOPTIC_TABLE	0.93	0.91	0.92	46
VARIOUS_TEXT	0.89	0.72	0.80	58

Accuracy			0.90	403
Macro Avg	0.88	0.88	0.87	403
Weighted Avg	0.90	0.90	0.90	403

Code

Model code: demokratis_ml/models/document_types/
Research & training: research/document_types/VM_document_type_classifier.ipynb
Production deployment: demokratis_ml/pipelines/predict_document_types.py
