fix(notubiz): missing documents from several municipalities #513

Open
@BluntKatana

Description

Problem

I've found that the notubiz API contains several 'hidden' agenda items and documents which are currently not being scraped. This results in a large difference between the documents actually available on the municipalities' sites and those on ORI.

I have found two main issues, both related to the properties of agenda items. Currently, only the .documents[] of a meeting's agenda item (.agenda_items[]) are parsed. However, there are two more properties which are interesting:

  1. .module_items[]: A module item does not look interesting by itself (see below). But once we fetch the item using its .self property, we find that a module item can contain several documents.
    (see point 6 on municipality website)
    (see 7th agenda item: https://api.notubiz.nl/events/meetings/1152031?format=json&version=1.17.0)
Image
  2. .agenda_items[]: An agenda item can itself contain several more agenda items which (again) do not look interesting at first (see below), but when fetched directly they can of course contain documents again (and even more agenda items or module items).
    (they have a special suffix on the municipality website)
    (see 14th agenda item: https://api.notubiz.nl/events/meetings/1161553?format=json&version=1.17.0)
Image
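The two points above amount to a recursive walk over each agenda item. Below is a minimal sketch of that walk, not the ORI implementation. The property names `documents`, `module_items`, `agenda_items` and `self` are taken from the description above. The function name `iter_documents`, the injected `fetch` callable, and the assumption that a fetched item is a plain JSON object exposing those same properties directly are hypothetical and would need checking against real API responses.

```python
from typing import Any, Callable, Dict, Iterator, Optional, Set

Json = Dict[str, Any]


def iter_documents(
    item: Json,
    fetch: Callable[[str], Json],
    seen: Optional[Set[str]] = None,
) -> Iterator[Json]:
    """Yield every document reachable from an agenda item or module item.

    `fetch` is a hypothetical callable that takes the value of a `self`
    property and returns the parsed JSON object for that item.
    """
    if seen is None:
        seen = set()

    # Documents attached directly to this item (the only part parsed today).
    yield from item.get("documents") or []

    # Nested module items and agenda items look empty in the parent response,
    # so each one is fetched through its `self` link before descending.
    for key in ("module_items", "agenda_items"):
        for stub in item.get(key) or []:
            url = stub.get("self")
            if not url or url in seen:
                continue  # nothing to follow, or already visited
            seen.add(url)
            yield from iter_documents(fetch(url), fetch, seen)
```

Passing `fetch` in as a parameter keeps the traversal testable without network access, and the `seen` set guards against fetching the same item twice if two items ever reference each other.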

Some examples of missing documents

(Note that my simple scraper is also missing some documents at the moment, but it has better coverage of the notubiz API.)

Breda

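| year | scraped_from_notubiz | scraped_from_ori | in_notubiz_not_in_ori | in_ori_not_in_notubiz |
| --- | --- | --- | --- | --- |
| 2014 | 0 | 126 | 0 | 126 |
| 2015 | 0 | 578 | 0 | 578 |
| 2016 | 3740 | 1000 | 2856 | 116 |
| 2017 | 1745 | 1000 | 933 | 188 |
| 2018 | 431 | 243 | 198 | 10 |
| 2019 | 1915 | 218 | 1698 | 1 |
| 2020 | 227 | 157 | 72 | 2 |
| 2021 | 193 | 184 | 10 | 1 |
| 2022 | 220 | 186 | 37 | 3 |
| 2023 | 240 | 221 | 22 | 3 |
| 2024 | 206 | 155 | 56 | 5 |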

Waddinxveen

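| year | scraped_from_notubiz | scraped_from_ori | in_notubiz_not_in_ori | in_ori_not_in_notubiz |
| --- | --- | --- | --- | --- |
| 2014 | 0 | 964 | 0 | 964 |
| 2015 | 0 | 1000 | 0 | 1000 |
| 2016 | 5166 | 1000 | 4391 | 225 |
| 2017 | 2362 | 0 | 2362 | 0 |
| 2018 | 1514 | 988 | 601 | 75 |
| 2019 | 1603 | 993 | 647 | 37 |
| 2020 | 1593 | 469 | 1125 | 1 |
| 2021 | 1544 | 695 | 926 | 77 |
| 2022 | 1636 | 1000 | 662 | 26 |
| 2023 | 1523 | 1000 | 554 | 31 |
| 2024 | 1120 | 698 | 459 | 37 |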
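
The columns in the two tables above are consistent with a per-year set comparison. For Breda 2016, for example, 3740 - 2856 and 1000 - 116 both give 884 documents found in both sources. Below is a minimal sketch of such a comparison. How documents were actually matched between notubiz and ORI is not stated here, so the `(year, key)` input pairs and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

# (scraped_from_notubiz, scraped_from_ori,
#  in_notubiz_not_in_ori, in_ori_not_in_notubiz)
Row = Tuple[int, int, int, int]
Pairs = Iterable[Tuple[int, Hashable]]


def group_by_year(pairs: Pairs) -> Dict[int, Set[Hashable]]:
    """Group document keys by year; duplicate keys within a year collapse."""
    grouped: Dict[int, Set[Hashable]] = defaultdict(set)
    for year, key in pairs:
        grouped[year].add(key)
    return grouped


def compare_by_year(notubiz: Pairs, ori: Pairs) -> Dict[int, Row]:
    """Build one table row per year from two collections of (year, key)."""
    n = group_by_year(notubiz)
    o = group_by_year(ori)
    rows: Dict[int, Row] = {}
    for year in sorted(set(n) | set(o)):
        in_n = n.get(year, set())
        in_o = o.get(year, set())
        # Set differences give the documents present in only one source.
        rows[year] = (len(in_n), len(in_o), len(in_n - in_o), len(in_o - in_n))
    return rows
```

The choice of key (a document URL, an identifier, or something else) decides what counts as the same document, so the counts are only as reliable as that key.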

Bunschoten
Image

Enkhuizen
Image

IJsselstein
Image

Labels: bug (High priority issue for (blocking) problems)