Problem
I've found that the notubiz API contains several 'hidden' agenda items and documents which are currently not being scraped, resulting in a large difference between the documents actually available on the municipalities' sites and those on ORI.
I have found two main issues, both related to the properties of agenda items. Currently, an agenda item from a meeting (`.agenda_items[]`) is parsed only for its `.documents[]`. However, there are two more properties of interest:
`.module_items[]`: A module item does not look interesting in itself (see below), but once we fetch the item via its `.self` property, we find that a module item can contain several documents.
(see point 6 on municipality website)
(see 7th agenda item: https://api.notubiz.nl/events/meetings/1152031?format=json&version=1.17.0)
[Image]
`.agenda_items[]`: An agenda item can itself contain several more agenda items, which (again) do not look interesting at first (see below), but when fetched directly they can of course contain documents as well (and even more agenda items or module items).
(they have a special suffix on the municipality website)
(see 14th agenda item: https://api.notubiz.nl/events/meetings/1161553?format=json&version=1.17.0)
[Image]
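The traversal described above could be sketched roughly as follows. This is only a minimal sketch, not the ORI scraper's actual code: the property names `documents`, `agenda_items`, `module_items` and `self` come from the API as described in this issue, while the `iter_documents` function, the `fetch` callable, the `id` field, the example URLs and the assumption that a fetched item exposes the same properties at its top level are all hypothetical.

```python
from typing import Callable, Iterator, Optional

# Hypothetical: any callable that takes a URL and returns the parsed JSON dict.
Fetch = Callable[[str], dict]


def iter_documents(item: dict, fetch: Fetch, seen: Optional[set] = None) -> Iterator[dict]:
    """Yield the documents of an agenda item and of everything nested under it."""
    seen = set() if seen is None else seen
    # Documents listed directly on the item (the only ones parsed today).
    yield from item.get("documents", [])
    # Nested agenda items and module items are stubs; follow their `self` URL.
    for key in ("agenda_items", "module_items"):
        for stub in item.get(key, []):
            url = stub.get("self")
            if not url or url in seen:
                continue  # skip stubs without a URL and avoid refetching
            seen.add(url)
            yield from iter_documents(fetch(url), fetch, seen)


# Offline stand-in for the API (made-up URLs and ids, for illustration only).
A2 = "https://example.invalid/agenda_items/2"
M3 = "https://example.invalid/module_items/3"
fake_api = {
    A2: {"documents": [{"id": 11}], "module_items": [{"self": M3}]},
    M3: {"documents": [{"id": 12}]},
}
agenda_item = {
    "documents": [{"id": 10}],
    "agenda_items": [{"self": A2}],
    "module_items": [{"self": M3}],
}

ids = [doc["id"] for doc in iter_documents(agenda_item, fake_api.__getitem__)]
print(ids)
```

The `seen` set keeps the recursion from fetching the same item twice when it is reachable along more than one path, which also guards against cycles.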
Some examples of missing documents
(Note that my simple scraper is also missing some documents at the moment, but it has better coverage of the notubiz API.)
Breda
year | scraped_from_notubiz | scraped_from_ori | in_notubiz_not_in_ori | in_ori_not_in_notubiz |
---|---|---|---|---|
2014 | 0 | 126 | 0 | 126 |
2015 | 0 | 578 | 0 | 578 |
2016 | 3740 | 1000 | 2856 | 116 |
2017 | 1745 | 1000 | 933 | 188 |
2018 | 431 | 243 | 198 | 10 |
2019 | 1915 | 218 | 1698 | 1 |
2020 | 227 | 157 | 72 | 2 |
2021 | 193 | 184 | 10 | 1 |
2022 | 220 | 186 | 37 | 3 |
2023 | 240 | 221 | 22 | 3 |
2024 | 206 | 155 | 56 | 5 |
Waddinxveen
year | scraped_from_notubiz | scraped_from_ori | in_notubiz_not_in_ori | in_ori_not_in_notubiz |
---|---|---|---|---|
2014 | 0 | 964 | 0 | 964 |
2015 | 0 | 1000 | 0 | 1000 |
2016 | 5166 | 1000 | 4391 | 225 |
2017 | 2362 | 0 | 2362 | 0 |
2018 | 1514 | 988 | 601 | 75 |
2019 | 1603 | 993 | 647 | 37 |
2020 | 1593 | 469 | 1125 | 1 |
2021 | 1544 | 695 | 926 | 77 |
2022 | 1636 | 1000 | 662 | 26 |
2023 | 1523 | 1000 | 554 | 31 |
2024 | 1120 | 698 | 459 | 37 |
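The issue does not state how the columns were computed, but the numbers are consistent with set differences over document identifiers (for Breda in 2018, 431 - 198 = 243 - 10 = 233 shared documents). A minimal sketch under that assumption, with a hypothetical `coverage` helper and made-up ids:

```python
def coverage(notubiz_ids: set, ori_ids: set) -> dict:
    """Compare two sets of document ids and return one row of the table."""
    return {
        "scraped_from_notubiz": len(notubiz_ids),
        "scraped_from_ori": len(ori_ids),
        # Documents found via the notubiz API that ORI lacks.
        "in_notubiz_not_in_ori": len(notubiz_ids - ori_ids),
        # Documents present in ORI that the notubiz scrape did not find.
        "in_ori_not_in_notubiz": len(ori_ids - notubiz_ids),
    }


# Made-up ids: 3 and 4 are shared, 1 and 2 only in notubiz, 5 only in ORI.
row = coverage({1, 2, 3, 4}, {3, 4, 5})
print(row)
```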