We currently discard scanned documents that have no embedded digital text. The latest counts from the preprocessing pipeline are:
Missing content for 2366 documents (4.5%):
document_source
fedlex 1016
openparldata 1350
Instead of dropping them, we should attempt to extract the text using OCR.
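
As a rough sketch of what this fallback could look like (not the pipeline's actual code): the `Document` record, its field names, and `fill_missing_content` are all hypothetical, and the OCR engine is passed in as a plain callable so the choice of engine stays open.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Document:
    # Hypothetical record; these field names are assumptions,
    # not the preprocessing pipeline's real schema.
    doc_id: str
    document_source: str
    content: Optional[str]
    pdf_path: Optional[str] = None


def fill_missing_content(
    docs: Iterable[Document],
    ocr: Callable[[str], str],
) -> List[Document]:
    """Return all documents, using `ocr` to fill in missing content.

    `ocr` takes a file path and returns the recognized text. Documents
    that already have text, or that have no file to scan, pass through
    unchanged. Nothing is dropped here.
    """
    result = []
    for doc in docs:
        has_text = bool(doc.content and doc.content.strip())
        if has_text or doc.pdf_path is None:
            result.append(doc)
            continue
        try:
            text = ocr(doc.pdf_path)
        except Exception:
            # A failed OCR run leaves the content missing instead of
            # aborting the whole batch.
            text = ""
        result.append(replace(doc, content=text.strip() or None))
    return result
```

In practice `ocr` could wrap an engine such as Tesseract, but that choice is an assumption rather than something decided here. Documents whose content is still `None` afterwards can be counted the same way as above, to see how much of the 4.5% OCR actually recovers.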