Perform OCR and extract text from Word documents to avoid losing about 4.5% of documents

We currently drop scanned documents that have no embedded digital text. The latest counts from the preprocessing pipeline are:

```
Missing content for 2366 documents (4.5%):
document_source
fedlex          1016
openparldata    1350
```

Instead of dropping them, we should attempt to recover the text: OCR for scanned documents, and direct text extraction for Word documents.
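
A minimal sketch of what such a fallback could look like, not a proposal for the final implementation. It assumes the raw file bytes and a MIME type are available for each document, that scanned documents are PDFs, and that `pdf2image`, `pytesseract`, and `python-docx` (plus the Tesseract and Poppler binaries) are acceptable dependencies. The names `recover_text`, `ocr_pdf`, `docx_text`, and `MIN_CHARS`, as well as the `deu+fra+ita` language setting, are hypothetical and not taken from the existing pipeline.

```python
import io
from typing import Callable, Optional

# Hypothetical threshold: anything shorter is treated as "missing content".
MIN_CHARS = 20
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def ocr_pdf(pdf_bytes: bytes, lang: str = "deu+fra+ita") -> str:
    """OCR a scanned PDF page by page (assumes these Tesseract language packs are installed)."""
    # Imported lazily so the module still loads when the optional OCR stack is absent.
    from pdf2image import convert_from_bytes
    import pytesseract

    pages = convert_from_bytes(pdf_bytes, dpi=300)
    return "\n\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)


def docx_text(docx_bytes: bytes) -> str:
    """Extract paragraph text from a .docx file (tables and headers are ignored here)."""
    from docx import Document

    document = Document(io.BytesIO(docx_bytes))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def recover_text(
    existing_text: Optional[str],
    raw: bytes,
    mime_type: str,
    ocr: Callable[[bytes], str] = ocr_pdf,
    docx: Callable[[bytes], str] = docx_text,
) -> Optional[str]:
    """Return usable text for a document, or None if it still has to be dropped."""
    # Keep text that was already extracted in digital form.
    if existing_text and len(existing_text.strip()) >= MIN_CHARS:
        return existing_text

    if mime_type == "application/pdf":
        extractor = ocr
    elif mime_type == DOCX_MIME:
        extractor = docx
    else:
        return None  # unsupported format: nothing to fall back to

    try:
        text = extractor(raw).strip()
    except Exception:
        # Broad on purpose in this sketch: one corrupt file should not stop the run.
        return None
    return text if len(text) >= MIN_CHARS else None
```

The extractors are passed in as parameters so the fallback logic can be tested without Tesseract installed, and so the OCR engine can be swapped later. Legacy `.doc` files are not covered by `python-docx` and would need a separate converter.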