Needs more testing, converting to draft for now.
At the time of generation of the section, the {gt,ocr}_words generators
were drained. Fix by using a list.
Fixes gh-124.
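
The failure mode described in this commit message can be reproduced in isolation. This is only a minimal sketch: `words` is a hypothetical stand-in for the real word generators, not dinglehopper's actual code. It shows that a generator yields nothing on a second pass, while a list can be iterated repeatedly.

```python
def words(text):
    """Hypothetical stand-in: lazily yield whitespace-separated tokens."""
    for token in text.split():
        yield token


# A generator can only be consumed once.
gt_words = words("the quick brown fox")
first_pass = list(gt_words)   # consumes every token
second_pass = list(gt_words)  # already drained, so this is empty

# Materializing the generator as a list allows repeated iteration.
gt_words_list = list(words("the quick brown fox"))
first_from_list = list(gt_words_list)
second_from_list = list(gt_words_list)

print(first_pass, second_pass)
print(first_from_list == second_from_list)
```

The trade-off is memory: the list holds all words at once, which is presumably acceptable when the words are needed more than once anyway.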
9405df7 to a70260c
I've added a checklist above to go through the various CLIs and test them. Because this also adds support for specifying a plain text encoding, I've added that to the checklist as well.
Ha, |
Fixed in 14a4bc5.
A manual test of ocrd-dinglehopper shows that it also correctly warns when autodetecting the plain text encoding, and that it offers the option to give an explicit encoding. I don't see how to put the information about the plain text encoding into the METS file; that could be a later improvement. Maybe @bertsky has an idea? (I see comparing to txt GT as useful in some cases, e.g. when working with corpora where only the text is available but no PAGE/ALTO.)
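
The behaviour described in this comment (use an explicit encoding if given, otherwise autodetect and warn) could be sketched roughly as follows. Everything here is an assumption for illustration: `decode_plain_text` is a hypothetical helper, and the fallback heuristic (try UTF-8, then Latin-1) is a crude stand-in, not the detection dinglehopper actually uses.

```python
import warnings


def decode_plain_text(data, encoding=None):
    """Hypothetical helper: decode bytes with an explicit encoding, or guess."""
    if encoding is not None:
        # The caller knows the encoding, so no guessing and no warning.
        return data.decode(encoding)

    warnings.warn("No encoding given, autodetecting the plain text encoding")
    try:
        # "utf-8-sig" accepts UTF-8 with or without a leading BOM.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every byte, so this never fails.
        return data.decode("latin-1")


explicit = decode_plain_text("Schön".encode("utf-8"), encoding="utf-8")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    guessed = decode_plain_text("Schön".encode("latin-1"))

print(explicit, guessed, len(caught))
```

Warning on autodetection keeps the guess visible to the user, since a wrong guess would silently distort the evaluation result.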
The help text of |
I've added a test for plain text files with BOM.
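
For readers unfamiliar with the BOM issue, here is a sketch of what such a test might check. It is not the test added in this PR. It uses only the Python standard library and a temporary file, and the sample text is made up.

```python
import codecs
import os
import tempfile

# Write a UTF-8 file that starts with a byte order mark (BOM).
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(codecs.BOM_UTF8 + "AAAAA".encode("utf-8"))
    path = f.name

try:
    # Plain "utf-8" keeps the BOM as an invisible U+FEFF character.
    with open(path, encoding="utf-8") as f:
        text_with_bom = f.read()

    # "utf-8-sig" strips a leading BOM if one is present.
    with open(path, encoding="utf-8-sig") as f:
        text_without_bom = f.read()
finally:
    os.remove(path)

print(repr(text_with_bom), repr(text_without_bom))
```

A leftover U+FEFF would count as an extra character in a character-level comparison, which is why it matters for an evaluation tool.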
This adds more flexibility w.r.t. evaluating directories of line texts.
Test dinglehopper
Test dinglehopper-line-dirs
Test dinglehopper-extract
Test dinglehopper-summarize
Test ocrd-dinglehopper
Update docs w.r.t. this feature
  dinglehopper-line-dirs --help
  README.md
Review Unexpected UTF-8 problems #123