There are multiple data quality issues:
- handwritten digits are not inside of the fields
- people make annotations such as "per hour"
- different pages used, i.e. different colours of writing and background
- corrections in a different colour
- scan quality is not optimal
VLMs (multimodal language models) or similar models might work, but
- we cannot reliably evaluate performance, as this year's wage input fields do not have a separate input box for each digit
- fine-tuning/training would make sense to improve performance, but the result might only be valid for, or be sensitive to, the digit-separated input fields
The differences in the data could be mitigated by segmenting the images and extracting only the handwritten text; however, this is where the data quality issues complicate things again and make the problem non-trivial to solve.
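As a rough illustration of "extracting only the handwritten text", here is a minimal sketch that keeps only dark pixels and whitens everything else. It assumes pen ink is clearly darker than the form background. Plain nested lists stand in for image arrays to keep the sketch dependency-free, and `keep_dark_ink` and `ink_threshold` are hypothetical names and values, not part of the project.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0..255
Image = List[List[Pixel]]      # rows of pixels

WHITE: Pixel = (255, 255, 255)


def luminance(pixel: Pixel) -> float:
    """Approximate perceived brightness using the ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def keep_dark_ink(image: Image, ink_threshold: float = 100.0) -> Image:
    """Replace every pixel at least as bright as ink_threshold with white.

    ink_threshold is an assumed tuning value, not taken from the project.
    """
    return [
        [px if luminance(px) < ink_threshold else WHITE for px in row]
        for row in image
    ]


# Tiny 2x3 example: a yellow form background with two dark-blue ink pixels.
yellow, ink = (250, 240, 150), (20, 20, 90)
page = [[yellow, ink, yellow],
        [yellow, yellow, ink]]
cleaned = keep_dark_ink(page)
```

A single brightness threshold like this is exactly where the data quality issues bite: a correction in a lighter colour (for example a light red pen) can end up above the threshold and be erased together with the background.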
Given a folder of PDFs in an S3 bucket, the pipeline:
- converts each PDF to an image, resizes it, and moves it to a separate bucket
- gives each converted image an ID, extracts the segments corresponding to form fields as arrays, and saves them in a database that also contains the URL of the converted image
- Insufficient training/finetuning data :S
- Images should be positioned in exactly the same way for manual segmentation
- Use the original, not the carbon copy ("Durchschreibsatz") -> handwriting on the copy is too thin
- Improve scan quality (remove smudges/blurry areas) and avoid compression artifacts (image has high resolution but bad quality)
- Specify the date format for "abweichendes Datum" (deviating date)
- Fill out more fields with printed text in advance, e.g. "abweichendes Datum" (deviating date), "Schule" (school) and "Teilnehmer:innen" (participants)
- Ask personnel to highlight that text should be written in (legible) upper case and inside the form fields -> also ask schools to verify this after collecting the forms
- Make sure that carbon copies are positioned well enough that the writing appears in the correct places
- Address consolidation ("Adresskonsolidierung")
- Why does 560281 contain the logo? Are some companies using stamps? -> Pls do not
- What if a form is invalid (e.g. some mandatory fields are not filled in)? -> When is a form invalid? (e.g. the "payer equals organization" checkbox is checked but there is payer data)
- Object detection
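
The pipeline step that extracts form-field segments as arrays, and the requirement that pages be positioned identically for manual segmentation, can be illustrated with a minimal sketch. It assumes fixed pixel boxes per field on the resized page. The field names, box coordinates, image ID, and URL below are all hypothetical, and nested lists stand in for image arrays.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]           # grayscale rows, values 0..255
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive

# Hypothetical field layout in pixel coordinates of the resized page.
FIELD_BOXES: Dict[str, Box] = {
    "wage": (1, 1, 3, 4),
    "date": (3, 0, 4, 2),
}


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangle described by box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def extract_segments(image_id: str, image_url: str, image: Image,
                     boxes: Dict[str, Box]) -> List[dict]:
    """Build one record per form field, ready to be stored in a database."""
    return [
        {"image_id": image_id, "image_url": image_url,
         "field": name, "pixels": crop(image, box)}
        for name, box in boxes.items()
    ]


# 4x5 dummy page whose pixel value encodes its position (row * 10 + column).
page = [[r * 10 + c for c in range(5)] for r in range(4)]
records = extract_segments(
    "img-0001", "s3://converted-bucket/img-0001.png", page, FIELD_BOXES
)
```

Because the boxes are fixed, any shift or rotation between scans moves the handwriting out of its crop, which is why consistent page positioning matters for this approach.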