
OCR/Document Understanding Experiments for a CDL Project with Sächsische Jugendstiftung


CorrelAid/sj_du


Extracting Wages

There are multiple data quality issues:

  • handwritten digits are not inside of the fields
  • people make annotations such as "per hour"
  • different pages are used, resulting in different colours of writing and background
  • corrections in a different colour
  • scan quality is not optimal

VLMs (multimodal language models) or similar models might work, but:

  • we cannot reliably evaluate performance, as this year's wage input fields do not have separate per-digit input fields
  • to improve performance, fine-tuning/training would make sense, but the result might only be valid for, or be sensitive to, the separate per-digit input fields

The problem of the data differences could be mitigated by segmenting the images and extracting only the handwritten text. However, this is where the data quality issues complicate things again and make the problem non-trivial to solve.

Task description

Pipeline description

Given a folder of PDFs in an S3 bucket, the pipeline:

  1. Converts each PDF to an image, resizes it, and moves the result to a separate bucket
  2. Assigns each converted image an ID, extracts the segments corresponding to form fields as arrays, and saves them in a database that also contains the URL of the converted image
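The second step can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual code: the `FieldBox` class, the relative coordinates, the `segments` table layout, and the `s3://converted-bucket/example.png` URL are all assumptions made for illustration, and the page is a toy nested list rather than a real converted image.

```python
import json
import sqlite3
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldBox:
    """Hypothetical form-field region in relative page coordinates (0..1)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float


def crop(page, box):
    """Return the pixel rows of `page` that fall inside `box`."""
    height, width = len(page), len(page[0])
    y0, y1 = int(box.top * height), int(box.bottom * height)
    x0, x1 = int(box.left * width), int(box.right * width)
    return [row[x0:x1] for row in page[y0:y1]]


def store_segments(conn, image_url, page, boxes):
    """Assign an id to the image and save one row per field segment."""
    image_id = str(uuid.uuid4())
    conn.execute(
        "CREATE TABLE IF NOT EXISTS segments "
        "(image_id TEXT, image_url TEXT, field TEXT, pixels TEXT)"
    )
    for box in boxes:
        conn.execute(
            "INSERT INTO segments VALUES (?, ?, ?, ?)",
            (image_id, image_url, box.name, json.dumps(crop(page, box))),
        )
    conn.commit()
    return image_id


# Toy 4x4 grayscale "page"; a real page would come from the converted image.
page = [[r * 4 + c for c in range(4)] for r in range(4)]
# Assumed position of the wage field: upper right quarter of the page.
boxes = [FieldBox("wage", 0.5, 0.0, 1.0, 0.5)]
conn = sqlite3.connect(":memory:")
image_id = store_segments(conn, "s3://converted-bucket/example.png", page, boxes)
```

Relative coordinates are used here so that the same field definitions survive the resizing in step 1; this only works if the forms are positioned consistently on every scan.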

Justification for using big multimodal LLMs

  • Insufficient training/finetuning data :S

Ideas to improve accuracy

  • Images should be positioned the exact same way for manual segmentation

  • Use the original writing, not the carbon copy ("Durchschreibsatz"), where the handwriting is too thin

  • Improve scan quality (remove smudges/unsharp areas) and avoid compression artifacts (image has high resolution but bad quality)

  • Specify date format for abweichendes Datum

  • Fill out more fields with printed text in advance, e.g. abweichendes Datum, Schule and Teilnehmer:innen

  • Ask personnel to highlight that text should be written in (legible) upper case and inside of the form fields -> also ask schools to verify this after collecting forms

  • Make sure that carbon copies are positioned well enough so that writing appears in the correct places

  • Address consolidation

Open Questions

  • Why does 560281 contain the logo? Are some companies using stamps? -> Pls do not
  • What if a form is invalid (e.g. some mandatory fields are not filled)? -> When is a form invalid? (e.g. the "payer equals organization" checkbox is checked but there is payer data)
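One way to make the validity question concrete is to write candidate rules down as code. The sketch below is only a starting point for discussion: the `Form` class, the field names, and the lists of mandatory and payer fields are hypothetical, since the actual rules are still open.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    """Hypothetical extracted form; field names are illustrative only."""
    fields: dict = field(default_factory=dict)
    payer_equals_organization: bool = False


# Assumed set of mandatory fields; the real list is still an open question.
MANDATORY = ("organization", "wage")
# Assumed names of the fields that hold payer data.
PAYER_FIELDS = ("payer_name", "payer_address")


def validation_problems(form):
    """Return a list of human-readable problems; empty means no rule fired."""
    problems = []
    for name in MANDATORY:
        if not form.fields.get(name, "").strip():
            problems.append(f"missing mandatory field: {name}")
    if form.payer_equals_organization:
        filled = [n for n in PAYER_FIELDS if form.fields.get(n, "").strip()]
        if filled:
            problems.append("payer data given although payer equals organization")
    return problems
```

Returning a list of problems instead of a single boolean keeps the decision of what to do with a questionable form (reject, flag for manual review) separate from the rules themselves.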

TODO:

Wages

  • Object detection
