321 vat number median height #375

Open: wants to merge 2 commits into master

Conversation

Tr4in
Contributor

@Tr4in Tr4in commented Sep 1, 2017

Closes #321

Works fine 👍

@ghost ghost assigned Tr4in Sep 1, 2017
@ghost ghost added the needs review label Sep 1, 2017
@Tr4in Tr4in requested a review from tamacodechi September 1, 2017 16:12
@clemenshelm
Owner

Why did you remove the old spec?

@Tr4in
Contributor Author

Tr4in commented Sep 7, 2017

The old spec depends on the median height, which we no longer need with the new Tesseract version. Tesseract now recognizes IE6388047V as a single word on a Google bill.

@Tr4in
Contributor Author

Tr4in commented Sep 7, 2017

The old Tesseract recognized IE, then Rechnung, and then 6388047V, which isn't a valid VAT number! That is why we used the median height to remove Rechnung.

But now it recognizes IE6388047V as a single word!
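
For readers who haven't seen the hack, here is a minimal sketch of the idea in Python. It is not the project's actual code: the `Word` type, the pixel heights, the 25% tolerance, and the simplified `IE` pattern are all assumptions made for illustration. The only detail taken from this thread is the word sequence `IE`, `Rechnung`, `6388047V`.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class Word:
    text: str
    height: float  # bounding-box height in pixels (hypothetical values below)


# Simplified pattern for the ID quoted in this thread; not a full VAT validator.
VAT_PATTERN = re.compile(r"^IE\d{7}[A-Z]$")


def drop_height_outliers(words, tolerance=0.25):
    """Keep only words whose height is within `tolerance` of the median height."""
    if not words:
        return []
    mid = median(w.height for w in words)
    return [w for w in words if abs(w.height - mid) <= tolerance * mid]


def find_vat_id(words):
    """Return the first single word, or pair of adjacent words, that looks like a VAT ID."""
    for word in words:
        if VAT_PATTERN.match(word.text):
            return word.text
    for left, right in zip(words, words[1:]):
        candidate = left.text + right.text
        if VAT_PATTERN.match(candidate):
            return candidate
    return None


# Old OCR output: a taller "Rechnung" sits between the two halves of the ID.
old_output = [Word("IE", 10), Word("Rechnung", 18), Word("6388047V", 10)]
# New OCR output: the ID arrives as one word, so no filtering is needed.
new_output = [Word("IE6388047V", 10)]

print(find_vat_id(old_output))                        # no match without the filter
print(find_vat_id(drop_height_outliers(old_output)))  # the two halves become adjacent
print(find_vat_id(new_output))                        # matches directly
```

With the new output the filtering step is never needed, which is why the spec that exercised it no longer applies.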

@tamacodechi
Contributor

If I understand correctly, then, you didn't really remove the spec but updated it with data from the newest version of Tesseract?

@Tr4in
Contributor Author

Tr4in commented Sep 11, 2017

right

Contributor

@tamacodechi tamacodechi left a comment

This looks very promising! 👍
If I understand correctly, we no longer need the median height hack, since the new version of Tesseract picks up the VAT ID as a single word. Why did you test it with a different file, though? Can you regenerate the data from the old file? That way we can know for sure that this works for all files that required this quirky hack.

@Tr4in
Contributor Author

Tr4in commented Sep 14, 2017

On the old bill wrj8fiNZQYjymoocT.pdf, the number after IE does not get recognized:
[screenshot: selection_014]

I can't update the old test because this number is missing (it didn't get recognized).

@tamacodechi
Contributor

Oh wow, that's so annoying! Have you checked other Google bills to see if this is a common occurrence? Let me know if you need more bill IDs!

@Tr4in
Contributor Author

Tr4in commented Sep 20, 2017

If you take a look at the list, you will see that the number gets recognized on all the checked ones (I've tested every bill ID).

3 participants