chan27-2/hack_your_way
Our solution consists of three parts. First, we extract text from PDF files whose pages are scanned images in a regional language. Second, we translate the regional-language text to English. Third, we parse the translated text to extract relations.

For the first part, we use Tesseract OCR to scan a PDF and convert its pages to text. Tesseract supports many languages, including Bengali, Hindi, Telugu, and Tamil; to enable a language, we add the .traineddata file for it. We added trained data for Hindi and Bengali, since English is the default. For the second part, once we have the Bengali or Hindi text from a PDF file, we translate it to English using googletrans. The resulting English text is readable, but its order is scrambled, so for the third part we wrote a parsing algorithm that extracts the correct relations from the unordered text file. To cover all the files of Chandigarh state, we created a Python script that processes all the PDF files with the code above to get all the relations in the state.
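The OCR step can be sketched roughly as follows. This is a minimal illustration that assumes the `tesseract` command-line tool is installed and that each PDF page has already been saved as an image. It is not the repository's actual code. The function names (`missing_traineddata`, `build_tesseract_command`, `ocr_page`) and the file paths are hypothetical; `hin`, `ben`, and `eng` are the Tesseract language codes for Hindi, Bengali, and English.

```python
import shutil
import subprocess
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes whose .traineddata file is absent."""
    root = Path(tessdata_dir)
    return [code for code in langs if not (root / f"{code}.traineddata").is_file()]


def build_tesseract_command(image_path, output_base, langs):
    """Build the CLI call: tesseract <image> <output base> -l <lang+lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", "+".join(langs)]


def ocr_page(image_path, output_base, langs, tessdata_dir):
    """Run Tesseract on one page image and return the recognised text.

    Hypothetical helper: assumes the page image already exists on disk.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    missing = missing_traineddata(tessdata_dir, langs)
    if missing:
        raise FileNotFoundError(f"missing traineddata for: {', '.join(missing)}")
    cmd = build_tesseract_command(image_path, output_base, langs)
    cmd += ["--tessdata-dir", str(tessdata_dir)]
    subprocess.run(cmd, check=True)
    # Tesseract appends .txt to the output base for plain-text output.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Checking for the `.traineddata` files before invoking Tesseract gives a clearer error when a regional language has not been added. The text returned by `ocr_page` would then go to the translation step.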