
chan27-2/hack_your_way


Our solution consists of three parts: first, extracting the text of regional-language PDFs (which are scanned images) into regional-language text files; second, translating that regional-language text into English; and third, parsing the translated text to get relations.

For the first part, we use Tesseract OCR to scan each PDF and convert its pages to text. Tesseract supports extraction in many languages, including Bengali, Hindi, Telugu, and Tamil. To enable a regional language, we added the .traineddata file for that language. We added trained data for Hindi and Bengali, since English is the default.

For the second part, once we have the Bengali or Hindi text from a PDF file, we translate it to English using googletrans.

For the third part, the English text is readable but comes out in random order, so we wrote a parsing algorithm that extracts the correct relations from that unordered text file.

To cover all the files for Chandigarh, we created a Python script that extracts all the PDF files and runs the code above on them to get all the relations in the state.
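The OCR and translation steps described above could look roughly like the sketch below. It is not the repository's actual code: the function names, the `LANG_CODES` table, the use of `pdf2image` to render pages, the 300 dpi setting, and the assumption that the PDFs already sit in a local folder are all illustrative. It also assumes the synchronous `Translator` API of googletrans 3.x (newer releases may differ) and that the matching Tesseract .traineddata files are installed. The relation-parsing step is left out because its rules are not described here.

```python
from pathlib import Path

# Language name -> (Tesseract traineddata name, Google Translate code).
LANG_CODES = {
    "bengali": ("ben", "bn"),
    "hindi": ("hin", "hi"),
    "telugu": ("tel", "te"),
    "tamil": ("tam", "ta"),
}


def ocr_pdf(pdf_path, language):
    """OCR every page of a scanned PDF; returns one text string per page."""
    # Imported lazily so the other helpers work without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    tess_lang, _ = LANG_CODES[language.lower()]
    pages = convert_from_path(str(pdf_path), dpi=300)  # dpi is an assumed setting
    return [pytesseract.image_to_string(page, lang=tess_lang) for page in pages]


def translate_to_english(text, language, translate=None):
    """Translate non-empty lines one by one and join the results.

    `translate` is an optional callable (str -> str), handy for testing;
    by default googletrans is used (3.x synchronous API assumed).
    """
    _, src = LANG_CODES[language.lower()]
    if translate is None:
        from googletrans import Translator

        translator = Translator()

        def translate(chunk):
            return translator.translate(chunk, src=src, dest="en").text

    lines = [line for line in text.splitlines() if line.strip()]
    return "\n".join(translate(line) for line in lines)


def find_pdfs(root):
    """Collect every PDF below `root` (assumes the files are stored locally)."""
    return sorted(Path(root).rglob("*.pdf"))


def process_folder(root, language):
    """Run OCR and translation on each PDF; yields (path, english_text)."""
    for pdf_path in find_pdfs(root):
        regional_text = "\n".join(ocr_pdf(pdf_path, language))
        yield pdf_path, translate_to_english(regional_text, language)
```

A call such as `for path, text in process_folder("pdfs/", "hindi"): ...` would then hand each translated text to the parsing step; the folder name `pdfs/` is a placeholder.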
