GitHub - chan27-2/hack_your_way

Folders and files (15 commits)

.idea
.vscode
__pycache__
bits
a.py
app.py
ben.traineddata
bengali1.pdf
bengali[12-21].pdf
c1.pdf
c1_removed.pdf
d
hin.traineddata
input.txt
metadata.pdf
out.txt
out_text.txt
parse
parsing.cpp
readme.txt
relations.txt
script.sh
test.cpp
test.py

Our solution consists of three parts. First, we extract the data in a PDF (scanned images of regional-language pages) into a regional-language text file. Second, we translate the regional-language text into English. Third, we parse the translated text to get the relations.

For the first part, we use Tesseract OCR to scan a PDF and convert its pages to text. Tesseract supports extraction in many languages, such as Bengali, Hindi, Telugu and Tamil. To use a regional language, we add the .traineddata file for that language. We have added trained data for Hindi and Bengali; English is available by default.

For the second part, once we have the Bengali or Hindi text from a PDF file, we translate it to English using googletrans.

For the third part, the translated English text is readable but its contents appear in random order, so we have written a parsing algorithm that extracts the correct relations from that randomly ordered text file.

To process all the files for Chandigarh state, we created a Python script that collects all the PDF files, and then ran the code above on them to get all the relations in the state.
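The first two parts (OCR, then translation) can be sketched roughly as below. This is a minimal illustration, not the code in app.py: every function name here is hypothetical, the use of pdf2image to render pages and of the pytesseract wrapper is an assumption (the text above only names Tesseract and googletrans), and the chunk size of 4000 characters is an arbitrary choice. The googletrans call assumes a release whose translate() is synchronous; some newer releases expose it as a coroutine instead.

```python
from typing import Callable, List

# Tesseract language codes (matching ben.traineddata and hin.traineddata)
# mapped to the language codes used by Google Translate.
TESSERACT_TO_TRANSLATE = {"ben": "bn", "hin": "hi"}


def render_pdf_pages(pdf_path: str, dpi: int = 300) -> list:
    """Render each page of a scanned PDF to an image (assumes pdf2image + poppler)."""
    from pdf2image import convert_from_path  # lazy import: optional dependency
    return convert_from_path(pdf_path, dpi=dpi)


def recognize_page(image, lang: str) -> str:
    """OCR one page image with Tesseract (assumes pytesseract and <lang>.traineddata)."""
    import pytesseract  # lazy import: optional dependency
    return pytesseract.image_to_string(image, lang=lang)


def translate_chunk(text: str, src: str) -> str:
    """Translate one chunk to English (assumes a synchronous googletrans release)."""
    from googletrans import Translator  # lazy import: optional dependency
    return Translator().translate(text, src=src, dest="en").text


def split_text(text: str, limit: int = 4000) -> List[str]:
    """Split text on line boundaries into chunks of about `limit` characters.

    A single line longer than `limit` is kept whole rather than cut mid-line.
    """
    chunks: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def pdf_to_english(
    pdf_path: str,
    lang: str = "ben",
    render: Callable = render_pdf_pages,
    recognize: Callable = recognize_page,
    translate: Callable = translate_chunk,
) -> str:
    """OCR a scanned regional-language PDF and return its English translation."""
    pages = render(pdf_path)
    regional = "\n".join(recognize(page, lang) for page in pages)
    src = TESSERACT_TO_TRANSLATE[lang]
    # Translate chunk by chunk so that no single request is too long.
    return "\n".join(translate(chunk, src) for chunk in split_text(regional))
```

Under these assumptions, a call such as pdf_to_english("bengali1.pdf", lang="ben") would return the English text that the parsing step then consumes. Tesseract has to be able to find the .traineddata files, for example through the TESSDATA_PREFIX environment variable. The render, recognize and translate parameters exist only so that each step can be swapped out or tested without the external tools installed.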

chan27-2/hack_your_way
