read-presonal-data

An experimental application that reads PDF and JPG/PNG files and detects personal information and personal photos.

Text is extracted from PDF files with the pymupdf library. The pages of a PDF are processed sequentially. Named Entity Recognition (NER) using spaCy and Presidio is applied to the full text of each page. Only English is supported. Supported entities are described here. Detected entities and personal information are then visualized in the Streamlit app.
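The page-by-page flow described above could be sketched roughly as follows. This is a minimal illustration, not the code in streamlit_app.py: the `Finding` dataclass and the functions `findings_from_results` and `analyze_pdf` are hypothetical names, and the sketch assumes Presidio's default `AnalyzerEngine` configuration with an installed English spaCy model. The third-party imports are kept inside `analyze_pdf` so the helper can be tried without them.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Finding:
    """One detected entity (hypothetical record type for this sketch)."""
    page: int
    entity_type: str
    text: str
    score: float


def findings_from_results(page_number: int, page_text: str, results: Iterable) -> List[Finding]:
    """Turn analyzer results (objects with entity_type/start/end/score) into Findings."""
    return [
        Finding(page_number, r.entity_type, page_text[r.start:r.end], r.score)
        for r in results
    ]


def analyze_pdf(path: str) -> List[Finding]:
    """Run NER on the full text of each page, one page at a time."""
    import fitz  # PyMuPDF
    from presidio_analyzer import AnalyzerEngine

    analyzer = AnalyzerEngine()  # assumption: default spaCy-backed configuration
    findings: List[Finding] = []
    with fitz.open(path) as doc:
        for page_number, page in enumerate(doc, start=1):
            text = page.get_text()  # plain text of the whole page
            results = analyzer.analyze(text=text, language="en")
            findings.extend(findings_from_results(page_number, text, results))
    return findings
```

Calling `analyze_pdf("data/example.pdf")` (a placeholder path) would return one `Finding` per detected entity, which an app could then group by page for display.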

Image files, and any images extracted from PDF files, are scanned for faces using OpenCV and its haarcascade_frontalface detector. OCR with entity recognition is applied to the text in each image using pytesseract and Presidio. Detected faces and personal text data are highlighted with bounding boxes in the Streamlit app.
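One way to get bounding boxes for personal text is to keep track of which OCR word covers which character range, then merge the boxes of the words that an entity overlaps. The sketch below illustrates that idea only. The function names are hypothetical, the OCR words are joined with single spaces (an assumption of this sketch), and the app itself may obtain its boxes differently.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom
Span = Tuple[int, int, Box]      # char start, char end, word box


def words_with_offsets(ocr: Dict[str, list]) -> Tuple[str, List[Span]]:
    """Join OCR words into one string and record each word's char span and box."""
    parts: List[str] = []
    spans: List[Span] = []
    pos = 0
    for word, left, top, width, height in zip(
        ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
    ):
        word = word.strip()
        if not word:
            continue  # Tesseract emits empty entries for layout structure
        spans.append((pos, pos + len(word), (left, top, left + width, top + height)))
        parts.append(word)
        pos += len(word) + 1  # +1 for the joining space
    return " ".join(parts), spans


def entity_box(start: int, end: int, spans: List[Span]) -> Optional[Box]:
    """Union of the boxes of all words overlapping the range [start, end)."""
    boxes = [box for s, e, box in spans if s < end and e > start]
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def personal_text_boxes(image_path: str) -> List[Tuple[str, Optional[Box]]]:
    """OCR an image, run Presidio on the text, and return (entity type, box) pairs."""
    import pytesseract
    from PIL import Image
    from presidio_analyzer import AnalyzerEngine

    ocr = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    text, spans = words_with_offsets(ocr)
    results = AnalyzerEngine().analyze(text=text, language="en")
    return [(r.entity_type, entity_box(r.start, r.end, spans)) for r in results]
```

An entity that spans several lines of text would get one large box under this scheme; splitting it per line is left out to keep the sketch short.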

Goal

Extract text and identify faces in documents. Given a document of type PDF, PNG or JPG, the program should:

Extract all the text present in the document
Classify if the document is Personal or Non-Personal
Identify if the document contains any face in it
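The three goals above can be combined into a single per-document report. The sketch below is an assumption-laden illustration: `DocumentReport`, `is_personal`, `build_report` and `detect_faces` are hypothetical names, and the decision rule (a document is Personal if it contains a face or any entity scoring at least `min_score`) is a stand-in, since the source does not state how the classification is made. The cascade path matches the file downloaded in the Setup section.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass
class DocumentReport:
    text: str          # goal 1: all extracted text
    is_personal: bool  # goal 2: Personal vs Non-Personal
    has_face: bool     # goal 3: at least one face found


def is_personal(entity_scores: Iterable[float], face_count: int, min_score: float = 0.5) -> bool:
    """Assumed rule: any face, or any entity at or above min_score, marks the document Personal."""
    return face_count > 0 or any(score >= min_score for score in entity_scores)


def build_report(text: str, entity_scores: Sequence[float], faces: Sequence) -> DocumentReport:
    """Combine extracted text, entity confidence scores and face boxes into one report."""
    return DocumentReport(
        text=text,
        is_personal=is_personal(entity_scores, len(faces)),
        has_face=len(faces) > 0,
    )


def detect_faces(
    image_path: str,
    cascade_path: str = "require/haarcascade_frontalface_default.xml",
) -> List[Tuple[int, ...]]:
    """Return (x, y, w, h) boxes found by OpenCV's frontal-face Haar cascade."""
    import cv2

    detector = cv2.CascadeClassifier(cascade_path)
    if detector.empty():
        raise FileNotFoundError(f"could not load cascade file: {cascade_path}")
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"could not read image: {image_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # scaleFactor and minNeighbors are common starting values, not tuned settings
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [tuple(int(v) for v in face) for face in faces]
```

The threshold of 0.5 is arbitrary; raising it reduces false positives from low-confidence entities at the cost of missing some personal data.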

Setup

Create the conda environment from env.yml, download the spaCy model and the Haar cascade file, then start the app:

conda env create -f env.yml
python -m spacy download en_core_web_lg
mkdir require
curl -o require/haarcascade_frontalface_default.xml https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalface_default.xml
streamlit run streamlit_app.py
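Before starting the app, it can help to confirm that the two downloads above succeeded. The following check is a small optional sketch using only the standard library. The function name `missing_requirements` is hypothetical, and the check relies on the spaCy model being installed as an importable Python package, which is how `spacy download` installs it.

```python
import importlib.util
from pathlib import Path
from typing import List


def missing_requirements(
    root: str = ".",
    model: str = "en_core_web_lg",
    cascade: str = "require/haarcascade_frontalface_default.xml",
) -> List[str]:
    """Return a description of each setup artifact that cannot be found."""
    missing: List[str] = []
    # spaCy models are regular packages, so find_spec can locate them by name
    if importlib.util.find_spec(model) is None:
        missing.append(f"spaCy model {model}")
    if not (Path(root) / cascade).is_file():
        missing.append(f"cascade file {cascade}")
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    print("setup looks complete" if not problems else "missing: " + ", ".join(problems))
```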
