A comprehensive platform for end-to-end evaluation of PII detection and anonymization in unstructured text documents using various state-of-the-art (SOTA) models.
| Home Page | Results View |
|---|---|
| ![Home Page]() | ![Results View]() |

This application provides an interactive interface for detecting PII entities in text with features including:
- Support for multiple NER engines (SpaCy, HuggingFace Transformers, GLiNER)
- Multiple anonymization methods (redaction, replacement, masking, encryption)
- Synthetic data generation with OpenAI
- Model evaluation capabilities
- Entity-type performance metrics
- Customizable detection settings

To install with Poetry:

- Clone this repository
- Create a virtual environment
- Install dependencies with Poetry and run the application:

      # Install Poetry if needed
      pip install poetry
      # Install dependencies
      poetry install
      # Run the application
      poetry run streamlit run app.py

To install with pip instead:

- Clone this repository
- Install dependencies and run the application:

      pip install -r requirements.txt
      # Run the application
      streamlit run app.py

To detect PII in text:

- Enter or upload text in the input panel
- Configure model and detection settings in the sidebar
- Click "Analyze Text" to detect PII entities
- View detection results and anonymized output

To evaluate a model:

- Navigate to the "Model Evaluation" tab
- Upload a CSV file with labeled data
- Configure evaluation settings
- Click "Evaluate Model" to assess model performance
- View evaluation metrics and entity-type performance
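
The entity-type metrics from the evaluation steps above can be illustrated with a short sketch. It assumes exact span matching and the JSON annotation format this README describes for the `label` column (`start`, `end`, `text`, `labels`). The function names `to_spans` and `evaluate` are hypothetical and are not taken from the application's code, which may match spans differently (for example, with partial overlap).

```python
import json
from collections import Counter


def to_spans(annotations):
    """Convert annotation dicts into a set of (start, end, label) tuples."""
    return {(a["start"], a["end"], label)
            for a in annotations
            for label in a["labels"]}


def evaluate(rows):
    """Compute per-label precision, recall, and F1 with exact span matching.

    rows: iterable of (gold_json, predicted_spans) pairs, where gold_json is
    the JSON string from the label column and predicted_spans is a list of
    (start, end, label) tuples produced by some detector (assumed here).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_json, predicted in rows:
        gold = to_spans(json.loads(gold_json))
        pred = set(predicted)
        for _, _, label in gold & pred:   # found and correct
            tp[label] += 1
        for _, _, label in pred - gold:   # found but not annotated
            fp[label] += 1
        for _, _, label in gold - pred:   # annotated but missed
            fn[label] += 1

    metrics = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        found = tp[label] + fp[label]
        expected = tp[label] + fn[label]
        p = tp[label] / found if found else 0.0
        r = tp[label] / expected if expected else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics[label] = {"precision": p, "recall": r, "f1": f1}
    return metrics


# Text being annotated: "John Smith wrote to john@example.com"
gold = json.dumps([
    {"start": 0, "end": 10, "text": "John Smith", "labels": ["name"]},
    {"start": 20, "end": 36, "text": "john@example.com", "labels": ["email"]},
])
predicted = [(0, 10, "name"), (40, 45, "email")]  # one hit, one wrong span
result = evaluate([(gold, predicted)])
print(result)
```

Exact matching is the strictest choice: a prediction that is off by one character counts as both a false positive and a false negative, which is why the `email` label scores zero in this example.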

Supported NER engines:

- SpaCy: Fast and efficient NER models for production use
- HuggingFace Transformers: Deep learning models with higher accuracy
- GLiNER: Generalist model for zero-shot named entity recognition

Anonymization methods:

- Redact: Remove PII entities completely
- Replace: Replace with entity type (e.g., `<PERSON>`)
- Mask: Replace characters with a mask character (e.g., `*******`)
- Hash: Replace with a hash of the text
- Encrypt: Encrypt the text (reversible)
- Highlight: View original text with highlighted entities
- Synthesize: Replace with realistic fake values using OpenAI
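
As a rough illustration of how the Redact, Replace, Mask, and Hash methods differ, the sketch below applies each one to a list of detected spans using only the standard library. The `Entity` tuple and the `anonymize` function are hypothetical names for this example; the application itself relies on presidio-anonymizer, whose operators may behave differently. Encrypt and Synthesize are left out because they need a key or an external API.

```python
import hashlib
from typing import NamedTuple


class Entity(NamedTuple):
    """A detected PII span (hypothetical structure); `end` is exclusive."""
    start: int
    end: int
    label: str


def anonymize(text: str, entities: list[Entity], method: str = "replace",
              mask_char: str = "*") -> str:
    """Apply one method to every span; spans are assumed not to overlap."""
    out = text
    # Work right to left so earlier offsets stay valid after each substitution.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        original = out[ent.start:ent.end]
        if method == "redact":
            new = ""                                  # remove the entity
        elif method == "replace":
            new = f"<{ent.label}>"                    # entity type placeholder
        elif method == "mask":
            new = mask_char * len(original)           # same length, masked
        elif method == "hash":
            new = hashlib.sha256(original.encode("utf-8")).hexdigest()
        else:
            raise ValueError(f"unsupported method: {method}")
        out = out[:ent.start] + new + out[ent.end:]
    return out


sample = "Contact John Smith at john@example.com."
found = [Entity(8, 18, "PERSON"), Entity(22, 38, "EMAIL_ADDRESS")]
for name in ("redact", "replace", "mask"):
    print(name, "->", anonymize(sample, found, name))
```

Processing spans from the end of the string backwards avoids recomputing offsets when a replacement changes the text length.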

Detection settings:

- Allow/Deny Lists: Customize detection with word lists
- Custom Regex Patterns: Define your own entity patterns
- Overlap Handling: Configurable handling of overlapping entities
- Entity Selection: Choose which entity types to detect
- Decision Process: View reasoning behind detection decisions
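
The interaction between custom regex patterns, allow/deny lists, and overlap handling can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the application's implementation: the pattern names, scores, the `DENY_LIST` label, and the "highest score wins, then longest span" overlap rule are all hypothetical choices made for the example.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    """A detected span (hypothetical structure); `end` is exclusive."""
    start: int
    end: int
    label: str
    score: float


# Hypothetical custom patterns: entity label -> (compiled regex, confidence).
CUSTOM_PATTERNS = {
    "EMPLOYEE_ID": (re.compile(r"\bEMP-\d{5}\b"), 0.9),
    "NUMBER": (re.compile(r"\b\d{5}\b"), 0.4),
}


def resolve_overlaps(entities):
    """Keep the highest-scoring span in each overlap; ties go to the longer span."""
    kept = []
    ranked = sorted(entities, key=lambda e: (-e.score, -(e.end - e.start), e.start))
    for ent in ranked:
        # Accept a span only if it does not intersect any span already kept.
        if all(ent.end <= k.start or ent.start >= k.end for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e.start)


def detect(text, allow_list=(), deny_list=()):
    """Run custom patterns, add deny-list hits, drop allow-listed text."""
    found = []
    for label, (pattern, score) in CUSTOM_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Entity(m.start(), m.end(), label, score))
    # Deny list: these exact (case-sensitive) strings are always flagged.
    for word in deny_list:
        for m in re.finditer(re.escape(word), text):
            found.append(Entity(m.start(), m.end(), "DENY_LIST", 1.0))
    # Allow list: spans whose exact text is allowed are never reported.
    allowed = set(allow_list)
    found = [e for e in found if text[e.start:e.end] not in allowed]
    return resolve_overlaps(found)


text = "Badge EMP-12345 belongs to Project Falcon, room 40213."
for ent in detect(text, allow_list=["40213"], deny_list=["Project Falcon"]):
    print(ent.label, repr(text[ent.start:ent.end]), ent.score)
```

In this example the `NUMBER` match inside `EMP-12345` loses to the higher-scoring `EMPLOYEE_ID` span, and the room number is suppressed by the allow list.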
The evaluation feature expects a CSV file with:
- A `text` column containing the text to evaluate
- A `label` column with JSON annotations in this format:

      [{"start": 58, "end": 68, "text": "John Smith", "labels": ["name"]}]

Project structure:

    pii-detection/
    ├── app.py      # Main application entry point
    ├── config/     # Configuration settings
    ├── models/     # Model implementations
    ├── core/       # Core detection and anonymization logic
    ├── ui/         # UI components
    ├── utils/      # Utility functions
    ├── data/       # Sample data and resources
    └── README.md   # Project documentation

Dependencies:

- presidio-analyzer: Microsoft's PII detection library
- presidio-anonymizer: Microsoft's PII anonymization library
- streamlit: Web application framework
- spacy: NLP toolkit
- transformers: Hugging Face's transformer models
- pandas: Data manipulation
- openai: OpenAI API for synthetic data
- This application uses Microsoft Presidio for PII detection and anonymization
- UI built with Streamlit

