
Scrubah.PII - Forensic Medical Data Sanitizer


Zero-Trust PII Scrubbing + Temporal Medical Record Compilation

Sanitize medical documents locally in your browser. Generate LLM-optimized timelines with content-based deduplication, structured lab extraction, and chronological organization.

🚀 Try it Live • Features • Quick Start • Documentation


🎯 What It Does

Scrubah.PII transforms messy medical records into clean, LLM-ready datasets using a triple-pipeline architecture:

Pipeline 1: Blacklist (PII Scrubbing)

  1. Regex Detection: Structural PII patterns (email, phone, SSN, MRN, dates)
  2. ML Detection: Named entities (names, locations, organizations) via BERT NER
  3. Placeholder Generation: Consistent redaction across all documents
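Steps 1 and 3 can be sketched in a few lines of TypeScript. This is a simplified illustration, not the actual `services/piiScrubber.ts` implementation: the pattern set and placeholder format are assumptions, and the real pipeline also runs BERT NER (step 2).

```typescript
// Sketch of regex detection plus consistent placeholder generation.
// Pattern names and the placeholder format are illustrative assumptions.
const PII_PATTERNS: Record<string, RegExp> = {
  EMAIL: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  SSN: /\b\d{3}-\d{2}-\d{4}\b/g,
};

function scrubWithPlaceholders(text: string): {
  text: string;
  replacements: Map<string, string>;
} {
  const replacements = new Map<string, string>();
  let out = text;
  for (const [label, pattern] of Object.entries(PII_PATTERNS)) {
    out = out.replace(pattern, (match) => {
      // Same entity -> same placeholder, for every occurrence.
      if (!replacements.has(match)) {
        replacements.set(match, `[${label}_${replacements.size + 1}]`);
      }
      return replacements.get(match)!;
    });
  }
  return { text: out, replacements };
}
```

Keying the map on the matched string, and sharing one map across a batch, is what keeps redaction consistent between documents.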

Pipeline 2: Whitelist (Clinical Extraction)

  1. Structured Extraction: Lab values, imaging findings, pathology results
  2. Safe-by-Design: Only extracts validated medical data, PII excluded by design
  3. Timeline Format: Clean markdown tables optimized for LLM consumption

Pipeline 3: Compression (77% Reduction) - New!

Intelligent document compression for LLM context optimization:

  1. OCR Quality Gate: Filter low-quality scans (configurable threshold)
  2. Template Detection: Strip boilerplate headers/footers (81% compression on repetitive content)
  3. Semantic Deduplication: Remove similar documents using embedding similarity
  4. Structured Extraction: Extract labs, meds, diagnoses, vitals, imaging findings
  5. Narrative Generation: Generate concise clinical summaries (62% compression)
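The staged flow can be sketched as plain function composition. Everything below (the `Doc` shape, the quality threshold, and the toy boilerplate regex) is an illustrative assumption, not the project's actual API; stages 3-5 are omitted for brevity:

```typescript
type Doc = { text: string; ocrQuality: number };

// Stage 1: OCR quality gate - drop scans below a configurable threshold.
const ocrGate = (docs: Doc[], min = 0.6): Doc[] =>
  docs.filter((d) => d.ocrQuality >= min);

// Stage 2: template stripping - remove boilerplate lines (toy pattern).
const stripBoilerplate = (d: Doc): Doc => ({
  ...d,
  text: d.text.replace(/^CONFIDENTIAL.*$/gm, "").trim(),
});

function runPipeline(docs: Doc[]): Doc[] {
  return ocrGate(docs).map(stripBoilerplate);
}
```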

Additional Features

  • Document Parsing: PDFs (digital + OCR), DOCX, images, text files
  • Smart Deduplication: SHA-256 exact + semantic similarity for fuzzy matching
  • 100% Local: All processing in-browser using WebAssembly ML models

Perfect for: Healthcare researchers, clinical data analysts, AI medical applications, HIPAA-compliant workflows


✨ Features

🔒 Privacy-First Architecture

  • No server uploads - Everything runs locally via WASM
  • No API calls - NER model runs in-browser
  • IndexedDB storage - Data never leaves your machine
  • Open source - Audit the code yourself

🧠 Dual-Pipeline PII Safety

Blacklist Approach (PII Scrubbing):

  • Regex patterns: Email, phone, SSN, credit cards, MRN (with context awareness)
  • ML entity recognition: Names (PER), locations (LOC), organizations (ORG) via BERT NER
  • Confidence scoring: 85%+ threshold to reduce false positives
  • Placeholder consistency: Same entity β†’ same placeholder across documents

Whitelist Approach (Clinical Extraction) - New in v2.0:

  • Structured data only: Lab values, diagnoses, imaging findings, medications
  • PII-free by design: Names/identifiers never enter extraction pipeline
  • Safe clinical output: Only validated medical terminology in results
  • Why safer: Concatenated PII (e.g., "SMITH,JOHN01/15/1980") bypasses regex but won't appear in extracted lab values

📊 Intelligent Timeline Compilation

  • Smart deduplication: SHA-256 exact + semantic embedding similarity
  • Date extraction: From filenames and document content (date-fns)
  • Document classification: Labs, imaging, progress notes, pathology, etc.
  • Structured lab data: 30+ common tests extracted into tables
  • Trend analysis: Automatic comparison of sequential lab values
  • Cross-referencing: Links between related documents

πŸ—œοΈ Compression Pipeline (229 tests)

  • OCR Quality Gate: Character pattern analysis, configurable thresholds
  • Template Detection: N-gram fingerprinting with FNV-1a hashing
  • Semantic Dedup: Cosine similarity on word embeddings, Union-Find clustering
  • Structured Extraction: Regex-based clinical data extraction with confidence scoring
  • Narrative Generation: Template-based summarization, configurable verbosity

🚀 Performance Optimized

  • Chunked processing: 2000-char chunks for optimal ML inference
  • Progress logging: Real-time console feedback
  • Background processing: Non-blocking UI updates
  • Efficient tokenization: 40% token reduction via table formatting

πŸ—οΈ Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                         Browser (Client)                                 │
├──────────────────────────────────────────────────────────────────────────┤
│  React UI  →  File Upload  →  Triple Processing Pipeline                 │
│     ↓              ↓                                                     │
│  Parser    →  ┌─────────────────────────────────────┐                    │
│  (PDF.js)     │  PIPELINE 1: Blacklist (Scrubbing)  │                    │
│     ↓         │  Regex + BERT NER → Placeholders    │                    │
│               └─────────────────────────────────────┘                    │
│     ↓                        ↓                                           │
│               ┌─────────────────────────────────────┐                    │
│               │  PIPELINE 2: Whitelist (Extraction) │                    │
│               │  Structured Medical Data Only       │                    │
│               └─────────────────────────────────────┘                    │
│     ↓                        ↓                                           │
│               ┌─────────────────────────────────────────────────────┐    │
│               │  PIPELINE 3: Compression (77% reduction)            │    │
│               │  OCR Gate → Templates → Dedup → Extract → Narrate   │    │
│               └─────────────────────────────────────────────────────┘    │
│     ↓              ↓                    ↓                                │
│  Dexie     →   IndexedDB    →    Timeline Generator                      │
│                                  (Content Hasher + Markdown)             │
└──────────────────────────────────────────────────────────────────────────┘

Stack:

  • Frontend: React 18 + TypeScript 5.9 + Vite 7.2
  • Parsing: PDF.js (digital + OCR), Mammoth (DOCX), Tesseract.js (images)
  • ML: Hugging Face Transformers.js (Xenova/bert-base-NER, quantized)
  • Storage: Dexie (IndexedDB wrapper)
  • Utilities: date-fns, clsx, tailwind-merge, JSZip
  • Testing: Vitest + React Testing Library

🚀 Quick Start

Prerequisites

  • Node.js 18+ (for dev server)
  • Modern browser with WASM support (Chrome 91+, Firefox 89+, Safari 15+)

Installation

# Clone the repository
git clone https://github.com/Heyoub/scrubah-pii.git
cd scrubah-pii

# Install dependencies
pnpm install

# Start development server
pnpm start

Open http://localhost:3500/ (or check console for port)

Deployment Modes

Development (Local Bundling)

pnpm start
  • Uses local bundled assets via Vite
  • All processing runs entirely in your browser
  • Zero external API calls
  • Complete privacy guarantee

Production (Pre-built)

  • Pre-built deployments may use ESM importmap with CDN for module delivery
  • All data processing still runs 100% locally in your browser
  • CDN only delivers static JavaScript modules, not data
  • Verify deployment source before using with sensitive documents

Basic Usage

  1. Upload Documents: Drag & drop PDFs, DOCX, or images
  2. Wait for Processing: PII detection runs automatically
  3. Download Options:
    • Individual Files: Click download icon per file
    • Zip Bundle: Download all processed files
    • Master Timeline: Generate chronological medical record

Timeline Generation

graph TD
    A[Upload medical PDFs] --> B[Wait for green checkmarks]
    B --> C[Click Generate Timeline button]
    C --> D[Download Medical_Timeline_YYYY-MM-DD.md]
    D --> E[Feed to your AI of choice for analysis]

Example Timeline Output:

# πŸ₯ Medical Record Timeline

## πŸ“Š Summary
- Date Range: 2018-07-19 β†’ 2025-11-20
- Total: 142 files (89 unique, 53 duplicates)
- Labs: 45 | Imaging: 18 | Progress Notes: 26

---

### 🧪 2025-10-22 | Lab Results
**Document #87** | Hash: `a3f9c2d1`

| Test | Value | Reference | Status |
|------|-------|-----------|--------|
| WBC  | 8.5   | 4.0-11.0  | ✅ Normal |
| HGB  | 13.2  | 13.5-17.5 | ⬇️ Low |

#### Trends vs Previous
- HGB: 14.1 → 13.2 (↓ -6.4%)

---

### [DUPLICATE] 2025-10-22 | Lab Results (1).pdf
⚠️ Exact duplicate of document #87. Content omitted.

📚 Documentation


🧪 Testing

# Run all tests
pnpm test

# Run with UI
pnpm run test:ui

# Run with coverage
pnpm run test:coverage

# Type checking
pnpm run build  # Runs tsc + vite build

Test Coverage:

  • File Parser: PDF (digital + OCR), DOCX, images
  • PII Scrubber: Regex patterns, ML inference, placeholder consistency
  • Markdown Formatter: YAML frontmatter, artifact removal
  • Compression Pipeline (229 tests):
    • Template Detection: 49 tests (N-gram fingerprinting, 81% compression)
    • Semantic Dedup: 64 tests (cosine similarity, Union-Find clustering)
    • Structured Extraction: 51 tests (labs, meds, diagnoses, vitals)
    • Narrative Generation: 38 tests (template-based summaries, 62% compression)
    • Unified Pipeline: 27 tests (end-to-end orchestration, 77% compression)

🔧 Configuration

Environment Variables

No environment variables required! Everything runs locally.

Customization

Add Custom Lab Tests (services/labExtractor.ts):

const LAB_TEST_PATTERNS = {
  CUSTOM_TEST: /(?:Test Name).*?(\d+\.?\d*)\s*(?:unit)/i,
  // Add your patterns here
};

Adjust Duplicate Threshold (services/contentHasher.ts):

if (similarity >= 0.95) {  // Change threshold here
  return { isDuplicate: true, ... };
}

Modify ML Confidence (services/piiScrubber.ts):

const entities = output.filter(e => e.score > 0.85);  // Adjust here

📖 API Documentation

Core Services

Pipeline 1: Blacklist (PII Scrubbing)

parseFile(file: File): Promise<string>

Parses various file formats into plain text.

Supported Formats: PDF (digital + OCR), DOCX, Images, Text files

runScrubPII(text: string): Promise<ScrubResult>

Removes PII using hybrid regex + ML approach (Effect-TS).

Returns:

interface ScrubResult {
  text: ScrubbedText;       // Branded type (type-safe)
  replacements: PIIMap;      // Original → Placeholder mapping
  count: number;             // Total entities replaced
}

Pipeline 2: Whitelist (Clinical Extraction)

extractMedicalData(doc: Document): Effect<MedicalData, ValidationError, never>

Extracts only structured clinical data (PII-free by design).

Returns:

interface MedicalData {
  documentType: "lab_report" | "imaging" | "pathology" | "clinical_note";
  labPanels: LabPanel[];           // Structured lab results
  imagingFindings: ImagingResult[]; // Radiology findings
  pathology: PathologyResult[];     // Pathology diagnoses
  diagnoses: Diagnosis[];           // Clinical diagnoses
  medications: Medication[];        // Medication lists
  // PII never enters this structure
}

runExtractionPipeline(docs: Document[]): Effect<Timeline, ValidationError, never>

Generates PII-free medical timeline from structured extractions.

Returns:

interface Timeline {
  markdown: string;              // Clean clinical timeline
  extraction: ExtractionStats;   // Success/failure counts
}

Pipeline 3: Compression

CompressionPipelineService.process(docs, config?): Effect<PipelineResult>

Runs documents through the full compression pipeline.

Stages:

  1. OCR Quality Gate - Filter low-quality scans
  2. Template Detection - Strip boilerplate (81% compression)
  3. Semantic Dedup - Remove similar documents
  4. Structured Extraction - Extract clinical data
  5. Narrative Generation - Generate summaries (62% compression)

Returns:

interface PipelineResult {
  documents: DocumentResult[];
  compressionRatio: number;        // 0-1, typically 0.77 (77%)
  totalInputChars: number;
  totalOutputChars: number;
  ocrFilteredCount: number;
  duplicatesRemoved: number;
  stages: StageResult[];
}
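For clarity, `compressionRatio` presumably relates to the two character counters as `1 - totalOutputChars / totalInputChars`, so 0.77 means 77% of input characters were removed. A guarded sketch of that relationship:

```typescript
// Assumed definition of the ratio; not the project's actual code.
function compressionRatio(totalInputChars: number, totalOutputChars: number): number {
  return totalInputChars === 0 ? 0 : 1 - totalOutputChars / totalInputChars;
}
```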

Individual Stage Services

// Template Detection - N-gram fingerprinting
TemplateDetectionService.buildCorpus(docs): Effect<TemplateCorpus>
TemplateDetectionService.stripTemplates(doc, corpus): Effect<StrippedDocument>

// Semantic Deduplication - Embedding similarity
SemanticDedupService.findDuplicates(docs, config): Effect<DeduplicationResult>

// Structured Extraction - Clinical data
StructuredExtractionService.extractAll(doc): Effect<ExtractionResult>
StructuredExtractionService.extractLabs(text): Effect<LabPanel[]>
StructuredExtractionService.extractMedications(text): Effect<Medication[]>

// Narrative Generation - Summaries
NarrativeGenerationService.generate(input, config): Effect<NarrativeResult>

Legacy Services

buildMasterTimeline(files: ProcessedFile[]): Promise<MasterTimeline>

Legacy blacklist-only timeline generator. Use whitelist pipeline for safer output.


🎨 Tech Stack Details

Why This Stack?

React + TypeScript: Type-safe UI development with excellent developer experience

Vite: Lightning-fast HMR, optimized production builds, native ESM support

Transformers.js: Run Hugging Face models in-browser via WASM (no server needed)

PDF.js: Mozilla's battle-tested PDF renderer, handles both digital and scanned PDFs

Dexie: Best-in-class IndexedDB wrapper with TypeScript support

date-fns: Lightweight (13KB gzipped), tree-shakeable, comprehensive date utilities


🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow existing code style (TypeScript strict mode)
  • Add tests for new features
  • Update documentation
  • Run pnpm run build before committing (type checks)

📊 Performance

Timeline Generation (tested on i7 + 3GB VRAM):

  • 10 documents: ~100-200ms
  • 50 documents: ~300-500ms
  • 100 documents: ~500-800ms
  • 200+ documents: ~1-2s

PII Scrubbing (per document):

  • Small (< 5 pages): ~2-5s
  • Medium (5-20 pages): ~5-15s
  • Large (20+ pages): ~15-30s

Token Efficiency:

  • Individual files: ~213,000 tokens (142 files)
  • Master timeline: ~130,000 tokens (40% reduction!)

πŸ›‘οΈ Security & Privacy

Local-First Architecture

  • No server uploads: All processing happens in-browser
  • No external APIs: ML models run via WASM
  • No telemetry: Zero tracking or analytics
  • Open source: Fully auditable code

HIPAA Considerations

While Scrubah.PII runs locally and maintains privacy, it is provided as-is without warranty. Healthcare organizations must:

  • Conduct their own security audit
  • Implement appropriate safeguards per HIPAA requirements
  • Test thoroughly before production use
  • Consult legal counsel for compliance

📄 License

MIT License - see LICENSE file for details.


πŸ™ Acknowledgments

Built by @Heyoub for @forgestack

© 2025 Forgestack.app
