Zero-Trust PII Scrubbing + Temporal Medical Record Compilation
Sanitize medical documents locally in your browser. Generate LLM-optimized timelines with content-based deduplication, structured lab extraction, and chronological organization.
Scrubah.PII transforms messy medical records into clean, LLM-ready datasets using a triple-pipeline architecture:
- Regex Detection: Structural PII patterns (email, phone, SSN, MRN, dates)
- ML Detection: Named entities (names, locations, organizations) via BERT NER
- Placeholder Generation: Consistent redaction across all documents
- Structured Extraction: Lab values, imaging findings, pathology results
- Safe-by-Design: Only extracts validated medical data, PII excluded by design
- Timeline Format: Clean markdown tables optimized for LLM consumption
Intelligent document compression for LLM context optimization:
- OCR Quality Gate: Filter low-quality scans (configurable threshold)
- Template Detection: Strip boilerplate headers/footers (81% compression on repetitive content)
- Semantic Deduplication: Remove similar documents using embedding similarity
- Structured Extraction: Extract labs, meds, diagnoses, vitals, imaging findings
- Narrative Generation: Generate concise clinical summaries (62% compression)
- Document Parsing: PDFs (digital + OCR), DOCX, images, text files
- Smart Deduplication: SHA-256 exact + semantic similarity for fuzzy matching
- 100% Local: All processing in-browser using WebAssembly ML models
Perfect for: Healthcare researchers, clinical data analysts, AI medical applications, HIPAA-compliant workflows
- No server uploads - Everything runs locally via WASM
- No API calls - NER model runs in-browser
- IndexedDB storage - Data never leaves your machine
- Open source - Audit the code yourself
Blacklist Approach (PII Scrubbing):
- Regex patterns: Email, phone, SSN, credit cards, MRN (with context awareness)
- ML entity recognition: Names (PER), locations (LOC), organizations (ORG) via BERT NER
- Confidence scoring: 85%+ threshold to reduce false positives
- Placeholder consistency: Same entity → same placeholder across documents
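The regex half of the blacklist approach, together with placeholder consistency, can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `PATTERNS` table, the `PlaceholderMap` class, and the `[LABEL_n]` placeholder format are all assumptions made for the example.

```typescript
// Hypothetical pattern table; the project's actual patterns are not shown in this README.
const PATTERNS: Record<string, RegExp> = {
  EMAIL: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  SSN: /\b\d{3}-\d{2}-\d{4}\b/g,
  PHONE: /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g,
};

// Remembers which placeholder was issued for each detected value, so the same
// entity maps to the same placeholder in every document that shares the map.
class PlaceholderMap {
  private readonly byValue = new Map<string, string>();
  private readonly counters = new Map<string, number>();

  placeholderFor(label: string, value: string): string {
    const key = `${label}:${value.toLowerCase()}`;
    const existing = this.byValue.get(key);
    if (existing !== undefined) return existing;
    const next = (this.counters.get(label) ?? 0) + 1;
    this.counters.set(label, next);
    const placeholder = `[${label}_${next}]`; // assumed placeholder format
    this.byValue.set(key, placeholder);
    return placeholder;
  }
}

// Replaces every regex match with its placeholder.
function scrubWithRegex(text: string, map: PlaceholderMap): string {
  let result = text;
  for (const [label, pattern] of Object.entries(PATTERNS)) {
    result = result.replace(pattern, (match) => map.placeholderFor(label, match));
  }
  return result;
}

// One shared map across documents keeps placeholders consistent.
const shared = new PlaceholderMap();
const docA = scrubWithRegex("Contact jane@example.com or 555-123-4567.", shared);
const docB = scrubWithRegex("Sent to jane@example.com; SSN 123-45-6789.", shared);
```

Because both documents share one map, the email address receives the same placeholder in `docA` and `docB`. Entities found by the NER model could be fed through the same map after applying the confidence threshold.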
Whitelist Approach (Clinical Extraction) - New in v2.0:
- Structured data only: Lab values, diagnoses, imaging findings, medications
- PII-free by design: Names/identifiers never enter extraction pipeline
- Safe clinical output: Only validated medical terminology in results
- Why safer: Concatenated PII (e.g., "SMITH,JOHN01/15/1980") bypasses regex but won't appear in extracted lab values
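A small sketch of why the whitelist approach is safer: only values matching known clinical patterns are copied to the output, so unrecognized text (including concatenated PII) is simply never emitted. The `WHITELIST` table, the units, and the `extractLabs` function below are hypothetical and much simpler than a real extractor.

```typescript
interface LabValue {
  test: string;
  value: number;
  unit: string;
}

// Hypothetical whitelist: only these tests can ever appear in the output.
const WHITELIST: Record<string, RegExp> = {
  WBC: /\bWBC\b[^\d\n]*(\d+(?:\.\d+)?)\s*(K\/uL)/i,
  HGB: /\b(?:HGB|Hemoglobin)\b[^\d\n]*(\d+(?:\.\d+)?)\s*(g\/dL)/i,
};

// Builds the result from captured groups only; the raw text is never copied.
function extractLabs(text: string): LabValue[] {
  const results: LabValue[] = [];
  for (const [test, pattern] of Object.entries(WHITELIST)) {
    const match = pattern.exec(text);
    if (match) {
      results.push({ test, value: Number(match[1]), unit: match[2] });
    }
  }
  return results;
}

// The first line is concatenated PII that a regex blacklist could miss.
const raw = "SMITH,JOHN01/15/1980\nWBC: 8.5 K/uL\nHemoglobin 13.2 g/dL";
const labs = extractLabs(raw);
const serialized = JSON.stringify(labs);
```

Here `serialized` contains the two lab values and nothing from the first line, because no whitelist pattern matches it.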
- Smart deduplication: SHA-256 exact + semantic embedding similarity
- Date extraction: From filenames and document content (date-fns)
- Document classification: Labs, imaging, progress notes, pathology, etc.
- Structured lab data: 30+ common tests extracted into tables
- Trend analysis: Automatic comparison of sequential lab values
- Cross-referencing: Links between related documents
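Trend analysis between sequential lab values comes down to a percent change between the two most recent results. The sketch below is an assumption about how that could be computed; `LabPoint`, `latestTrend`, and `formatTrend` are illustrative names, and the earlier sample date is made up.

```typescript
interface LabPoint {
  date: string; // ISO 8601 (YYYY-MM-DD), so string order equals date order
  value: number;
}

interface Trend {
  from: number;
  to: number;
  percentChange: number; // rounded to one decimal place
}

// Compares the two most recent results; returns null when no trend can be computed.
function latestTrend(points: LabPoint[]): Trend | null {
  if (points.length < 2) return null;
  const sorted = points.slice().sort((a, b) => a.date.localeCompare(b.date));
  const previous = sorted[sorted.length - 2];
  const current = sorted[sorted.length - 1];
  if (previous.value === 0) return null; // avoid division by zero
  const raw = ((current.value - previous.value) / previous.value) * 100;
  return {
    from: previous.value,
    to: current.value,
    percentChange: Math.round(raw * 10) / 10,
  };
}

// Renders a trend as a single timeline line.
function formatTrend(test: string, trend: Trend): string {
  const arrow = trend.percentChange > 0 ? "↑" : trend.percentChange < 0 ? "↓" : "→";
  const sign = trend.percentChange > 0 ? "+" : "";
  return `${test}: ${trend.from} → ${trend.to} (${arrow} ${sign}${trend.percentChange}%)`;
}

// Sample data in arbitrary order; the function sorts by date first.
const hgb: LabPoint[] = [
  { date: "2025-10-22", value: 13.2 },
  { date: "2025-07-14", value: 14.1 },
];
const trend = latestTrend(hgb);
const line = trend ? formatTrend("HGB", trend) : "HGB: not enough data";
```

With these sample values the change is (13.2 - 14.1) / 14.1, or about -6.4%.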
- OCR Quality Gate: Character pattern analysis, configurable thresholds
- Template Detection: N-gram fingerprinting with FNV-1a hashing
- Semantic Dedup: Cosine similarity on word embeddings, Union-Find clustering
- Structured Extraction: Regex-based clinical data extraction with confidence scoring
- Narrative Generation: Template-based summarization, configurable verbosity
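Template detection via N-gram fingerprinting with FNV-1a hashing might look like the sketch below. It is one plausible reading of the technique, not the project's code: the standalone `buildCorpus` and `stripTemplates` functions only borrow their names from the service methods listed in this README, and the 3-word N-gram size and 80% document-share threshold are assumptions.

```typescript
// 32-bit FNV-1a over UTF-16 code units (identical to byte-wise FNV-1a for ASCII text).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// Lowercased word N-grams of a line; short lines yield a single gram.
function wordNgrams(line: string, n: number): string[] {
  const words = line.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return [];
  if (words.length < n) return [words.join(" ")];
  const grams: string[] = [];
  for (let i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(" "));
  }
  return grams;
}

// Counts, for each fingerprint, how many documents contain it.
function buildCorpus(docs: string[], n = 3): Map<number, number> {
  const docFrequency = new Map<number, number>();
  for (const doc of docs) {
    const seen = new Set<number>();
    for (const line of doc.split("\n")) {
      for (const gram of wordNgrams(line, n)) seen.add(fnv1a(gram));
    }
    seen.forEach((h) => docFrequency.set(h, (docFrequency.get(h) ?? 0) + 1));
  }
  return docFrequency;
}

// Drops lines whose every N-gram occurs in at least `minShare` of the documents.
function stripTemplates(
  doc: string,
  corpus: Map<number, number>,
  docCount: number,
  minShare = 0.8,
  n = 3,
): string {
  return doc
    .split("\n")
    .filter((line) => {
      const grams = wordNgrams(line, n);
      if (grams.length === 0) return true; // keep blank lines
      return !grams.every((g) => (corpus.get(fnv1a(g)) ?? 0) / docCount >= minShare);
    })
    .join("\n");
}

const docs = [
  "General Hospital Medical Records Department\nWBC 8.5 normal",
  "General Hospital Medical Records Department\nChest X-ray clear today",
  "General Hospital Medical Records Department\nHGB 13.2 slightly low",
];
const corpus = buildCorpus(docs);
const stripped = stripTemplates(docs[0], corpus, docs.length);
```

The shared header line is removed from each document while the clinical line survives. Note that this sketch needs several documents to work: with a single document every line trivially occurs in 100% of the corpus.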
- Chunked processing: 2000-char chunks for optimal ML inference
- Progress logging: Real-time console feedback
- Background processing: Non-blocking UI updates
- Efficient tokenization: 40% token reduction via table formatting
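Chunked processing can be sketched as a splitter that cuts text into pieces of at most 2000 characters and prefers to break on whitespace, so that a name is less likely to be split across two inference calls. The `chunkText` function and the `offset` field are assumptions for illustration; the project's chunking rules may differ.

```typescript
interface Chunk {
  text: string;
  offset: number; // start index in the original text, for mapping entity spans back
}

// Splits text into chunks of at most maxChars, breaking after a space when possible.
function chunkText(text: string, maxChars = 2000): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      // Look for the last space inside the window; fall back to a hard cut.
      const lastSpace = text.lastIndexOf(" ", end - 1);
      if (lastSpace > start) end = lastSpace + 1;
    }
    chunks.push({ text: text.slice(start, end), offset: start });
    start = end;
  }
  return chunks;
}

const sample = "word ".repeat(1000); // 5000 characters
const chunks = chunkText(sample);
const rejoined = chunks.map((c) => c.text).join("");
```

Concatenating the chunks reproduces the input exactly, and each chunk's `offset` lets positions reported by the model be translated back to document positions.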
┌─────────────────────────────────────────────────────────────────────┐
│                          Browser (Client)                           │
├─────────────────────────────────────────────────────────────────────┤
│  React UI → File Upload → Triple Processing Pipeline                │
│                  ↓                          ↓                       │
│               Parser  →   ┌────────────────────────────────────┐    │
│              (PDF.js)     │ PIPELINE 1: Blacklist (Scrubbing)  │    │
│                  ↓        │ Regex + BERT NER → Placeholders    │    │
│                           └────────────────────────────────────┘    │
│                  ↓                          ↓                       │
│                           ┌────────────────────────────────────┐    │
│                           │ PIPELINE 2: Whitelist (Extraction) │    │
│                           │ Structured Medical Data Only       │    │
│                           └────────────────────────────────────┘    │
│                  ↓                          ↓                       │
│             ┌──────────────────────────────────────────────────┐    │
│             │ PIPELINE 3: Compression (77% reduction)          │    │
│             │ OCR Gate → Templates → Dedup → Extract → Narrate │    │
│             └──────────────────────────────────────────────────┘    │
│      ↓         ↓               ↓                                    │
│    Dexie → IndexedDB → Timeline Generator                           │
│                        (Content Hasher + Markdown)                  │
└─────────────────────────────────────────────────────────────────────┘
Stack:
- Frontend: React 18 + TypeScript 5.9 + Vite 7.2
- Parsing: PDF.js (digital + OCR), Mammoth (DOCX), Tesseract.js (images)
- ML: Hugging Face Transformers.js (Xenova/bert-base-NER, quantized)
- Storage: Dexie (IndexedDB wrapper)
- Utilities: date-fns, clsx, tailwind-merge, JSZip
- Testing: Vitest + React Testing Library
- Node.js 18+ (for dev server)
- Modern browser with WASM support (Chrome 91+, Firefox 89+, Safari 15+)
# Clone the repository
git clone https://github.com/Heyoub/scrubah-pii.git
cd scrubah-pii
# Install dependencies
pnpm install
# Start development server
pnpm start
Open http://localhost:3500/ (or check console for port)
Development (Local Bundling)
pnpm start
- Uses local bundled assets via Vite
- All processing runs entirely in your browser
- Zero external API calls
- Complete privacy guarantee
Production (Pre-built)
- Pre-built deployments may use ESM importmap with CDN for module delivery
- All data processing still runs 100% locally in your browser
- CDN only delivers static JavaScript modules, not data
- Verify deployment source before using with sensitive documents
- Upload Documents: Drag & drop PDFs, DOCX, or images
- Wait for Processing: PII detection runs automatically
- Download Options:
  - Individual Files: Click download icon per file
  - Zip Bundle: Download all processed files
  - Master Timeline: Generate chronological medical record
graph TD
A[Upload medical PDFs] --> B[Wait for green checkmarks]
B --> C[Click Generate Timeline button]
C --> D[Download Medical_Timeline_YYYY-MM-DD.md]
D --> E[Feed to your AI of choice for analysis]
Example Timeline Output:
# 🏥 Medical Record Timeline
## Summary
- Date Range: 2018-07-19 → 2025-11-20
- Total: 142 files (89 unique, 53 duplicates)
- Labs: 45 | Imaging: 18 | Progress Notes: 26
---
### 🧪 2025-10-22 | Lab Results
**Document #87** | Hash: `a3f9c2d1`
| Test | Value | Reference | Status |
|------|-------|-----------|--------|
| WBC | 8.5 | 4.0-11.0 | ✅ Normal |
| HGB | 13.2 | 13.5-17.5 | ⬇️ Low |
#### Trends vs Previous
- HGB: 14.1 → 13.2 (↓ -6.4%)
---
### [DUPLICATE] 2025-10-22 | Lab Results (1).pdf
⚠️ Exact duplicate of document #87. Content omitted.
- Timeline Usage Guide - How to use the timeline feature
- Timeline Implementation Guide - Technical deep dive
- API Documentation - Service interfaces (below)
# Run all tests
pnpm test
# Run with UI
pnpm run test:ui
# Run with coverage
pnpm run test:coverage
# Type checking
pnpm run build # Runs tsc + vite build
Test Coverage:
- File Parser: PDF (digital + OCR), DOCX, images
- PII Scrubber: Regex patterns, ML inference, placeholder consistency
- Markdown Formatter: YAML frontmatter, artifact removal
- Compression Pipeline (229 tests):
  - Template Detection: 49 tests (N-gram fingerprinting, 81% compression)
  - Semantic Dedup: 64 tests (cosine similarity, Union-Find clustering)
  - Structured Extraction: 51 tests (labs, meds, diagnoses, vitals)
  - Narrative Generation: 38 tests (template-based summaries, 62% compression)
  - Unified Pipeline: 27 tests (end-to-end orchestration, 77% compression)
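For readers unfamiliar with the semantic dedup technique named in the test list, the sketch below shows cosine similarity combined with Union-Find clustering. It is an illustration under stated assumptions: plain term-frequency vectors stand in for real embeddings, and `toVector`, `cosine`, `UnionFind`, and `clusterDuplicates` are hypothetical names rather than the project's API. The 0.95 default mirrors the duplicate threshold shown in the configuration section.

```typescript
type Vector = Map<string, number>;

// Stand-in for an embedding: lowercase term-frequency counts.
function toVector(text: string): Vector {
  const vector: Vector = new Map();
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    vector.set(token, (vector.get(token) ?? 0) + 1);
  }
  return vector;
}

// Cosine similarity between two sparse vectors; 0 if either is empty.
function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, key) => {
    normA += value * value;
    const other = b.get(key);
    if (other !== undefined) dot += value * other;
  });
  b.forEach((value) => {
    normB += value * value;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Disjoint-set structure with path halving; the smaller index becomes the root.
class UnionFind {
  private readonly parent: number[];

  constructor(size: number) {
    this.parent = Array.from({ length: size }, (_, i) => i);
  }

  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]];
      x = this.parent[x];
    }
    return x;
  }

  union(a: number, b: number): void {
    const rootA = this.find(a);
    const rootB = this.find(b);
    if (rootA !== rootB) this.parent[Math.max(rootA, rootB)] = Math.min(rootA, rootB);
  }
}

// Groups document indices whose pairwise similarity meets the threshold.
function clusterDuplicates(docs: string[], threshold = 0.95): number[][] {
  const vectors = docs.map(toVector);
  const sets = new UnionFind(docs.length);
  for (let i = 0; i < docs.length; i++) {
    for (let j = i + 1; j < docs.length; j++) {
      if (cosine(vectors[i], vectors[j]) >= threshold) sets.union(i, j);
    }
  }
  const clusters = new Map<number, number[]>();
  docs.forEach((_, i) => {
    const root = sets.find(i);
    const members = clusters.get(root) ?? [];
    members.push(i);
    clusters.set(root, members);
  });
  return Array.from(clusters.values());
}

const reports = [
  "hemoglobin 13.2 low follow up in two weeks",
  "Hemoglobin 13.2 low follow up in two weeks",
  "chest x-ray clear no acute findings",
];
const clusters = clusterDuplicates(reports);
```

Union-Find makes the grouping transitive: if A matches B and B matches C, all three land in one cluster even when A and C fall just under the threshold. One document per cluster would then be kept and the rest marked as duplicates.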
No environment variables required! Everything runs locally.
Add Custom Lab Tests (services/labExtractor.ts):
const LAB_TEST_PATTERNS = {
  CUSTOM_TEST: /(?:Test Name).*?(\d+\.?\d*)\s*(?:unit)/i,
  // Add your patterns here
};
Adjust Duplicate Threshold (services/contentHasher.ts):
if (similarity >= 0.95) { // Change threshold here
  return { isDuplicate: true, ... };
}
Modify ML Confidence (services/piiScrubber.ts):
const entities = output.filter(e => e.score > 0.85); // Adjust here
Parses various file formats into plain text.
Supported Formats: PDF (digital + OCR), DOCX, Images, Text files
Removes PII using hybrid regex + ML approach (Effect-TS).
Returns:
interface ScrubResult {
  text: ScrubbedText; // Branded type (type-safe)
  replacements: PIIMap; // Original → Placeholder mapping
  count: number; // Total entities replaced
}
Extracts only structured clinical data (PII-free by design).
Returns:
interface MedicalData {
documentType: "lab_report" | "imaging" | "pathology" | "clinical_note";
labPanels: LabPanel[]; // Structured lab results
imagingFindings: ImagingResult[]; // Radiology findings
pathology: PathologyResult[]; // Pathology diagnoses
diagnoses: Diagnosis[]; // Clinical diagnoses
medications: Medication[]; // Medication lists
// PII never enters this structure
}
Generates PII-free medical timeline from structured extractions.
Returns:
interface Timeline {
markdown: string; // Clean clinical timeline
extraction: ExtractionStats; // Success/failure counts
}
Runs documents through the full compression pipeline.
Stages:
- OCR Quality Gate - Filter low-quality scans
- Template Detection - Strip boilerplate (81% compression)
- Semantic Dedup - Remove similar documents
- Structured Extraction - Extract clinical data
- Narrative Generation - Generate summaries (62% compression)
Returns:
interface PipelineResult {
documents: DocumentResult[];
compressionRatio: number; // 0-1, typically 0.77 (77%)
totalInputChars: number;
totalOutputChars: number;
ocrFilteredCount: number;
duplicatesRemoved: number;
stages: StageResult[];
}
// Template Detection - N-gram fingerprinting
TemplateDetectionService.buildCorpus(docs): Effect<TemplateCorpus>
TemplateDetectionService.stripTemplates(doc, corpus): Effect<StrippedDocument>
// Semantic Deduplication - Embedding similarity
SemanticDedupService.findDuplicates(docs, config): Effect<DeduplicationResult>
// Structured Extraction - Clinical data
StructuredExtractionService.extractAll(doc): Effect<ExtractionResult>
StructuredExtractionService.extractLabs(text): Effect<LabPanel[]>
StructuredExtractionService.extractMedications(text): Effect<Medication[]>
// Narrative Generation - Summaries
NarrativeGenerationService.generate(input, config): Effect<NarrativeResult>
Legacy blacklist-only timeline generator. Use whitelist pipeline for safer output.
React + TypeScript: Type-safe UI development with excellent developer experience
Vite: Lightning-fast HMR, optimized production builds, native ESM support
Transformers.js: Run Hugging Face models in-browser via WASM (no server needed)
PDF.js: Mozilla's battle-tested PDF renderer, handles both digital and scanned PDFs
Dexie: Best-in-class IndexedDB wrapper with TypeScript support
date-fns: Lightweight (13KB gzipped), tree-shakeable, comprehensive date utilities
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow existing code style (TypeScript strict mode)
- Add tests for new features
- Update documentation
- Run pnpm run build before committing (type checks)
Timeline Generation (tested on i7 + 3GB VRAM):
- 10 documents: ~100-200ms
- 50 documents: ~300-500ms
- 100 documents: ~500-800ms
- 200+ documents: ~1-2s
PII Scrubbing (per document):
- Small (< 5 pages): ~2-5s
- Medium (5-20 pages): ~5-15s
- Large (20+ pages): ~15-30s
Token Efficiency:
- Individual files: ~213,000 tokens (142 files)
- Master timeline: ~130,000 tokens (roughly a 40% reduction)
- No server uploads: All processing happens in-browser
- No external APIs: ML models run via WASM
- No telemetry: Zero tracking or analytics
- Open source: Fully auditable code
While Scrubah.PII runs locally and maintains privacy, it is provided as-is without warranty. Healthcare organizations must:
- Conduct their own security audit
- Implement appropriate safeguards per HIPAA requirements
- Test thoroughly before production use
- Consult legal counsel for compliance
MIT License - see LICENSE file for details.
Built by @Heyoub for @forgestack
Libraries:
- Transformers.js - Hugging Face models in browser
- PDF.js - Mozilla PDF renderer
- Tesseract.js - OCR engine
- date-fns - Modern date utilities
- Dexie - IndexedDB wrapper
- Author: @Heyoub
- Email: hello@forgestack.app
- Issues: GitHub Issues
© 2025 Forgestack.app