
Scrubah.PII - Forensic Medical Data Sanitizer


Zero-Trust PII Scrubbing + Temporal Medical Record Compilation

Sanitize medical documents locally in your browser. Generate LLM-optimized timelines with content-based deduplication, structured lab extraction, and chronological organization.

🚀 Try it Live • Features • Quick Start • Documentation


🎯 What It Does

Scrubah.PII transforms messy medical records into clean, LLM-ready datasets using a triple-pipeline architecture:

Pipeline 1: Blacklist (PII Scrubbing)

  1. Regex Detection: Structural PII patterns (email, phone, SSN, MRN, dates)
  2. ML Detection: Named entities (names, locations, organizations) via BERT NER
  3. Placeholder Generation: Consistent redaction across all documents
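Steps 1 and 3 can be sketched in a few lines of TypeScript. This is a simplified illustration, not the actual `services/piiScrubber.ts` implementation: the pattern set and placeholder format are assumptions, and the real pipeline also runs BERT NER (step 2).

```typescript
// Sketch of regex detection plus consistent placeholder generation.
// Pattern names and the placeholder format are illustrative assumptions.
const PII_PATTERNS: Record<string, RegExp> = {
  EMAIL: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  SSN: /\b\d{3}-\d{2}-\d{4}\b/g,
};

function scrubWithPlaceholders(text: string): {
  text: string;
  replacements: Map<string, string>;
} {
  const replacements = new Map<string, string>();
  let out = text;
  for (const [label, pattern] of Object.entries(PII_PATTERNS)) {
    out = out.replace(pattern, (match) => {
      // Same entity -> same placeholder, for every occurrence.
      if (!replacements.has(match)) {
        replacements.set(match, `[${label}_${replacements.size + 1}]`);
      }
      return replacements.get(match)!;
    });
  }
  return { text: out, replacements };
}
```

Keying the map on the matched string, and sharing one map across a batch, is what keeps redaction consistent between documents.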

Pipeline 2: Whitelist (Clinical Extraction)

  1. Structured Extraction: Lab values, imaging findings, pathology results
  2. Safe-by-Design: Only extracts validated medical data, PII excluded by design
  3. Timeline Format: Clean markdown tables optimized for LLM consumption

Pipeline 3: Compression (77% Reduction) - New!

Intelligent document compression for LLM context optimization:

  1. OCR Quality Gate: Filter low-quality scans (configurable threshold)
  2. Template Detection: Strip boilerplate headers/footers (81% compression on repetitive content)
  3. Semantic Deduplication: Remove similar documents using embedding similarity
  4. Structured Extraction: Extract labs, meds, diagnoses, vitals, imaging findings
  5. Narrative Generation: Generate concise clinical summaries (62% compression)
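The staged flow can be sketched as plain function composition. Everything below (the `Doc` shape, the quality threshold, and the toy boilerplate regex) is an illustrative assumption, not the project's actual API; stages 3-5 are omitted for brevity:

```typescript
type Doc = { text: string; ocrQuality: number };

// Stage 1: OCR quality gate - drop scans below a configurable threshold.
const ocrGate = (docs: Doc[], min = 0.6): Doc[] =>
  docs.filter((d) => d.ocrQuality >= min);

// Stage 2: template stripping - remove boilerplate lines (toy pattern).
const stripBoilerplate = (d: Doc): Doc => ({
  ...d,
  text: d.text.replace(/^CONFIDENTIAL.*$/gm, "").trim(),
});

function runPipeline(docs: Doc[]): Doc[] {
  return ocrGate(docs).map(stripBoilerplate);
}
```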

Additional Features

  • Document Parsing: PDFs (digital + OCR), DOCX, images, text files
  • Smart Deduplication: SHA-256 exact + semantic similarity for fuzzy matching
  • 100% Local: All processing in-browser using WebAssembly ML models

Perfect for: Healthcare researchers, clinical data analysts, AI medical applications, HIPAA-compliant workflows


✨ Features

🔒 Privacy-First Architecture

  • No server uploads - Everything runs locally via WASM
  • No API calls - NER model runs in-browser
  • IndexedDB storage - Data never leaves your machine
  • Open source - Audit the code yourself

🧠 Dual-Pipeline PII Safety

Blacklist Approach (PII Scrubbing):

  • Regex patterns: Email, phone, SSN, credit cards, MRN (with context awareness)
  • ML entity recognition: Names (PER), locations (LOC), organizations (ORG) via BERT NER
  • Confidence scoring: 85%+ threshold to reduce false positives
  • Placeholder consistency: Same entity β†’ same placeholder across documents

Whitelist Approach (Clinical Extraction) - New in v2.0:

  • Structured data only: Lab values, diagnoses, imaging findings, medications
  • PII-free by design: Names/identifiers never enter extraction pipeline
  • Safe clinical output: Only validated medical terminology in results
  • Why safer: Concatenated PII (e.g., "SMITH,JOHN01/15/1980") bypasses regex but won't appear in extracted lab values

📊 Intelligent Timeline Compilation

  • Smart deduplication: SHA-256 exact + semantic embedding similarity
  • Date extraction: From filenames and document content (date-fns)
  • Document classification: Labs, imaging, progress notes, pathology, etc.
  • Structured lab data: 30+ common tests extracted into tables
  • Trend analysis: Automatic comparison of sequential lab values
  • Cross-referencing: Links between related documents

πŸ—œοΈ Compression Pipeline (229 tests)

  • OCR Quality Gate: Character pattern analysis, configurable thresholds
  • Template Detection: N-gram fingerprinting with FNV-1a hashing
  • Semantic Dedup: Cosine similarity on word embeddings, Union-Find clustering
  • Structured Extraction: Regex-based clinical data extraction with confidence scoring
  • Narrative Generation: Template-based summarization, configurable verbosity

🚀 Performance Optimized

  • Chunked processing: 2000-char chunks for optimal ML inference
  • Progress logging: Real-time console feedback
  • Background processing: Non-blocking UI updates
  • Efficient tokenization: 40% token reduction via table formatting

πŸ—οΈ Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                         Browser (Client)                                 │
├──────────────────────────────────────────────────────────────────────────┤
│  React UI  →  File Upload  →  Triple Processing Pipeline                 │
│     ↓              ↓                                                     │
│  Parser    →  ┌─────────────────────────────────────┐                    │
│  (PDF.js)     │  PIPELINE 1: Blacklist (Scrubbing)  │                    │
│     ↓         │  Regex + BERT NER → Placeholders    │                    │
│               └─────────────────────────────────────┘                    │
│     ↓                        ↓                                           │
│               ┌─────────────────────────────────────┐                    │
│               │  PIPELINE 2: Whitelist (Extraction) │                    │
│               │  Structured Medical Data Only       │                    │
│               └─────────────────────────────────────┘                    │
│     ↓                        ↓                                           │
│               ┌─────────────────────────────────────────────────────┐    │
│               │  PIPELINE 3: Compression (77% reduction)            │    │
│               │  OCR Gate → Templates → Dedup → Extract → Narrate   │    │
│               └─────────────────────────────────────────────────────┘    │
│     ↓              ↓                    ↓                                │
│  Dexie     →   IndexedDB    →    Timeline Generator                      │
│                                  (Content Hasher + Markdown)             │
└──────────────────────────────────────────────────────────────────────────┘

Stack:

  • Frontend: React 18 + TypeScript 5.9 + Vite 7.2
  • Parsing: PDF.js (digital + OCR), Mammoth (DOCX), Tesseract.js (images)
  • ML: Hugging Face Transformers.js (Xenova/bert-base-NER, quantized)
  • Storage: Dexie (IndexedDB wrapper)
  • Utilities: date-fns, clsx, tailwind-merge, JSZip
  • Testing: Vitest + React Testing Library

🚀 Quick Start

Prerequisites

  • Node.js 18+ (for dev server)
  • Modern browser with WASM support (Chrome 91+, Firefox 89+, Safari 15+)

Installation

# Clone the repository
git clone https://github.com/Heyoub/scrubah-pii.git
cd scrubah-pii

# Install dependencies
pnpm install

# Start development server
pnpm start

Open http://localhost:3500/ (or check console for port)

Deployment Modes

Development (Local Bundling)

pnpm start
  • Uses local bundled assets via Vite
  • All processing runs entirely in your browser
  • Zero external API calls
  • Complete privacy guarantee

Production (Pre-built)

  • Pre-built deployments may use ESM importmap with CDN for module delivery
  • All data processing still runs 100% locally in your browser
  • CDN only delivers static JavaScript modules, not data
  • Verify deployment source before using with sensitive documents

Basic Usage

  1. Upload Documents: Drag & drop PDFs, DOCX, or images
  2. Wait for Processing: PII detection runs automatically
  3. Download Options:
    • Individual Files: Click download icon per file
    • Zip Bundle: Download all processed files
    • Master Timeline: Generate chronological medical record

Timeline Generation

graph TD
    A[Upload medical PDFs] --> B[Wait for green checkmarks]
    B --> C[Click Generate Timeline button]
    C --> D[Download Medical_Timeline_YYYY-MM-DD.md]
    D --> E[Feed to your AI of choice for analysis]

Example Timeline Output:

# πŸ₯ Medical Record Timeline

## πŸ“Š Summary
- Date Range: 2018-07-19 β†’ 2025-11-20
- Total: 142 files (89 unique, 53 duplicates)
- Labs: 45 | Imaging: 18 | Progress Notes: 26

---

### 🧪 2025-10-22 | Lab Results
**Document #87** | Hash: `a3f9c2d1`

| Test | Value | Reference | Status |
|------|-------|-----------|--------|
| WBC  | 8.5   | 4.0-11.0  | ✅ Normal |
| HGB  | 13.2  | 13.5-17.5 | ⬇️ Low |

#### Trends vs Previous
- HGB: 14.1 → 13.2 (↓ -6.4%)

---

### [DUPLICATE] 2025-10-22 | Lab Results (1).pdf
⚠️ Exact duplicate of document #87. Content omitted.

📚 Documentation


🧪 Testing

# Run all tests
pnpm test

# Run with UI
pnpm run test:ui

# Run with coverage
pnpm run test:coverage

# Type checking
pnpm run build  # Runs tsc + vite build

Test Coverage:

  • File Parser: PDF (digital + OCR), DOCX, images
  • PII Scrubber: Regex patterns, ML inference, placeholder consistency
  • Markdown Formatter: YAML frontmatter, artifact removal
  • Compression Pipeline (229 tests):
    • Template Detection: 49 tests (N-gram fingerprinting, 81% compression)
    • Semantic Dedup: 64 tests (cosine similarity, Union-Find clustering)
    • Structured Extraction: 51 tests (labs, meds, diagnoses, vitals)
    • Narrative Generation: 38 tests (template-based summaries, 62% compression)
    • Unified Pipeline: 27 tests (end-to-end orchestration, 77% compression)

🔧 Configuration

Environment Variables

No environment variables required! Everything runs locally.

Customization

Add Custom Lab Tests (services/labExtractor.ts):

const LAB_TEST_PATTERNS = {
  CUSTOM_TEST: /(?:Test Name).*?(\d+\.?\d*)\s*(?:unit)/i,
  // Add your patterns here
};

Adjust Duplicate Threshold (services/contentHasher.ts):

if (similarity >= 0.95) {  // Change threshold here
  return { isDuplicate: true, ... };
}

Modify ML Confidence (services/piiScrubber.ts):

const entities = output.filter(e => e.score > 0.85);  // Adjust here

📖 API Documentation

Core Services

Pipeline 1: Blacklist (PII Scrubbing)

parseFile(file: File): Promise<string>

Parses various file formats into plain text.

Supported Formats: PDF (digital + OCR), DOCX, Images, Text files

runScrubPII(text: string): Promise<ScrubResult>

Removes PII using hybrid regex + ML approach (Effect-TS).

Returns:

interface ScrubResult {
  text: ScrubbedText;       // Branded type (type-safe)
  replacements: PIIMap;      // Original → Placeholder mapping
  count: number;             // Total entities replaced
}

Pipeline 2: Whitelist (Clinical Extraction)

extractMedicalData(doc: Document): Effect<MedicalData, ValidationError, never>

Extracts only structured clinical data (PII-free by design).

Returns:

interface MedicalData {
  documentType: "lab_report" | "imaging" | "pathology" | "clinical_note";
  labPanels: LabPanel[];           // Structured lab results
  imagingFindings: ImagingResult[]; // Radiology findings
  pathology: PathologyResult[];     // Pathology diagnoses
  diagnoses: Diagnosis[];           // Clinical diagnoses
  medications: Medication[];        // Medication lists
  // PII never enters this structure
}

runExtractionPipeline(docs: Document[]): Effect<Timeline, ValidationError, never>

Generates PII-free medical timeline from structured extractions.

Returns:

interface Timeline {
  markdown: string;              // Clean clinical timeline
  extraction: ExtractionStats;   // Success/failure counts
}

Pipeline 3: Compression

CompressionPipelineService.process(docs, config?): Effect<PipelineResult>

Runs documents through the full compression pipeline.

Stages:

  1. OCR Quality Gate - Filter low-quality scans
  2. Template Detection - Strip boilerplate (81% compression)
  3. Semantic Dedup - Remove similar documents
  4. Structured Extraction - Extract clinical data
  5. Narrative Generation - Generate summaries (62% compression)

Returns:

interface PipelineResult {
  documents: DocumentResult[];
  compressionRatio: number;        // 0-1, typically 0.77 (77%)
  totalInputChars: number;
  totalOutputChars: number;
  ocrFilteredCount: number;
  duplicatesRemoved: number;
  stages: StageResult[];
}
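For clarity, `compressionRatio` presumably relates to the two character counters as `1 - totalOutputChars / totalInputChars`, so 0.77 means 77% of input characters were removed. A guarded sketch of that relationship:

```typescript
// Assumed definition of the ratio; not the project's actual code.
function compressionRatio(totalInputChars: number, totalOutputChars: number): number {
  return totalInputChars === 0 ? 0 : 1 - totalOutputChars / totalInputChars;
}
```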

Individual Stage Services

// Template Detection - N-gram fingerprinting
TemplateDetectionService.buildCorpus(docs): Effect<TemplateCorpus>
TemplateDetectionService.stripTemplates(doc, corpus): Effect<StrippedDocument>

// Semantic Deduplication - Embedding similarity
SemanticDedupService.findDuplicates(docs, config): Effect<DeduplicationResult>

// Structured Extraction - Clinical data
StructuredExtractionService.extractAll(doc): Effect<ExtractionResult>
StructuredExtractionService.extractLabs(text): Effect<LabPanel[]>
StructuredExtractionService.extractMedications(text): Effect<Medication[]>

// Narrative Generation - Summaries
NarrativeGenerationService.generate(input, config): Effect<NarrativeResult>

Legacy Services

buildMasterTimeline(files: ProcessedFile[]): Promise<MasterTimeline>

Legacy blacklist-only timeline generator. Use whitelist pipeline for safer output.


🎨 Tech Stack Details

Why This Stack?

React + TypeScript: Type-safe UI development with excellent developer experience

Vite: Lightning-fast HMR, optimized production builds, native ESM support

Transformers.js: Run Hugging Face models in-browser via WASM (no server needed)

PDF.js: Mozilla's battle-tested PDF renderer, handles both digital and scanned PDFs

Dexie: Best-in-class IndexedDB wrapper with TypeScript support

date-fns: Lightweight (13KB gzipped), tree-shakeable, comprehensive date utilities


🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow existing code style (TypeScript strict mode)
  • Add tests for new features
  • Update documentation
  • Run pnpm run build before committing (type checks)

📊 Performance

Timeline Generation (tested on i7 + 3GB VRAM):

  • 10 documents: ~100-200ms
  • 50 documents: ~300-500ms
  • 100 documents: ~500-800ms
  • 200+ documents: ~1-2s

PII Scrubbing (per document):

  • Small (< 5 pages): ~2-5s
  • Medium (5-20 pages): ~5-15s
  • Large (20+ pages): ~15-30s

Token Efficiency:

  • Individual files: ~213,000 tokens (142 files)
  • Master timeline: ~130,000 tokens (40% reduction!)

πŸ›‘οΈ Security & Privacy

Local-First Architecture

  • No server uploads: All processing happens in-browser
  • No external APIs: ML models run via WASM
  • No telemetry: Zero tracking or analytics
  • Open source: Fully auditable code

HIPAA Considerations

While Scrubah.PII runs locally and maintains privacy, it is provided as-is without warranty. Healthcare organizations must:

  • Conduct their own security audit
  • Implement appropriate safeguards per HIPAA requirements
  • Test thoroughly before production use
  • Consult legal counsel for compliance

📄 License

MIT License - see LICENSE file for details.


πŸ™ Acknowledgments

Built by @Heyoub for @forgestack

© 2025 Forgestack.app
