Commit 9da8ef8

marc-shade and claude committed
feat: full HIPAA Safe Harbor coverage (all 18 identifiers), 80+ CUI Registry subcategories, regex bug fixes
Expand enhanced_pii.py to cover all 18 HIPAA Safe Harbor de-identification identifiers per 45 CFR 164.514(b)(2), adding fax numbers, zip codes, vehicle IDs, device IDs (UDI), web URLs, IP addresses, biometric identifiers, full-face photo references, professional licenses, DEA numbers, ages over 89, dates of death, and admission/discharge dates. Fix CAC number and EIN/TIN regex patterns that failed to match natural-language variants.

Expand cui_detector.py from ~24 to 80+ CUI categories organized per the NARA CUI Registry, covering critical infrastructure, defense, intelligence, law enforcement, legal, privacy, financial, nuclear, patent, public health, safety, statistical, technology, security, geospatial, and transportation groupings. Add 10 new legacy marking patterns (CEII, UCNI, NNPI, FTI, SBIR, STTR, COMSEC, SAFETY_ACT, CHEM/CFATS, DELIBERATIVE).

Fix sanitizer.py hidden XLSX sheet detection to handle both XML attribute orderings. Fix export_control.py foreign person patterns to match plural forms. Update audit_trail.py CEF version to 0.2.1.

Add 37 new tests (213 total, all passing). Update README and bump version to 0.2.1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 733dd90 commit 9da8ef8
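
The commit message mentions making hidden XLSX sheet detection work for both XML attribute orderings, but the sanitizer.py change is not shown on this page. As a minimal sketch of an order-independent check, assuming the standard SpreadsheetML `workbook.xml` layout, one option is to parse the XML rather than pattern-match it. The function name `hidden_sheet_names` and the sample document are hypothetical, and this is a different technique from the regex fix the commit describes.

```python
import xml.etree.ElementTree as ET
from typing import List

# Main SpreadsheetML namespace used by workbook.xml inside an .xlsx archive.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hypothetical sample: the two hidden sheets list their attributes in
# different orders, which is the case a naive regex can miss.
SAMPLE = """<workbook
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheets>
    <sheet name="Data" sheetId="1"/>
    <sheet name="Secret" sheetId="2" state="hidden"/>
    <sheet state="veryHidden" name="Vault" sheetId="3"/>
  </sheets>
</workbook>"""


def hidden_sheet_names(workbook_xml: str) -> List[str]:
    """Return names of sheets whose state is hidden or veryHidden."""
    root = ET.fromstring(workbook_xml)
    return [
        sheet.get("name")
        for sheet in root.findall("m:sheets/m:sheet", NS)
        # Attribute lookup by name does not depend on attribute order.
        if sheet.get("state") in ("hidden", "veryHidden")
    ]


print(hidden_sheet_names(SAMPLE))
```

Parsing avoids maintaining one regex per attribute ordering, at the cost of failing on malformed XML that a regex would still scan.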

9 files changed, +714 -60 lines changed

README.md

Lines changed: 35 additions & 6 deletions
@@ -6,12 +6,12 @@ Defense-grade document ingestion with CUI detection, ITAR/EAR export control scr
 
 | Framework | Standard | Coverage |
 |-----------|----------|----------|
-| CUI Program | 32 CFR Part 2002 | CUI marking detection, category validation, marking deficiency checks |
+| CUI Program | 32 CFR Part 2002 | CUI marking detection, 80+ category/subcategory validation, marking deficiency checks |
 | NIST 800-171 | SP 800-171 Rev 2 | 20+ security controls mapped (3.1.x, 3.3.x, 3.5.x, 3.8.x, 3.13.x) |
 | NIST 800-53 | SP 800-53 Rev 5 | AU, SI, AC, MP, SC control families |
 | ITAR | 22 CFR 120-130 | USML category detection (I-XXI), technical data screening |
 | EAR | 15 CFR 730-774 | ECCN pattern detection, CCL classification |
-| HIPAA | 45 CFR 160, 164 | PHI detection (MRN, health plan IDs, patient IDs, dates of service) |
+| HIPAA | 45 CFR 164.514(b)(2) | **All 18 Safe Harbor identifiers** -- full de-identification standard coverage |
 | Privacy Act | 5 USC 552a | PII detection with regulatory mapping |
 | FedRAMP | AU Family | Tamper-evident audit trail with SHA-256 hash chain, CEF export |
 | DFARS | 252.204-7012 | CUI protection requirements for defense contractors |
@@ -53,17 +53,45 @@ tests/
 
 ### CUI Detection (32 CFR Part 2002)
 - Detects CUI markings: `CUI`, `CUI//SP-xxx`, `CUI//REL TO`
-- Validates against CUI Registry categories (CTI, PRVCY, INTEL, EXPT, ITAR, etc.)
+- Validates against **80+ CUI Registry categories and subcategories** from the NARA CUI Registry, including:
+  - Critical Infrastructure: CTI, DCRIT, PCII, CEII, SSI
+  - Defense: ITAR, EXPT, SAMI, UCNI, NNPI, TFNI
+  - Intelligence: INTEL, FISA, HUMINT, SIGINT, GEOINT, OSINT, MASINT
+  - Law Enforcement: LES, LESI, GRAND_JURY, INFORMANT, WITNESS, SURVEIL
+  - Legal: LEGAL, ATTY_WORK, ATTY_CLIENT, DELIBERATIVE
+  - Privacy: PRVCY, PII, HIPAA, GENE, SORN, EDUCATIONAL, SUBSTANCE
+  - Financial: TAX, FTI, BANK_SECRECY, PROPIN, PROCUREMENT
+  - Nuclear: UCNI, NNPI, NNSA, NUCLEAR
+  - Science/Technology: SBIR, STTR, RESEARCH
+  - Security: OPSEC, COMSEC, PHYS, INFOSEC, VULN, PENTEST, INCIDENT
+- Detects 15+ legacy marking formats (FOUO, SBU, LES, SSI, PCII, CEII, UCNI, NNPI, FTI, COMSEC, etc.)
 - Detects classification banners: UNCLASSIFIED, CONFIDENTIAL, SECRET, TOP SECRET, TS//SCI
 - Detects dissemination controls: NOFORN, REL TO, ORCON, PROPIN, FISA, IMCON
 - Identifies marking deficiencies (missing banners, contradictory markings, legacy FOUO)
 - Generates handling recommendations per NIST 800-171
 - Risk scoring with NIST control mapping
 
 ### Enhanced PII/PHI Detection
-- **30+ detection categories** with confidence scoring (high/medium/low)
-- Standard PII: email, phone, SSN, credit card, DOB, driver's license, passport, address
-- HIPAA PHI: medical record numbers, health plan IDs, patient IDs, dates of service, diagnoses, prescriptions
+- **42+ detection categories** with confidence scoring (high/medium/low)
+- **Full HIPAA Safe Harbor coverage** -- all 18 identifier categories per 45 CFR 164.514(b)(2):
+  1. Names (via NER)
+  2. Geographic subdivisions (zip codes, addresses)
+  3. Dates (DOB, admission/discharge, death, ages >89)
+  4. Telephone numbers
+  5. Fax numbers
+  6. Email addresses
+  7. Social Security numbers
+  8. Medical record numbers
+  9. Health plan beneficiary numbers (including subscriber/group IDs)
+  10. Account numbers
+  11. Certificate/license numbers (driver's license, professional license, DEA)
+  12. Vehicle identifiers (VIN)
+  13. Device identifiers (UDI, serial numbers)
+  14. Web URLs
+  15. IP addresses
+  16. Biometric identifiers (fingerprint, voiceprint, retinal scan)
+  17. Full-face photographs (reference detection)
+  18. Other unique identifiers
 - Defense PII: DoD ID (EDIPI), CAC numbers, security clearance references, CAGE codes, DUNS numbers, SAM UEI
 - Financial PII: bank routing numbers, SWIFT/BIC codes, EIN/TIN, bank accounts, IBAN
 - Export control markers: ITAR markings, EAR markings, controlled technical data
@@ -356,6 +384,7 @@ pytest tests/ -v
 
 ## Version History
 
+- **0.2.1** -- Full HIPAA Safe Harbor coverage (all 18 identifiers), 80+ CUI Registry subcategories, regex bug fixes, 213 passing tests
 - **0.2.0** -- Defense compliance upgrade: CUI detection, enhanced PII/PHI, document sanitization, ITAR/EAR screening, FedRAMP audit trails
 - **0.1.34** -- Multi-format support, semantic compression, basic PII detection
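
The README lists the Safe Harbor identifier categories and a high/medium/low confidence score, but the enhanced_pii.py patterns are not part of this excerpt. The sketch below shows roughly how three of the listed identifiers (ages over 89, zip codes, IP addresses) could be detected with confidence labels. Every pattern, category label, and the function name `find_safe_harbor_hits` is an assumption for illustration, not the project's code.

```python
import ipaddress
import re
from typing import List, Tuple

# Hypothetical patterns; the real ones in enhanced_pii.py may differ.
AGE_OVER_89 = re.compile(
    r'\b(9\d|1[0-2]\d)[-\s]*(?:years?[-\s]+old|y/?o)\b', re.IGNORECASE
)
ZIP_CODE = re.compile(r'\b\d{5}(?:-\d{4})?\b')
IPV4_CANDIDATE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def find_safe_harbor_hits(text: str) -> List[Tuple[str, str, str]]:
    """Return (category, matched_text, confidence) tuples."""
    hits = []
    for m in AGE_OVER_89.finditer(text):
        hits.append(("age_over_89", m.group(0), "high"))
    for m in ZIP_CODE.finditer(text):
        # Any bare 5-digit number matches, so confidence is low.
        hits.append(("zip_code", m.group(0), "low"))
    for m in IPV4_CANDIDATE.finditer(text):
        try:
            # Reject candidates such as 999.1.1.1 that are not valid IPv4.
            ipaddress.IPv4Address(m.group(0))
        except ValueError:
            continue
        hits.append(("ip_address", m.group(0), "medium"))
    return hits


sample = "Patient, 93 years old, ZIP 90210, connected from 10.0.0.5 (not 999.1.1.1)."
for hit in find_safe_harbor_hits(sample):
    print(hit)
```

Validating IP candidates with the standard-library `ipaddress` module keeps the regex simple while filtering out-of-range octets.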

docsingest/compliance/audit_trail.py

Lines changed: 1 addition & 1 deletion
@@ -142,7 +142,7 @@ def to_cef(self) -> str:
         ext_str = ' '.join(extensions)
 
         return (
-            f"CEF:0|docsingest|ComplianceAudit|0.2.0|"
+            f"CEF:0|docsingest|ComplianceAudit|0.2.1|"
             f"{self.event_type}|{self.action}|{cef_severity}|{ext_str}"
         )
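
The hunk above edits the device-version field of a CEF record, whose layout is `CEF:Version|Device Vendor|Device Product|Device Version|Device Event Class ID|Name|Severity|Extension`. A minimal sketch of building such a line follows, assuming the version is passed in rather than hard-coded so that a release bump needs no edit here. The function `build_cef` and its helpers are hypothetical and are not part of audit_trail.py.

```python
from typing import Dict


def _escape_header(value: str) -> str:
    # CEF header fields escape backslash and pipe.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def _escape_extension(value: str) -> str:
    # CEF extension values escape backslash and equals sign.
    return value.replace("\\", "\\\\").replace("=", "\\=")


def build_cef(vendor: str, product: str, version: str,
              event_class_id: str, name: str, severity: int,
              extensions: Dict[str, str]) -> str:
    """Assemble one CEF:0 line from header fields and key=value extensions."""
    header = "|".join(
        _escape_header(str(part))
        for part in (vendor, product, version, event_class_id, name, severity)
    )
    ext = " ".join(
        f"{key}={_escape_extension(str(val))}" for key, val in extensions.items()
    )
    return f"CEF:0|{header}|{ext}"


print(build_cef("docsingest", "ComplianceAudit", "0.2.1",
                "DOC|INGEST", "ingest", 5, {"fname": "a=b.txt"}))
```

This sketch does not handle newlines inside extension values, which a production exporter would also need to escape.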

docsingest/compliance/cui_detector.py

Lines changed: 129 additions & 16 deletions
@@ -21,32 +21,135 @@
 
 
 class CUICategory(Enum):
-    """CUI Registry categories per 32 CFR Part 2002."""
+    """CUI Registry categories per 32 CFR Part 2002 and NARA CUI Registry.
+
+    Organized by CUI Registry groupings from archives.gov/cui/registry/category-list.
+    Includes both CUI Basic and CUI Specified categories.
+    """
+    # --- Critical Infrastructure ---
     CTI = "Controlled Technical Information"
-    PRVCY = "Privacy"
-    INTEL = "Intelligence"
-    EXPT = "Export Controlled"
+    DCRIT = "Critical Infrastructure"
+    PCII = "Protected Critical Infrastructure Information"
+    CEII = "Critical Energy Infrastructure Information"
+    SSI = "Sensitive Security Information"
+
+    # --- Defense ---
     ITAR = "International Traffic in Arms Regulations"
+    EXPT = "Export Controlled"
+    SAMI = "Controlled Technical Information - Space"
+    NOFORN_DATA = "Not Releasable to Foreign Nationals Data"
+    UCNI = "Unclassified Controlled Nuclear Information"
+    NNPI = "Naval Nuclear Propulsion Information"
+    TFNI = "Transclassified Foreign Nuclear Information"
+    DoD_UCTI = "DoD Unclassified Controlled Technical Information"
+
+    # --- Export Control ---
+    EI = "Export Information"
+
+    # --- Financial ---
+    TAX = "Federal Taxpayer Information"
+    BUDGT = "Budget"
+    FTI = "Federal Tax Information"
+    BANK_SECRECY = "Bank Secrecy"
     PROPIN = "Proprietary Business Information"
+    PRIV_FIN = "Privileged Financial Information"
+    PROCUREMENT = "Source Selection Information"
+
+    # --- Immigration ---
+    IMMIG = "Immigration"
+    VISA = "Visa Information"
+
+    # --- Intelligence ---
+    INTEL = "Intelligence"
+    FISA = "Foreign Intelligence Surveillance Act"
+    HUMINT = "Human Intelligence"
+    SIGINT = "Signals Intelligence"
+    GEOINT = "Geospatial Intelligence"
+    OSINT = "Open Source Intelligence"
+    MASINT = "Measurement and Signature Intelligence"
+
+    # --- International Agreements ---
+    INTL_AGREE = "International Agreement Information"
+
+    # --- Law Enforcement ---
     LES = "Law Enforcement Sensitive"
+    LESI = "Law Enforcement Sensitive Investigation"
+    GRAND_JURY = "Grand Jury Information"
+    INFORMANT = "Confidential Informant Identity"
+    WITNESS = "Witness Protection Information"
+    SURVEIL = "Surveillance Information"
+    DEA_SENS = "DEA Sensitive Information"
+
+    # --- Legal ---
+    LEGAL = "Legal Privilege"
+    ATTY_WORK = "Attorney Work Product"
+    ATTY_CLIENT = "Attorney-Client Privilege"
+    DELIBERATIVE = "Deliberative Process"
+
+    # --- Natural & Cultural Resources ---
+    ARCH = "Archaeological Resource Information"
+    CULTURAL = "Cultural Resource Information"
+    SPECIES = "Endangered Species Information"
+
+    # --- Nuclear ---
+    OCA = "Original Classification Authority"
+    NNSA = "NNSA Information"
+    NUCLEAR = "Nuclear Security Information"
+
+    # --- Operations Security ---
+    OPSEC = "Operations Security"
+    COMSEC = "Communications Security"
+
+    # --- Patent ---
+    PATENT = "Patent Application Information"
+    INVENTION = "Invention Secrecy Act"
+
+    # --- Privacy ---
+    PRVCY = "Privacy"
+    PII = "Personally Identifiable Information"
+    HIPAA = "Health Insurance Portability and Accountability Act"
+    GENE = "Genetic Information"
+    SORN = "System of Records Notice Information"
+    EDUCATIONAL = "Student Educational Records (FERPA)"
+    SUBSTANCE = "Substance Abuse Treatment Records (42 CFR Part 2)"
+
+    # --- Provisional (Legacy) ---
     FOUO = "For Official Use Only"
     SBU = "Sensitive But Unclassified"
-    SSI = "Sensitive Security Information"
-    PCII = "Protected Critical Infrastructure Information"
+
+    # --- Public Health ---
     PHLTH = "Public Health"
-    TAX = "Federal Taxpayer Information"
-    LEGAL = "Legal Privilege"
-    OPSEC = "Operations Security"
+    SELECT_AGENT = "Select Agent and Toxin Information"
+    BSAT = "Biological Select Agents and Toxins"
+    PANDEMIC = "Pandemic Preparedness Information"
+
+    # --- Safety ---
+    SAFETY_ACT = "SAFETY Act Information"
+    CHEM = "Chemical Facility Anti-Terrorism Standards"
+
+    # --- Statistical ---
+    CENSUS = "Census"
+    CIPSEA = "CIPSEA Statistical Information"
+
+    # --- Technology & Science ---
+    SBIR = "Small Business Innovation Research"
+    STTR = "Small Business Technology Transfer"
+    RESEARCH = "Controlled Research Information"
+
+    # --- Physical & Information Security ---
     PHYS = "Physical Security"
     INFOSEC = "Information Systems Vulnerability Information"
-    BUDGT = "Budget"
-    CENSUS = "Census"
-    DCRIT = "Critical Infrastructure"
-    FISA = "Foreign Intelligence Surveillance Act"
-    GENE = "Genetic Information"
+    VULN = "Vulnerability Assessment Information"
+    PENTEST = "Penetration Testing Information"
+    INCIDENT = "Cybersecurity Incident Information"
+
+    # --- Geospatial ---
     GEO = "Geospatial"
-    PII = "Personally Identifiable Information"
-    SAMI = "Controlled Technical Information - Space"
+    GEO_PROD = "Geospatial Product Information"
+
+    # --- Transportation ---
+    SSTI = "Sensitive Surface Transportation Information"
+    RAIL = "Rail Security Information"
 
 
 class ClassificationLevel(Enum):
class ClassificationLevel(Enum):
@@ -165,6 +268,16 @@ class CUIDetector:
         re.compile(r'\b(LES|LAW\s+ENFORCEMENT\s+SENSITIVE)\b', re.IGNORECASE): CUICategory.LES,
         re.compile(r'\b(SSI|SENSITIVE\s+SECURITY\s+INFORMATION)\b', re.IGNORECASE): CUICategory.SSI,
         re.compile(r'\b(PCII|PROTECTED\s+CRITICAL\s+INFRASTRUCTURE\s+INFORMATION)\b', re.IGNORECASE): CUICategory.PCII,
+        re.compile(r'\b(CEII|CRITICAL\s+ENERGY\s+INFRASTRUCTURE\s+INFORMATION)\b', re.IGNORECASE): CUICategory.CEII,
+        re.compile(r'\b(UCNI|UNCLASSIFIED\s+CONTROLLED\s+NUCLEAR\s+INFORMATION)\b', re.IGNORECASE): CUICategory.UCNI,
+        re.compile(r'\b(NNPI|NAVAL\s+NUCLEAR\s+PROPULSION\s+INFORMATION)\b', re.IGNORECASE): CUICategory.NNPI,
+        re.compile(r'\b(FTI|FEDERAL\s+TAX\s+INFORMATION)\b', re.IGNORECASE): CUICategory.FTI,
+        re.compile(r'\b(SBIR|SMALL\s+BUSINESS\s+INNOVATION\s+RESEARCH)\b', re.IGNORECASE): CUICategory.SBIR,
+        re.compile(r'\b(STTR|SMALL\s+BUSINESS\s+TECHNOLOGY\s+TRANSFER)\b', re.IGNORECASE): CUICategory.STTR,
+        re.compile(r'\b(COMSEC|COMMUNICATIONS\s+SECURITY)\b', re.IGNORECASE): CUICategory.COMSEC,
+        re.compile(r'\b(SAFETY\s+ACT\s+(?:PROTECTED|INFORMATION))\b', re.IGNORECASE): CUICategory.SAFETY_ACT,
+        re.compile(r'\b(CHEM[-\s]?SECURITY|CFATS|CHEMICAL\s+FACILITY\s+ANTI[-\s]?TERRORISM)\b', re.IGNORECASE): CUICategory.CHEM,
+        re.compile(r'\b(DELIBERATIVE\s+(?:PROCESS|PRIVILEGE))\b', re.IGNORECASE): CUICategory.DELIBERATIVE,
     }
 
     # Classification banner patterns
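
The hunk above extends a mapping from compiled regex to category. How `CUIDetector` consumes that mapping is not shown here, so the following is only a sketch of one plausible scan loop. It copies three of the added patterns verbatim, while the `Category` enum, `LEGACY_PATTERNS`, and `find_legacy_markings` are hypothetical stand-ins.

```python
import re
from enum import Enum
from typing import List, Tuple


class Category(Enum):
    """Hypothetical stand-in for a few CUICategory members."""
    CEII = "Critical Energy Infrastructure Information"
    FTI = "Federal Tax Information"
    CHEM = "Chemical Facility Anti-Terrorism Standards"


# Compiled patterns are hashable, so they can serve as dict keys.
LEGACY_PATTERNS = {
    re.compile(r'\b(CEII|CRITICAL\s+ENERGY\s+INFRASTRUCTURE\s+INFORMATION)\b',
               re.IGNORECASE): Category.CEII,
    re.compile(r'\b(FTI|FEDERAL\s+TAX\s+INFORMATION)\b',
               re.IGNORECASE): Category.FTI,
    re.compile(r'\b(CHEM[-\s]?SECURITY|CFATS|CHEMICAL\s+FACILITY\s+ANTI[-\s]?TERRORISM)\b',
               re.IGNORECASE): Category.CHEM,
}


def find_legacy_markings(text: str) -> List[Tuple[Category, str, int]]:
    """Return (category, matched_text, offset) tuples sorted by offset."""
    hits = []
    for pattern, category in LEGACY_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((category, match.group(1), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


sample = "Marked CEII. Contains Federal Tax Information under CFATS."
for category, matched, offset in find_legacy_markings(sample):
    print(category.name, repr(matched), offset)
```

Because the patterns use `re.IGNORECASE`, short acronyms also match in lowercase, so a caller may want to weight acronym-only hits lower than spelled-out phrases.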
