Commit 035ea52

Add ShredGuard CLI tool for PHI pattern detection and redaction

Implement a Python CLI tool that scans files for configurable regex patterns (PHI identifiers) and optionally replaces matches with deterministic pseudonyms.

Features include:
- Command to scan for PHI patterns
- Command to replace with pseudonyms
- Pre-commit framework integration
- Respects .gitignore patterns
- Ruff-style output format (file:line:col: SGxxx Message)
- Prefix collision detection before replacements
- JSON mapping output for traceability
- Cross-platform CI (Python 3.10-3.12 on Linux/macOS/Windows)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent ef84107 commit 035ea52

18 files changed: +2647 -1 lines changed

.github/workflows/ci.yml

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, macos-latest, windows-latest]
+        python-version: ["3.10", "3.11", "3.12"]
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Run tests
+        run: |
+          pytest -v --cov=shredguard --cov-report=xml
+
+      - name: Upload coverage
+        uses: codecov/codecov-action@v4
+        if: matrix.os == 'ubuntu-latest' && matrix.python-version == '3.12'
+        with:
+          files: ./coverage.xml
+          fail_ci_if_error: false
+
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install ruff
+
+      - name: Run ruff check
+        run: ruff check src/ tests/
+
+      - name: Run ruff format check
+        run: ruff format --check src/ tests/

.gitignore

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Distribution / packaging
+dist/
+build/
+*.egg-info/
+*.egg
+
+# Virtual environments
+.venv/
+venv/
+ENV/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+coverage.xml
+
+# OS
+.DS_Store
+Thumbs.db

.pre-commit-hooks.yaml

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
+- id: shredguard-check
+  name: shredguard check
+  description: Check for PHI patterns in files
+  entry: shredguard check
+  language: python
+  types: [text]
+  require_serial: false
+
+- id: shredguard-fix
+  name: shredguard fix
+  description: Replace PHI patterns with pseudonyms
+  entry: shredguard fix
+  language: python
+  types: [text]
+  require_serial: false

README.md

Lines changed: 216 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1,216 @@
-# shred-guard
+# ShredGuard
+
+A CLI tool to scan files for configurable regex patterns (PHI identifiers) and optionally replace matches with deterministic pseudonyms. Integrates with pre-commit framework.
+
+## Features
+
+- Scan files for PHI patterns using configurable regex
+- Replace matches with deterministic pseudonyms (same value = same ID)
+- Pre-commit integration for automated checks
+- Respects `.gitignore` patterns
+- Ruff-style output format (`file:line:col: SGxxx Message`)
+- Cross-platform support (Windows, macOS, Linux)
+
+## Installation
+
+```bash
+pip install shredguard
+```
+
+Or install from source:
+
+```bash
+pip install -e .
+```
+
+## Quick Start
+
+1. Add configuration to your `pyproject.toml`:
+
+```toml
+[tool.shredguard]
+[[tool.shredguard.patterns]]
+regex = "SUB-\\d{4,6}"
+description = "Subject ID"
+
+[[tool.shredguard.patterns]]
+regex = "\\b\\d{3}-\\d{2}-\\d{4}\\b"
+description = "SSN-like pattern"
+
+[[tool.shredguard.patterns]]
+regex = "MRN\\d{6,10}"
+description = "Medical Record Number"
+```
+
+2. Run a check:
+
+```bash
+shredguard check .
+```
+
+3. Fix (replace) found patterns:
+
+```bash
+shredguard fix --prefix REDACTED .
+```
+
+## Pre-commit Setup
+
+Add to your `.pre-commit-config.yaml`:
+
+```yaml
+repos:
+  - repo: https://github.com/shredguard/shredguard
+    rev: v0.1.0
+    hooks:
+      - id: shredguard-check
+```
+
+Or to automatically fix:
+
+```yaml
+repos:
+  - repo: https://github.com/shredguard/shredguard
+    rev: v0.1.0
+    hooks:
+      - id: shredguard-fix
+        args: [--prefix, REDACTED]
+```
+
+## CLI Reference
+
+### `shredguard check`
+
+Scan files for PHI patterns.
+
+```bash
+shredguard check [OPTIONS] [FILES]...
+```
+
+**Arguments:**
+- `FILES` - Files or directories to scan (default: current directory)
+
+**Options:**
+- `--all-files` - Scan all files (typically used with pre-commit)
+- `--no-gitignore` - Don't respect `.gitignore` patterns
+- `--config PATH` - Path to config file (default: searches for `pyproject.toml`)
+- `--verbose, -v` - Show verbose output (skipped files, etc.)
+
+**Exit codes:**
+- `0` - No matches found
+- `1` - Matches found or error
+
+### `shredguard fix`
+
+Replace PHI patterns with pseudonyms.
+
+```bash
+shredguard fix [OPTIONS] [FILES]...
+```
+
+**Arguments:**
+- `FILES` - Files or directories to scan and fix (default: current directory)
+
+**Options:**
+- `--prefix PREFIX` - Prefix for pseudonyms (default: `REDACTED`)
+- `--output-map PATH` - Write JSON mapping of originals to pseudonyms
+- `--all-files` - Scan all files
+- `--no-gitignore` - Don't respect `.gitignore` patterns
+- `--config PATH` - Path to config file
+- `--verbose, -v` - Show verbose output
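
The commit message lists "prefix collision detection before replacements", but this part of the diff does not show how it works. One plausible reading, sketched below as an assumption, is that the tool refuses to run when the target files already contain tokens shaped like `PREFIX-<number>`, since new pseudonyms would then be indistinguishable from existing text. The function name `find_prefix_collisions` is hypothetical and not taken from the ShredGuard source.

```python
import re


def find_prefix_collisions(texts, prefix):
    """Return existing tokens shaped like '<prefix>-<digits>' found in any text.

    Hypothetical helper: the real ShredGuard check may work differently.
    """
    token = re.compile(rf"\b{re.escape(prefix)}-\d+\b")
    found = set()
    for text in texts:
        found.update(token.findall(text))
    return sorted(found)


# "ANON-3" already appears, so replacing with the ANON prefix would be ambiguous.
collisions = find_prefix_collisions(["see ANON-3 in notes", "clean text"], "ANON")
print(collisions)
```

A caller could abort the fix and ask for a different `--prefix` whenever the returned list is non-empty.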
+
+## Configuration Reference
+
+Configuration is read from `pyproject.toml` under the `[tool.shredguard]` section.
+
+### Patterns
+
+Define patterns to scan for:
+
+```toml
+[[tool.shredguard.patterns]]
+regex = "SUB-\\d{4,6}"           # Required: regex pattern
+description = "Subject ID"       # Optional: description for output
+files = ["*.csv", "data/**"]     # Optional: only scan matching files
+exclude_files = ["*_test.*"]     # Optional: exclude matching files
+```
+
+**Pattern fields:**
+- `regex` (required) - Regular expression pattern to match
+- `description` (optional) - Human-readable description shown in output
+- `files` (optional) - List of glob patterns; only scan files matching these
+- `exclude_files` (optional) - List of glob patterns; skip files matching these
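
The `files` and `exclude_files` fields could be applied per path roughly as follows. This is a minimal sketch assuming `fnmatch`-style matching against the whole relative path, with excludes taking priority; the README does not say which glob dialect ShredGuard uses, and `pattern_applies` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def pattern_applies(path, files=None, exclude_files=None):
    """Decide whether a pattern should be run against a POSIX-style path."""
    if exclude_files and any(fnmatchcase(path, glob) for glob in exclude_files):
        return False  # assumed: exclusion wins over inclusion
    if files:
        return any(fnmatchcase(path, glob) for glob in files)
    return True  # no 'files' filter means the pattern applies everywhere


print(pattern_applies("data/raw/a.txt", files=["*.csv", "data/**"]))
print(pattern_applies("notes_test.csv", files=["*.csv"], exclude_files=["*_test.*"]))
```

In `fnmatch`, `*` also matches `/`, so `*.csv` matches files in any subdirectory. A gitignore-style matcher would behave differently here.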
+
+### Pattern Codes
+
+Patterns are assigned stable codes (SG001, SG002, etc.) based on their order in the config file.
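
To make the numbering and the `file:line:col: SGxxx Message` format concrete, here is a small sketch that derives each code from the pattern's position and reports 1-based line and column numbers. `scan_text` is a hypothetical function written for illustration, not ShredGuard's scanner.

```python
import re

PATTERNS = [
    {"regex": r"SUB-\d{4,6}", "description": "Subject ID"},
    {"regex": r"\b\d{3}-\d{2}-\d{4}\b", "description": "SSN-like pattern"},
]


def scan_text(filename, text, patterns):
    """Return Ruff-style report lines, ordered by line and column."""
    hits = []
    for index, pattern in enumerate(patterns, start=1):
        code = f"SG{index:03d}"  # position in the config decides the code
        regex = re.compile(pattern["regex"])
        for line_no, line in enumerate(text.splitlines(), start=1):
            for match in regex.finditer(line):
                col = match.start() + 1  # columns are 1-based
                message = f"{code} {pattern['description']} [{match.group(0)}]"
                hits.append((line_no, col, f"{filename}:{line_no}:{col}: {message}"))
    return [report for _, _, report in sorted(hits)]


sample = "Patient SUB-1234 was enrolled on 2024-01-15\nSSN: 123-45-6789\n"
for report in scan_text("patient_notes.txt", sample, PATTERNS):
    print(report)
```

Because codes follow config order, reordering or inserting patterns in `pyproject.toml` would renumber them under this scheme.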
+
+## Example Workflow
+
+1. Create test files with PHI:
+
+```bash
+echo "Patient SUB-1234 was enrolled on 2024-01-15" > patient_notes.txt
+echo "SSN: 123-45-6789" >> patient_notes.txt
+```
+
+2. Check for PHI:
+
+```bash
+$ shredguard check patient_notes.txt
+patient_notes.txt:1:9: SG001 Subject ID [SUB-1234]
+patient_notes.txt:2:6: SG002 SSN-like pattern [123-45-6789]
+
+Found 2 matches in 1 file
+```
+
+3. Replace PHI with pseudonyms:
+
+```bash
+$ shredguard fix --prefix ANON --output-map mapping.json patient_notes.txt
+Replaced 2 occurrences of 2 unique values in 1 file
+Mapping written to: mapping.json
+```
+
+4. Verify replacement:
+
+```bash
+$ cat patient_notes.txt
+Patient ANON-0 was enrolled on 2024-01-15
+SSN: ANON-1
+
+$ cat mapping.json
+{
+  "SUB-1234": "ANON-0",
+  "123-45-6789": "ANON-1"
+}
+```
+
+## Deterministic Replacement
+
+ShredGuard uses deterministic pseudonym assignment:
+- The same matched value always gets the same pseudonym within a single run
+- IDs are assigned in order of first encounter (0, 1, 2, ...)
+- The mapping can be saved to a JSON file for reference
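
The three rules above can be sketched with a dictionary that grows as new values are seen. This is an illustrative assumption about the mechanism, not the code in the commit; `pseudonymize` is a hypothetical name, and combining all regexes into one alternation is a simplification.

```python
import json
import re


def pseudonymize(text, regexes, prefix, mapping=None):
    """Replace every match with '<prefix>-<n>', reusing n for repeated values."""
    mapping = {} if mapping is None else mapping
    combined = re.compile("|".join(f"(?:{regex})" for regex in regexes))

    def replace(match):
        value = match.group(0)
        if value not in mapping:
            # First encounter: the next ID is the current mapping size.
            mapping[value] = f"{prefix}-{len(mapping)}"
        return mapping[value]

    return combined.sub(replace, text), mapping


text = "Patient SUB-1234 was enrolled\nSSN: 123-45-6789\nFollow-up for SUB-1234\n"
redacted, mapping = pseudonymize(text, [r"SUB-\d{4,6}", r"\b\d{3}-\d{2}-\d{4}\b"], "ANON")
print(redacted)
print(json.dumps(mapping, indent=2))
```

Passing the same `mapping` dictionary into later calls keeps pseudonyms consistent across several files in one run. Across separate runs the IDs depend on encounter order, which is why saving the mapping is useful.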
+
+## Binary File Detection
+
+Binary files are automatically detected and skipped using a null byte heuristic (checking first 8KB). Use `--verbose` to see which files are skipped.
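
A null byte heuristic of this kind usually amounts to reading the first block of the file in binary mode and looking for `\x00`. The sketch below shows that idea with an 8192-byte sample; `looks_binary` is a hypothetical name and the real implementation may differ in detail.

```python
import os
import tempfile


def looks_binary(path, sample_size=8192):
    """Treat a file as binary if its first sample_size bytes contain a null byte."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(sample_size)


# Demonstrate on two temporary files: plain text and a blob with a null byte.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "notes.txt")
    blob_path = os.path.join(tmp, "blob.bin")
    with open(text_path, "wb") as handle:
        handle.write(b"Patient SUB-1234\n")
    with open(blob_path, "wb") as handle:
        handle.write(b"\x89PNG\x00\x01")
    text_is_binary = looks_binary(text_path)
    blob_is_binary = looks_binary(blob_path)

print(text_is_binary, blob_is_binary)
```

One known limitation of this heuristic is that UTF-16 text contains null bytes and would be classified as binary, so such files would be skipped.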
+
+## Development
+
+Install development dependencies:
+
+```bash
+pip install -e ".[dev]"
+```
+
+Run tests:
+
+```bash
+pytest
+```
+
+## License
+
+MIT
