|
1 | | -# ShredGuard |
| 1 | +# ShredGuard by [WISC Lab](https://kidspeech.wisc.edu/) |
2 | 2 |
|
3 | | -A CLI tool to scan files for configurable regex patterns (PHI identifiers) and optionally replace matches with deterministic pseudonyms. Integrates with pre-commit framework. |
| 3 | +[CI](https://github.com/WISCLab/shred-guard/actions/workflows/ci.yml) [CD](https://github.com/WISCLab/shred-guard/actions/workflows/cd.yml) |
4 | 4 |
|
5 | | -## Features |
| 5 | +Scan files for PHI (Protected Health Information) patterns and replace them with deterministic pseudonyms. Integrates seamlessly with pre-commit hooks. |
6 | 6 |
|
7 | | -- Scan files for PHI patterns using configurable regex |
8 | | -- Replace matches with deterministic pseudonyms (same value = same ID) |
9 | | -- Pre-commit integration for automated checks |
10 | | -- Respects `.gitignore` patterns |
11 | | -- Ruff-style output format (`file:line:col: SGxxx Message`) |
12 | | -- Cross-platform support (Windows, macOS, Linux) |
13 | 7 |
|
14 | | -## Installation |
| 8 | +## Contents |
15 | 9 |
|
16 | | -```bash |
17 | | -pip install shredguard |
18 | | -``` |
| 10 | +- [Value Proposition](https://raw.githubusercontent.com/WiscLab/shred-guard/main/value-proposition.svg) |
| 11 | +- [Installation](#installation) |
| 12 | +- [Quick Start](#quick-start) |
| 13 | +- [Commands](#commands) |
| 14 | + - [shredguard init](#shredguard-init) |
| 15 | + - [shredguard check](#shredguard-check) |
| 16 | + - [shredguard fix](#shredguard-fix) |
| 17 | +- [Configuration](#configuration) |
| 18 | +- [Pre-commit](#pre-commit) |
| 19 | +- [Reference](#reference) |
| 20 | + - [CLI Options](#cli-options) |
| 21 | + - [Configuration Reference](#configuration-reference) |
| 22 | + - [Built-in Pattern Suggestions](#built-in-pattern-suggestions) |
| 23 | + - [Exit Codes](#exit-codes) |
| 24 | + - [Binary File Handling](#binary-file-handling) |
| 25 | +- [License](#license) |
19 | 26 |
|
20 | | -Or install from source: |
| 27 | +## Installation |
21 | 28 |
|
22 | 29 | ```bash |
23 | | -pip install -e . |
| 30 | +pip install shred-guard |
24 | 31 | ``` |
25 | 32 |
|
26 | 33 | ## Quick Start |
27 | 34 |
|
28 | | -1. Add configuration to your `pyproject.toml`: |
29 | | - |
30 | | -```toml |
31 | | -[tool.shredguard] |
32 | | -[[tool.shredguard.patterns]] |
33 | | -regex = "SUB-\\d{4,6}" |
34 | | -description = "Subject ID" |
35 | | - |
36 | | -[[tool.shredguard.patterns]] |
37 | | -regex = "\\b\\d{3}-\\d{2}-\\d{4}\\b" |
38 | | -description = "SSN-like pattern" |
39 | | - |
40 | | -[[tool.shredguard.patterns]] |
41 | | -regex = "MRN\\d{6,10}" |
42 | | -description = "Medical Record Number" |
43 | | -``` |
44 | | - |
45 | | -2. Run a check: |
| 35 | +Run the interactive setup wizard: |
46 | 36 |
|
47 | 37 | ```bash |
48 | | -shredguard check . |
| 38 | +shredguard init |
49 | 39 | ``` |
50 | 40 |
|
51 | | -3. Fix (replace) found patterns: |
| 41 | +This walks you through: |
| 42 | +- Selecting PHI patterns to detect (SSNs, emails, MRNs, custom patterns) |
| 43 | +- Configuring file restrictions |
| 44 | +- Setting up pre-commit hooks |
52 | 45 |
|
53 | | -```bash |
54 | | -shredguard fix --prefix REDACTED . |
55 | | -``` |
| 46 | +## Commands |
56 | 47 |
|
57 | | -## Pre-commit Setup |
| 48 | +### `shredguard init` |
58 | 49 |
|
59 | | -Add to your `.pre-commit-config.yaml`: |
60 | | - |
61 | | -```yaml |
62 | | -repos: |
63 | | - - repo: https://github.com/shredguard/shredguard |
64 | | - rev: v0.1.0 |
65 | | - hooks: |
66 | | - - id: shredguard-check |
67 | | -``` |
68 | | -
|
69 | | -Or to automatically fix: |
70 | | -
|
71 | | -```yaml |
72 | | -repos: |
73 | | - - repo: https://github.com/shredguard/shredguard |
74 | | - rev: v0.1.0 |
75 | | - hooks: |
76 | | - - id: shredguard-fix |
77 | | - args: [--prefix, REDACTED] |
78 | | -``` |
79 | | -
|
80 | | -## CLI Reference |
| 50 | +Interactive setup wizard. Creates your configuration and optionally sets up pre-commit integration. |
81 | 51 |
|
82 | 52 | ### `shredguard check` |
83 | 53 |
|
84 | | -Scan files for PHI patterns. |
| 54 | +Scan for PHI patterns: |
85 | 55 |
|
86 | 56 | ```bash |
87 | | -shredguard check [OPTIONS] [FILES]... |
| 57 | +shredguard check . # Scan current directory |
| 58 | +shredguard check data/ notes.txt # Scan specific paths |
88 | 59 | ``` |
89 | 60 |
|
90 | | -**Arguments:** |
91 | | -- `FILES` - Files or directories to scan (default: current directory) |
92 | | - |
93 | | -**Options:** |
94 | | -- `--all-files` - Scan all files (typically used with pre-commit) |
95 | | -- `--no-gitignore` - Don't respect `.gitignore` patterns |
96 | | -- `--config PATH` - Path to config file (default: searches for `pyproject.toml`) |
97 | | -- `--verbose, -v` - Show verbose output (skipped files, etc.) |
98 | | - |
99 | | -**Exit codes:** |
100 | | -- `0` - No matches found |
101 | | -- `1` - Matches found or error |
| 61 | +Output uses ruff-style formatting: |
| 62 | +``` |
| 63 | +patient_notes.txt:1:9: SG001 Subject ID [SUB-1234] |
| 64 | +patient_notes.txt:2:6: SG002 SSN [123-45-6789] |
| 65 | +``` |
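The output format above can be reproduced with a short Python sketch. This is not ShredGuard's implementation; `scan_text` is a hypothetical helper, and it assumes that codes follow pattern order (SG001 for the first pattern, SG002 for the second) and that columns are 1-based, which is what the sample output suggests.

```python
import re


def scan_text(filename, text, patterns):
    """Return ruff-style findings for every regex match in `text`.

    `patterns` is a list of (regex, description) tuples. The SGxxx code is
    derived from the position of the pattern in that list (an assumption).
    """
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for index, (regex, description) in enumerate(patterns, start=1):
            for match in re.finditer(regex, line):
                column = match.start() + 1  # 1-based column, as in the sample
                findings.append(
                    f"{filename}:{line_no}:{column}: "
                    f"SG{index:03d} {description} [{match.group(0)}]"
                )
    return findings


if __name__ == "__main__":
    sample = "Patient SUB-1234 was enrolled on 2024-01-15\nSSN: 123-45-6789\n"
    rules = [
        (r"SUB-\d{4,6}", "Subject ID"),
        (r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
    ]
    for finding in scan_text("patient_notes.txt", sample, rules):
        print(finding)
```

Findings on the same line are ordered by pattern first and position second here; the real tool may order them differently.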
102 | 66 |
|
103 | 67 | ### `shredguard fix` |
104 | 68 |
|
105 | | -Replace PHI patterns with pseudonyms. |
| 69 | +Replace PHI with pseudonyms: |
106 | 70 |
|
107 | 71 | ```bash |
108 | | -shredguard fix [OPTIONS] [FILES]... |
| 72 | +shredguard fix . # Replace with REDACTED-0, REDACTED-1, ... |
| 73 | +shredguard fix --prefix ANON . # Custom prefix: ANON-0, ANON-1, ... |
| 74 | +shredguard fix --output-map mapping.json . # Save original -> pseudonym mapping |
109 | 75 | ``` |
110 | 76 |
|
111 | | -**Arguments:** |
112 | | -- `FILES` - Files or directories to scan and fix (default: current directory) |
113 | | - |
114 | | -**Options:** |
115 | | -- `--prefix PREFIX` - Prefix for pseudonyms (default: `REDACTED`) |
116 | | -- `--output-map PATH` - Write JSON mapping of originals to pseudonyms |
117 | | -- `--all-files` - Scan all files |
118 | | -- `--no-gitignore` - Don't respect `.gitignore` patterns |
119 | | -- `--config PATH` - Path to config file |
120 | | -- `--verbose, -v` - Show verbose output |
| 77 | +Replacements are deterministic: the same value always gets the same pseudonym within a run. |
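One way to get this behavior is a dictionary that assigns the next counter value the first time a matched string is seen. The sketch below is illustrative only: `pseudonymize` is a hypothetical function, not part of ShredGuard's API, and it assumes IDs are numbered from 0 in order of first encounter.

```python
import re


def pseudonymize(text, patterns, prefix="REDACTED", mapping=None):
    """Replace each match with PREFIX-N, reusing N for repeated values.

    Returns the rewritten text and the original -> pseudonym mapping.
    Pass the same `mapping` dict again to keep IDs stable across files.
    """
    mapping = {} if mapping is None else mapping

    def substitute(match):
        value = match.group(0)
        if value not in mapping:
            # First encounter: the next ID is the current mapping size.
            mapping[value] = f"{prefix}-{len(mapping)}"
        return mapping[value]

    # Join the patterns so the text is rewritten in a single pass.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    return combined.sub(substitute, text), mapping


if __name__ == "__main__":
    rules = [r"SUB-\d{4,6}", r"\b\d{3}-\d{2}-\d{4}\b"]
    redacted, table = pseudonymize(
        "Patient SUB-1234, SSN: 123-45-6789, again SUB-1234", rules, prefix="ANON"
    )
    print(redacted)
    print(table)
```

Because the mapping lives only in memory, pseudonyms are stable within one run but not across runs unless the mapping is saved and reloaded, which is what an option like `--output-map` makes possible.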
121 | 78 |
|
122 | | -## Configuration Reference |
| 79 | +## Configuration |
123 | 80 |
|
124 | | -Configuration is read from `pyproject.toml` under the `[tool.shredguard]` section. |
| 81 | +Configuration lives in `pyproject.toml` (or any other TOML file passed with `--config`): |
125 | 82 |
|
126 | | -### Patterns |
| 83 | +```toml |
| 84 | +[tool.shredguard] |
127 | 85 |
|
128 | | -Define patterns to scan for: |
| 86 | +[[tool.shredguard.patterns]] |
| 87 | +regex = "SUB-\\d{4,6}" |
| 88 | +description = "Subject ID" |
129 | 89 |
|
130 | | -```toml |
131 | 90 | [[tool.shredguard.patterns]] |
132 | | -regex = "SUB-\\d{4,6}" # Required: regex pattern |
133 | | -description = "Subject ID" # Optional: description for output |
134 | | -files = ["*.csv", "data/**"] # Optional: only scan matching files |
135 | | -exclude_files = ["*_test.*"] # Optional: exclude matching files |
| 91 | +regex = "\\b\\d{3}-\\d{2}-\\d{4}\\b" |
| 92 | +description = "SSN" |
136 | 93 | ``` |
137 | 94 |
|
138 | | -**Pattern fields:** |
139 | | -- `regex` (required) - Regular expression pattern to match |
140 | | -- `description` (optional) - Human-readable description shown in output |
141 | | -- `files` (optional) - List of glob patterns; only scan files matching these |
142 | | -- `exclude_files` (optional) - List of glob patterns; skip files matching these |
143 | | - |
144 | | -### Pattern Codes |
145 | | - |
146 | | -Patterns are assigned stable codes (SG001, SG002, etc.) based on their order in the config file. |
| 95 | +Each pattern can optionally include `files` and `exclude_files` globs to control which files are scanned. |
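As a rough illustration of how `files` and `exclude_files` could narrow a pattern's scope, the sketch below applies the globs with Python's `fnmatch`. The helper `pattern_applies` is hypothetical, and the glob semantics ShredGuard actually uses may differ (for example in how `**` or path separators are treated).

```python
from fnmatch import fnmatchcase


def pattern_applies(pattern, path):
    """Decide whether a pattern table should be checked against `path`.

    `pattern` mirrors one [[tool.shredguard.patterns]] table as a dict.
    Assumed rule: a missing `files` list means every file is included,
    and `exclude_files` always wins over `files`.
    """
    include = pattern.get("files")
    if include and not any(fnmatchcase(path, glob) for glob in include):
        return False
    exclude = pattern.get("exclude_files", [])
    return not any(fnmatchcase(path, glob) for glob in exclude)


if __name__ == "__main__":
    subject_id = {
        "regex": r"SUB-\d{4,6}",
        "files": ["*.csv", "data/**"],
        "exclude_files": ["*_test.*"],
    }
    for candidate in ["results.csv", "data/raw/visit.txt", "notes.txt", "results_test.csv"]:
        print(candidate, pattern_applies(subject_id, candidate))
```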
147 | 96 |
|
148 | | -## Example Workflow |
| 97 | +## Pre-commit |
149 | 98 |
|
150 | | -1. Create test files with PHI: |
| 99 | +Add to `.pre-commit-config.yaml`: |
151 | 100 |
|
152 | | -```bash |
153 | | -echo "Patient SUB-1234 was enrolled on 2024-01-15" > patient_notes.txt |
154 | | -echo "SSN: 123-45-6789" >> patient_notes.txt |
| 101 | +```yaml |
| 102 | +repos: |
| 103 | + - repo: local |
| 104 | + hooks: |
| 105 | + - id: shredguard-check |
| 106 | + name: shredguard check |
| 107 | + entry: shredguard check |
| 108 | + language: system |
| 109 | + types: [text] |
155 | 110 | ``` |
156 | 111 |
|
157 | | -2. Check for PHI: |
| 112 | +Or let `shredguard init` set this up for you. |
158 | 113 |
|
159 | | -```bash |
160 | | -$ shredguard check patient_notes.txt |
161 | | -patient_notes.txt:1:9: SG001 Subject ID [SUB-1234] |
162 | | -patient_notes.txt:2:6: SG002 SSN-like pattern [123-45-6789] |
| 114 | +## Reference |
163 | 115 |
|
164 | | -Found 2 matches in 1 file |
165 | | -``` |
| 116 | +### CLI Options |
166 | 117 |
|
167 | | -3. Replace PHI with pseudonyms: |
| 118 | +**`shredguard check [OPTIONS] [FILES]...`** |
168 | 119 |
|
169 | | -```bash |
170 | | -$ shredguard fix --prefix ANON --output-map mapping.json patient_notes.txt |
171 | | -Replaced 2 occurrences of 2 unique values in 1 file |
172 | | -Mapping written to: mapping.json |
173 | | -``` |
| 120 | +| Option | Description | |
| 121 | +|--------|-------------| |
| 122 | +| `--all-files` | Scan all files recursively | |
| 123 | +| `--no-gitignore` | Don't respect `.gitignore` patterns | |
| 124 | +| `--config PATH` | Path to config file | |
| 125 | +| `-v, --verbose` | Show verbose output (skipped files, etc.) | |
174 | 126 |
|
175 | | -4. Verify replacement: |
| 127 | +**`shredguard fix [OPTIONS] [FILES]...`** |
176 | 128 |
|
177 | | -```bash |
178 | | -$ cat patient_notes.txt |
179 | | -Patient ANON-0 was enrolled on 2024-01-15 |
180 | | -SSN: ANON-1 |
181 | | -
|
182 | | -$ cat mapping.json |
183 | | -{ |
184 | | - "SUB-1234": "ANON-0", |
185 | | - "123-45-6789": "ANON-1" |
186 | | -} |
187 | | -``` |
| 129 | +| Option | Description | |
| 130 | +|--------|-------------| |
| 131 | +| `--prefix TEXT` | Prefix for pseudonyms (default: `REDACTED`) | |
| 132 | +| `--output-map PATH` | Write JSON mapping of originals to pseudonyms | |
| 133 | +| `--all-files` | Scan all files recursively | |
| 134 | +| `--no-gitignore` | Don't respect `.gitignore` patterns | |
| 135 | +| `--config PATH` | Path to config file | |
| 136 | +| `-v, --verbose` | Show verbose output | |
188 | 137 |
|
189 | | -## Deterministic Replacement |
| 138 | +### Configuration Reference |
190 | 139 |
|
191 | | -ShredGuard uses deterministic pseudonym assignment: |
192 | | -- The same matched value always gets the same pseudonym within a single run |
193 | | -- IDs are assigned in order of first encounter (0, 1, 2, ...) |
194 | | -- The mapping can be saved to a JSON file for reference |
| 140 | +```toml |
| 141 | +[[tool.shredguard.patterns]] |
| 142 | +regex = "SUB-\\d{4,6}" # Required: regex pattern |
| 143 | +description = "Subject ID" # Optional: shown in output |
| 144 | +files = ["*.csv", "data/**"] # Optional: only scan matching files |
| 145 | +exclude_files = ["*_test.*"] # Optional: skip matching files |
| 146 | +``` |
195 | 147 |
|
196 | | -## Binary File Detection |
| 148 | +### Built-in Pattern Suggestions |
197 | 149 |
|
198 | | -Binary files are automatically detected and skipped using a null byte heuristic (checking first 8KB). Use `--verbose` to see which files are skipped. |
| 150 | +When running `shredguard init`, you can choose from these common patterns: |
199 | 151 |
|
200 | | -## Development |
| 152 | +| Pattern | Description | |
| 153 | +|---------|-------------| |
| 154 | +| `SUB-\d{4,6}` | Subject ID | |
| 155 | +| `\b\d{3}-\d{2}-\d{4}\b` | Social Security Number | |
| 156 | +| `MRN\d{6,10}` | Medical Record Number | |
| 157 | +| `[email pattern]` | Email addresses | |
| 158 | +| `[phone pattern]` | Phone numbers (10 digits) | |
| 159 | +| `\b\d{5}(?:-\d{4})?\b` | ZIP codes | |
201 | 160 |
|
202 | | -Install development dependencies: |
| 161 | +### Exit Codes |
203 | 162 |
|
204 | | -```bash |
205 | | -pip install -e ".[dev]" |
206 | | -``` |
| 163 | +| Code | Meaning | |
| 164 | +|------|---------| |
| 165 | +| `0` | Success (no matches found for `check`) | |
| 166 | +| `1` | Matches found or error | |
207 | 167 |
|
208 | | -Run tests: |
| 168 | +### Binary File Handling |
209 | 169 |
|
210 | | -```bash |
211 | | -pytest |
212 | | -``` |
| 170 | +Binary files are automatically detected and skipped (null byte check in first 8KB). Use `--verbose` to see skipped files. |
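The null byte heuristic is simple enough to sketch in a few lines of Python. The function names below are hypothetical and the code is an approximation of the idea, not ShredGuard's source: read at most the first 8 KB and treat the file as binary if that sample contains a `\x00` byte.

```python
SAMPLE_SIZE = 8192  # 8 KB, the sample size named in the README


def looks_binary(sample: bytes) -> bool:
    """Return True if a null byte appears within the first 8 KB of `sample`."""
    return b"\x00" in sample[:SAMPLE_SIZE]


def is_binary_file(path: str) -> bool:
    """Read only the leading sample from disk and apply the heuristic."""
    with open(path, "rb") as handle:
        return looks_binary(handle.read(SAMPLE_SIZE))


if __name__ == "__main__":
    print(looks_binary(b"plain text"))
    print(looks_binary(b"\x89PNG\r\n\x1a\n\x00\x00"))
```

The heuristic is cheap but not exact: UTF-16 text contains null bytes and would be skipped, while a binary file whose first 8 KB happens to have no null byte would be scanned as text.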
213 | 171 |
|
214 | 172 | ## License |
215 | 173 |
|
216 | | -MIT |
| 174 | +[MIT](https://github.com/WiscLab/shred-guard/blob/main/LICENSE) |