README.md
36 additions & 8 deletions
@@ -172,6 +172,11 @@ usage: redactomatic.py
172
172
[--verbose]
173
173
[--no-verbose]
174
174
[--traceback]
175
+
[-sd, --startdate]
176
+
[-ed, --enddate]
177
+
[-co, --chunkoutstem]
178
+
[-is, --instem]
179
+
[-os, --outstem]
175
180
```
176
181
177
182
### Command Line Parameters
@@ -200,11 +205,18 @@ usage: redactomatic.py
200
205
|`--rulefile`| A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules). These are globbable. |*OPTIONAL*|
201
206
|`--regextest`| Test the regular expressions defined in the regex-test rules prior to any other processing. |*OPTIONAL*|
202
207
|`--testoutputfile`| The file to save the regular expression test results in. |*OPTIONAL*|
203
-
|`--traceback`</br>`--no-traceback`| Give traceback information when an exceptin causes the program to halt. | no-traceback |
208
+
|`--traceback`</br>`--no-traceback`| Give traceback information when an exception causes the program to halt. | no-traceback |
204
209
|`--version`| Print the version and exit |*TERMINAL*|
205
210
|`--verbose`</br>`--no-verbose`| Print the status of processing steps to standard output. | verbose |
211
+
|`-sd`,`--startdate`| Start date for date-based processing |*OPTIONAL*|
212
+
|`-ed`,`--enddate`| End date for date-based processing |*OPTIONAL*|
213
+
|`-co`,`--chunkoutstem`| Base name for chunked output files when using date-based processing |*OPTIONAL*|
214
+
|`-is`,`--instem`| Base name for input files when using date-based processing |*OPTIONAL*|
215
+
|`-os`,`--outstem`| Base name for output files when using date-based processing |*OPTIONAL*|
216
+
|`--chunkgather`| The number of chunks to gather into separate chunked output files using the `outstem` as the base name. |*OPTIONAL*|
206
217
207
-
(\*) ***The command must either specifiy the --header option or give the --column and --idcolumn options.***
218
+
219
+
(\*) ***The command must either specify the --header option or give the --column and --idcolumn options.***
### Example 6: Using date-based filenames and gathered chunks
270
+
271
+
To support date-based file naming, redactomatic can take input and output file stems plus start-date and end-date postfixes and build the input and output file names from them. It can also split the final output into gathered chunks if you want a number of smaller files in the output rather than one large one.
- Reads the input from the file `mydir/input_2024-01-01_2024-01-31.csv`
280
+
- Reads this input file in chunks of 1000 lines
281
+
- Appends these chunks into the output file `myotherdir/ouput_2024-01-01_2024-01-31.csv` as they are processed
282
+
- Every 40 chunks, the output file is copied to `andanotherdir/chunk_2024-01-01_2024-01-31_{chunk_no}.csv` where {chunk_no} will be 0,39,79,119 etc.
283
+
284
+
When the process has completed, the output file `myotherdir/ouput_2024-01-01_2024-01-31.csv` is deleted, leaving just the chunked output files.
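
The naming pattern in the bullets above can be sketched as follows. This is a minimal illustration and not redactomatic's own code: the helper `dated_name` is hypothetical, the `.csv` extension is assumed from the example file names, and the stems are the ones shown in the bullets.

```python
from typing import Optional


def dated_name(stem: str, start: str, end: str,
               chunk_no: Optional[int] = None, ext: str = ".csv") -> str:
    """Join a stem, start date, end date and optional chunk number with underscores.

    Hypothetical helper: it only mirrors the file names shown in Example 6.
    """
    parts = [stem, start, end]
    if chunk_no is not None:
        parts.append(str(chunk_no))
    return "_".join(parts) + ext


# Input file built from an input stem plus the two date postfixes.
input_file = dated_name("mydir/input", "2024-01-01", "2024-01-31")

# One gathered chunk file built from the chunk output stem.
chunk_file = dated_name("andanotherdir/chunk", "2024-01-01", "2024-01-31", chunk_no=39)

print(input_file)
print(chunk_file)
```

Keeping the dates as plain strings matches the `type=str` declaration of `--startdate` and `--enddate`; whether the tool validates the date format is not stated here.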
285
+
257
286
## Redaction Levels
258
287
259
288
You can define the entities that will be redacted (and hence anonymized). By default five levels are defined: '0', '1', '2', '3' and '4'. Level '0' does no redaction. It can be used in cases where the input is already redacted and translation of redaction tokens into standard formats is needed. Level '1' means that Redactomatic will only use the machine learning NER parser, which captures many entities but is not reliable and does not match addresses, phone numbers, SSNs, and other kinds of numbers that are probably important to recognize. Level '2' is the default level and matches most PII entities. However, it can miss numbers that aren't supported or are formatted in a way that Redactomatic hasn't seen before. Level '3' adds in cardinals (e.g. '123' or 'one two three') so that cardinal numbers are always redacted whether they are a recognized type or not. It also adds in dates, percentages and times as these are also numeric items. It does not redact ordinals (e.g. 'first', 'second', 'third'). Finally, level '4' contains all the entities from the machine learning NER parser and also adds in ordinals. It is unlikely that level 4 will be helpful but it is included as the highest level of redaction possible.
@@ -485,7 +514,7 @@ level:
485
514
486
515
**Example level definition (YAML)**
487
516
488
-
Any number of level definitions may be set. The defalt configuration files contain three level keys '1', '2', '3' and '4', an extract of which is shown above. The ` --level` option uses the entity list that is found in the relevant matching section. Level keys do not need to be numeric. You can add as many levels as you want.
517
+
Any number of level definitions may be set. The default configuration files contain five level keys '0', '1', '2', '3' and '4', an extract of which is shown above. The `--level` option uses the entity list that is found in the relevant matching section. Level keys do not need to be numeric. You can add as many levels as you want.
489
518
490
519
To define a custom level you can add a custom configuration file with its own level entry as shown below:
491
520
@@ -1227,7 +1256,8 @@ There are no known issues.
1227
1256
| 1.19 | - Added default option to compile a single regex for a whole phrase list to make it more efficient to RedactorPhraseDict and RedactorPhraseList</br>- Added combine-sets parameter to support turning this off if required</br>- Added complete prematch and postmatch support for RedactorPhraseDict and RedactorPhraseList</br>- Added add-wordbreak parameter to RedactorPhraseDict and RedactorPhraseList</br>- Documented all of the above changes in README | Nov 2022 |
1228
1257
| 1.20 | - Added --verbose and --no-verbose command line options</br>- Changed entity restoration error from an exception that stops execution to a warning that restoration failed. | 16 Dec 2022 |
1229
1258
| 1.21 | - Add RedactorTokenMap to refactor token-map processing </br>- Added anonymization and redaction order debug to redactomatic </br>- Correct redactomatic bug that did not correctly track split names for anonymization </br>- Add protect_zones to Spacy redactor </br>- Refactor regex_utils and add search() </br>- Added indexed redaction labels to split spacy names.</br>- Refactor insertion of indexed labels to share common code </br>- add verbosity flag to test-redactomatic.sh | 14 Feb 2024 |
1230
-
| 1.22 | - Remove FAC, GPE, LANGUAGE, NORP, PRODUCT, EVENT, LAUGHTER, LAW, ORG, QUANTITY, WORK_OF_ART, ORDINAL from Level 3. </br>- Create level 4 which does what level 3 used to do. </br>- Create level 0 which simply redacts the Token Map and nothing else. </br>- Log Token Mappings in the output log. </br>- Defend existing labels correctly in TokenMap. </br>- Log changes made by the TokenMap. </br>- Make a modest attempt to defend dates and sums of money in the cardinal rule. </br>- TokenMaps are now case sensitive. Fix test case. </br>- Defend ordinals in the cardinal text rule. |
1259
+
| 1.22 | - Remove FAC, GPE, LANGUAGE, NORP, PRODUCT, EVENT, LAUGHTER, LAW, ORG, QUANTITY, WORK_OF_ART, ORDINAL from Level 3. </br>- Create level 4 which does what level 3 used to do. </br>- Create level 0 which simply redacts the Token Map and nothing else. </br>- Log Token Mappings in the output log. </br>- Defend existing labels correctly in TokenMap. </br>- Log changes made by the TokenMap. </br>- Make a modest attempt to defend dates and sums of money in the cardinal rule. </br>- TokenMaps are now case sensitive. Fix test case. </br>- Defend ordinals in the cardinal text rule. | 21 March 2024 |
parser.add_argument('--uppercase', required=False, action='store_true', help='converts all letters to uppercase')
45
46
parser.add_argument('--level', default=2, required=False, help='The redaction level. Choose 1, 2, or 3 or any custom level. Default is 2')
46
47
parser.add_argument('--seed', type=int, required=False, default=None, help='a seed value for anonymization random selection; default is None i.e. truly random.')
47
-
parser.add_argument('--rulefile', nargs="*", required=False, default=[], help='A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules). These are globbable.')
48
+
parser.add_argument('--rulefile', nargs="*", required=False, default=[], help='A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules). These are globbable.')
48
49
parser.add_argument('--regextest', required=False, default=False, action='store_true', help='Test the regular expressions defined in the regex-test rules prior to any other processing.')
49
50
parser.add_argument('--testoutputfile', required=False, help='The file to save test results in.')
50
51
parser.add_argument('--chunksize', required=False, default=100000, type=int, help='The number of lines to read before processing a chunk.(default = 100000)' )
parser.add_argument('--traceback', action='store_true', default=False, help='Give traceback information when an error is thrown (default=False)')
56
57
parser.add_argument('-v','--verbose', action='store_true', default=True, help='Print progress of redaction to standard output as it occurs. Does not affect stderr. (default=True)')
57
58
parser.add_argument('--no-verbose', dest='verbose', action='store_false', help='Turn off --verbose')
59
+
parser.add_argument('-sd','--startdate', type=str, default=None, help='Start date for date-based processing (default=None)')
60
+
parser.add_argument('-ed','--enddate', type=str, default=None, help='End date for date-based processing (default=None)')
61
+
parser.add_argument('-co','--chunkoutstem', type=str, default=None, help='Base name for chunked output files when using date-based processing (default=None)')
62
+
parser.add_argument('-is','--instem', type=str, default=None, help='Base name for input files when using date-based processing (default=None)')
63
+
parser.add_argument('-os','--outstem', type=str, default=None, help='Base name for output files when using date-based processing (default=None)')
64
+
parser.add_argument('--chunkgather', type=int, default=None, help='Number of chunks to process before moving to final destination. If not specified, no chunking is performed.')
58
65
59
66
#version
60
67
parser.add_argument('--version', action='version', help='Print the version', version=f'redactomatic {__version__()}')
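
As a rough check of how the new options behave, the sketch below rebuilds only the date-related arguments from this diff in a standalone parser and parses a sample command line. The function name `build_parser` and the sample stems are assumptions for illustration; the chunk size of 1000 and gather count of 40 are taken from Example 6 in the README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Standalone parser holding only the options added in this change (sketch)."""
    parser = argparse.ArgumentParser(prog="redactomatic.py")
    parser.add_argument('--chunksize', required=False, default=100000, type=int)
    parser.add_argument('-sd', '--startdate', type=str, default=None)
    parser.add_argument('-ed', '--enddate', type=str, default=None)
    parser.add_argument('-co', '--chunkoutstem', type=str, default=None)
    parser.add_argument('-is', '--instem', type=str, default=None)
    parser.add_argument('-os', '--outstem', type=str, default=None)
    parser.add_argument('--chunkgather', type=int, default=None)
    return parser


# Sample argument list; the stems are illustrative values, not documented defaults.
args = build_parser().parse_args([
    '-sd', '2024-01-01',
    '-ed', '2024-01-31',
    '-is', 'mydir/input',
    '-os', 'myotherdir/output',
    '-co', 'andanotherdir/chunk',
    '--chunksize', '1000',
    '--chunkgather', '40',
])

print(args.startdate, args.enddate, args.chunkgather)
```

All of the new options default to `None`, so a caller can tell "not supplied" apart from any real value before deciding whether to use date-based naming.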