Commit 974f800

Release 1.23
1 parent c20483f commit 974f800

9 files changed

Lines changed: 189 additions & 17 deletions

README.md

Lines changed: 36 additions & 8 deletions
@@ -172,6 +172,11 @@ usage: redactomatic.py
 [--verbose]
 [--no-verbose]
 [--traceback]
+[-sd, --startdate]
+[-ed, --enddate]
+[-co, --chunkoutstem]
+[-is, --instem]
+[-os, --outstem]
 ```
 
 ### Command Line Parameters
@@ -200,11 +205,18 @@ usage: redactomatic.py
 | `--rulefile` | A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules).  These are globbable. | *OPTIONAL* |
 | `--regextest` | Test the regular expressions defined in the regex-test rules prior to any other processing. | *OPTIONAL* |
 | `--testoutputfile` | The file to save the regular expression test results in. | *OPTIONAL* |
-| `--traceback`</br>`--no-traceback ` | Give traceback information when an exception causes the program to halt. | no-traceback |
+| `--traceback`</br>`--no-traceback` | Give traceback information when an exception causes the program to halt. | no-traceback |
 | `--version` | Print the version and exit | *TERMINAL* |
 | `--verbose`</br>`--no-verbose` | Print the status of processing steps to standard output. | verbose |
+| `-sd`,`--startdate` | Start date for date-based processing | *OPTIONAL* |
+| `-ed`,`--enddate` | End date for date-based processing | *OPTIONAL* |
+| `-co`,`--chunkoutstem` | Base name for chunked output files when using date-based processing | *OPTIONAL* |
+| `-is`,`--instem` | Base name for input files when using date-based processing | *OPTIONAL* |
+| `-os`,`--outstem` | Base name for output files when using date-based processing | *OPTIONAL* |
+| `--chunkgather` | The number of chunks to gather into separate chunked output files using the `outstem` as the base name. | *OPTIONAL* |
 
-(\*) ** *The command must either specify the --header option or give the --column and --idcolumn options.* **
+
+(\*) ** *The command must either specify the --header option or give the --column and --idcolumn options.* **
 
 ### Example 1: Redact a text file with no header
 
@@ -254,6 +266,23 @@ python3 redactomatic.py --column 4 --idcolumn 1 --modality text --inputfile ./da
 python3 redactomatic.py --column 4 --idcolumn 1 --modality text --inputfile ./data/output.csv --outputfile output2.csv --anonymize --no-redact
 ```
 
+### Example 6: Using date-based filenames and gathered chunks
+
+In order to support date-based file naming, redactomatic can take input and output file stems plus start date and end date postfixes and create the input and output file names from them. It can also create a final output that is chopped into gathered chunks if you want a number of smaller files in the output rather than one large one.
+
+```sh
+python3 redactomatic.py --column 4 --idcolumn 1 --modality text --instem mydir/input -sd '2024-01-01' -ed '2024-01-31' --outstem myotherdir/ouput --chunkoutstem andanotherdir/chunk --chunksize 1000 --chunkgather 40
+```
+
+The command above does the following:
+
+- Reads the input from the file `mydir/input_2024-01-01_2024-01-31.csv`
+- Reads this input file in chunks of 1000 lines
+- Appends these chunks into the output file `myotherdir/ouput_2024-01-01_2024-01-31.csv` as they are processed
+- Every 40 chunks, the output file is moved to `andanotherdir/chunk_2024-01-01_2024-01-31_{chunk_no}.csv` where {chunk_no} is the index of the last chunk gathered, i.e. 39, 79, 119 etc.
+
+When the process has completed, the remaining output in `myotherdir/ouput_2024-01-01_2024-01-31.csv` is moved to a final chunked file, leaving just the chunked output files.
+
 ## Redaction Levels
 
 You can define the entities that will be redacted (and hence anonymized). By default five levels are defined: '0', '1', '2', '3' and '4'. Level '0' does no redaction. It can be used in cases where the input is already redacted and translation of redaction tokens into standard formats is needed. Level '1' means that Redactomatic will only use the machine learning NER parser, which captures many entities but is not reliable and does not match addresses, phone numbers, SSN, and other kinds of numbers that are probably important to recognize. Level '2' is the default level and matches most PII entities. However, it can miss numbers that aren't supported or are formatted in a way that Redactomatic hasn't seen before. Level '3' adds in cardinals (e.g. '123' or 'one two three') so that cardinal numbers are always redacted whether they are a recognized type or not. It also adds in dates, percentages and times as these are also numeric items. It does not redact ordinals (e.g. 'first', 'second', 'third'). Finally we have added an additional level '4' which contains all the entities from the machine learning NER parser and also adds in ordinals. It is unlikely that level 4 will be helpful but it is included as the highest level of redaction possible.
@@ -485,7 +514,7 @@ level:
 
 **Example level definition (YAML)**
 
-Any number of level definitions may be set. The default configuration files contain three level keys '1', '2', '3' and '4', an extract of which is shown above. The `--level` option uses the entity list that is found in the relevant matching section. Level keys do not need to be numeric. You can add as many levels as you want.
+Any number of level definitions may be set. The default configuration files contain five level keys '0', '1', '2', '3' and '4', an extract of which is shown above. The `--level` option uses the entity list that is found in the relevant matching section. Level keys do not need to be numeric. You can add as many levels as you want.
 
 To define a custom level you can add a custom configuration file with its own level entry as shown below:
 
@@ -1227,7 +1256,8 @@ There are no known issues.
 | 1.19 | - Added default option to compile a single regex for a whole phrase list to make it more efficient to RedactorPhraseDict and RedactorPhraseList</br>- Added combine-sets parameter to support turning this off if required</br>- Added complete prematch and postmatch support for RedactorPhraseDict and RedactorPhraseList</br>- Added add-wordbreak parameter to RedactorPhraseDict and RedactorPhraseList</br>- Documented all of the above changes in README | Nov 2022 |
 | 1.20 | - Added --verbose and --no-verbose command line options</br>- Changed entity restoration error from an exception that stops execution to a warning that restoration failed. | 16 Dec 2022 |
 | 1.21 | - Add RedactorTokenMap to refactor token-map processing </br>- Added anonymization and redaction order debug to redactomatic </br>- Correct redactomatic bug that did not correctly track split names for anonymization </br>- Add protect_zones to Spacy redactor </br>- Refactor regex_utils and add search() </br>- Added indexed redaction labels to split spacy names.</br>- Refactor insertion of indexed labels to share common code </br>- add verbosity flag to test-redactomatic.sh | 14 Feb 2024 |
-| 1.22 | - Remove FAC, GPE, LANGUAGE, NORP, PRODUCT, EVENT, LAUGHTER, LAW, ORG, QUANTITY, WORK_OF_ART, ORDINAL from Level 3. </br>- Create level 4 which does what level 3 used to do. </br>- Create level 0 which simply redacts the Token Map and nothing else. </br>- Log Token Mappings in the output log. </br>- Defend existing labels correctly in TokenMap. </br>- Log changes made by the TokenMap. </br>- Make a modest attempt to defend dates and sums of money in the cardinal rule. </br>- TokenMaps are now case sensitive. Fix test case. </br>- Defend ordinals in the cardinal text rule. |
+| 1.22 | - Remove FAC, GPE, LANGUAGE, NORP, PRODUCT, EVENT, LAUGHTER, LAW, ORG, QUANTITY, WORK_OF_ART, ORDINAL from Level 3. </br>- Create level 4 which does what level 3 used to do. </br>- Create level 0 which simply redacts the Token Map and nothing else. </br>- Log Token Mappings in the output log. </br>- Defend existing labels correctly in TokenMap. </br>- Log changes made by the TokenMap. </br>- Make a modest attempt to defend dates and sums of money in the cardinal rule. </br>- TokenMaps are now case sensitive. Fix test case. </br>- Defend ordinals in the cardinal text rule. | 21 March 2024 |
+| 1.23 | - Add --startdate --enddate --chunkoutstem --instem --outstem options. | 10 June 2025 |
 
 ## License
 
@@ -1243,17 +1273,15 @@ Copyright (C) 2020, Jonathan Eisenzopf
 Copyright (c) 2020, Open source contributors.
 All rights reserved.
 
-
 ## Contributors
 
-Thanks to [@kavdev](https://github.com/kavdev) for reviewing the code and submitting bug fixes. </br>
-Thanks to [@wmjg-alt](https://github.com/wmjg-alt) for adding context to anonymization functions. </br>
+Thanks to [@kavdev](https://github.com/kavdev) for reviewing the code and submitting bug fixes.
+Thanks to [@wmjg-alt](https://github.com/wmjg-alt) for adding context to anonymization functions.
 Thanks to [@davidattwater](https://github.com/davidattwater) for refactoring the code to use a generic rules base and maintaining the code.
 
 ## Contribution
 
 Unless you explicitly state otherwise, any contribution intentionally submitted
 for inclusion in the work by you, as defined in the MIT license, shall
 be licensed as above, without any additional terms or conditions.
-
 Please see the [Contribution Guidelines](CONTRIBUTING.md).

redactomatic.py

Lines changed: 46 additions & 8 deletions
@@ -10,9 +10,10 @@
 import os
 import traceback
 import redact
+import shutil
 
 def __version__():
-    return "1.22"
+    return "1.23"
 
 TOKENMAP_RULENAME="_TOKEN_MAP_"
 
@@ -44,7 +45,7 @@ def config_args(): # add --anonymize
     parser.add_argument('--uppercase', required=False, action='store_true', help='converts all letters to uppercase')
     parser.add_argument('--level', default=2, required=False, help='The redaction level. Choose 1, 2, or 3 or any custom level. Default is 2')
     parser.add_argument('--seed', type=int, required=False, default=None, help='a seed value for anonymization random selection; default is None i.e. truly random.')
-    parser.add_argument('--rulefile', nargs="*", required=False, default=[], help='A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules).  These are globbable.')
+    parser.add_argument('--rulefile', nargs="*", required=False, default=[], help='A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules). These are globbable.')
     parser.add_argument('--regextest', required=False, default=False, action='store_true', help='Test the regular expressions defined in the regex-test rules prior to any other processing.')
     parser.add_argument('--testoutputfile', required=False, help='The file to save test results in.')
     parser.add_argument('--chunksize', required=False, default=100000, type=int, help='The number of lines to read before processing a chunk.(default = 100000)' )
@@ -55,6 +56,12 @@ def config_args(): # add --anonymize
     parser.add_argument('--traceback', action='store_true', default=False, help='Give traceback information when an error is thrown (default=False)')
     parser.add_argument('-v','--verbose', action='store_true', default=True, help='Print progress of redaction to standard output as it occurs. Does not affect stderr. (default=True)')
     parser.add_argument('--no-verbose', dest='verbose', action='store_false', help='Turn off --verbose')
+    parser.add_argument('-sd','--startdate', type=str, default=None, help='Start date for date-based processing (default=None)')
+    parser.add_argument('-ed','--enddate', type=str, default=None, help='End date for date-based processing (default=None)')
+    parser.add_argument('-co','--chunkoutstem', type=str, default=None, help='Base name for chunked output files when using date-based processing (default=None)')
+    parser.add_argument('-is','--instem', type=str, default=None, help='Base name for input files when using date-based processing (default=None)')
+    parser.add_argument('-os','--outstem', type=str, default=None, help='Base name for output files when using date-based processing (default=None)')
+    parser.add_argument('--chunkgather', type=int, default=None, help='Number of chunks to process before moving to final destination. If not specified, no chunking is performed.')
 
     #version
     parser.add_argument('--version', action='version', help='Print the version', version=f'redactomatic {__version__()}')
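
As a rough illustration of how the options added in the hunk above behave, the following standalone sketch declares only those options (plus `--chunksize` for context) and parses the README's Example 6 command line. The `build_parser` helper and `EXAMPLE_ARGV` list are hypothetical names used for this illustration; they are not part of redactomatic.py, and the real parser defines many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: declares only the date/chunk options from this commit."""
    parser = argparse.ArgumentParser(prog="redactomatic.py")
    parser.add_argument('--chunksize', type=int, default=100000)
    parser.add_argument('-sd', '--startdate', type=str, default=None)
    parser.add_argument('-ed', '--enddate', type=str, default=None)
    parser.add_argument('-co', '--chunkoutstem', type=str, default=None)
    parser.add_argument('-is', '--instem', type=str, default=None)
    parser.add_argument('-os', '--outstem', type=str, default=None)
    parser.add_argument('--chunkgather', type=int, default=None)
    return parser


# The date/chunk arguments of README Example 6, already split as a shell would.
EXAMPLE_ARGV = [
    '--instem', 'mydir/input',
    '-sd', '2024-01-01',
    '-ed', '2024-01-31',
    '--outstem', 'myotherdir/ouput',
    '--chunkoutstem', 'andanotherdir/chunk',
    '--chunksize', '1000',
    '--chunkgather', '40',
]

args = build_parser().parse_args(EXAMPLE_ARGV)
print(args.startdate, args.enddate, args.chunkgather)
```

argparse takes the attribute name from the first long option, so `-is` is read back as `args.instem` and never collides with the Python keyword `is`.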
@@ -77,8 +84,23 @@ def config_args(): # add --anonymize
     if (not _args.header):
         if (not _args.column): _err_list.append("ERROR: The --column option is required when --header is False.")
         if (not _args.idcolumn): _err_list.append("ERROR: The --idcolumn option is required when --header is False.")
-    if (not _args.inputfile): _err_list.append("ERROR: The --inputfile option is required.")
-    if (not _args.outputfile): _err_list.append("ERROR: The --outputfile option is required.")
+
+    # Check date-based processing requirements
+    if (_args.startdate is not None or _args.enddate is not None):
+        if _args.startdate is None: _err_list.append("ERROR: The --startdate option is required when using date-based processing.")
+        if _args.enddate is None: _err_list.append("ERROR: The --enddate option is required when using date-based processing.")
+        if _args.chunkgather is not None and _args.chunkoutstem is None: _err_list.append("ERROR: The --chunkoutstem option is required when using chunked output.")
+        if _args.instem is None: _err_list.append("ERROR: The --instem option is required when using date-based processing.")
+        if _args.outstem is None: _err_list.append("ERROR: The --outstem option is required when using date-based processing.")
+
+        # Construct date-based filenames
+        _args.inputfile = [f"{_args.instem}_{_args.startdate}_{_args.enddate}.csv"]
+        _args.outputfile = f"{_args.outstem}_{_args.startdate}_{_args.enddate}.csv"
+        _args.chunkoutstem = f"{_args.chunkoutstem}_{_args.startdate}_{_args.enddate}"
+    else:
+        if (not _args.inputfile): _err_list.append("ERROR: The --inputfile option is required when not using date-based processing.")
+        if (not _args.outputfile): _err_list.append("ERROR: The --outputfile option is required when not using date-based processing.")
+
     if (not _args.modality): _err_list.append("ERROR: The --modality option is required.")
     if _err_list:
         parser.error("\n".join(_err_list))
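
The validation and filename construction in the hunk above can be sketched as two small pure functions. This is a minimal illustration, not the code in redactomatic.py: the function names `date_option_errors` and `date_based_names` are hypothetical, and the error messages are shortened. The naming pattern (`{stem}_{startdate}_{enddate}.csv`) is taken from the diff.

```python
from typing import List, Optional, Tuple


def date_option_errors(startdate: Optional[str], enddate: Optional[str],
                       instem: Optional[str], outstem: Optional[str],
                       chunkoutstem: Optional[str],
                       chunkgather: Optional[int]) -> List[str]:
    """Collect messages for an incomplete date-based option set (hypothetical helper)."""
    errors: List[str] = []
    if startdate is None and enddate is None:
        # Not date-based: --inputfile/--outputfile are checked on the other branch.
        return errors
    if startdate is None:
        errors.append("--startdate is required")
    if enddate is None:
        errors.append("--enddate is required")
    if chunkgather is not None and chunkoutstem is None:
        errors.append("--chunkoutstem is required when using chunked output")
    if instem is None:
        errors.append("--instem is required")
    if outstem is None:
        errors.append("--outstem is required")
    return errors


def date_based_names(instem: str, outstem: str, chunkoutstem: str,
                     startdate: str, enddate: str) -> Tuple[List[str], str, str]:
    """Build (input file list, output file, chunk output stem) as the diff does."""
    suffix = f"{startdate}_{enddate}"
    return ([f"{instem}_{suffix}.csv"],
            f"{outstem}_{suffix}.csv",
            f"{chunkoutstem}_{suffix}")


inputs, output, chunk_stem = date_based_names(
    "mydir/input", "myotherdir/ouput", "andanotherdir/chunk",
    "2024-01-01", "2024-01-31")
print(inputs, output, chunk_stem)
```

Note that the dates are used as plain strings: neither the diff nor this sketch checks that they are valid dates or that the start precedes the end.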
@@ -229,12 +251,22 @@ def main(args):
 
     for file in args.inputfile:
         if (args.verbose): print("Loading datafile " + file + "...")
-        df_iter = pd.read_csv(file,chunksize=args.chunksize,header=(0 if args.header else None),dtype=str, keep_default_na=False)
+        df_iter = pd.read_csv(file, chunksize=args.chunksize, header=(0 if args.header else None), dtype=str, keep_default_na=False)
+
         for df in df_iter:
             redactomatic.process(df)
-            if (args.verbose): print("Writing outfile ",args.outputfile, "chunk ",chunk)
-            if chunk==0: df.to_csv(args.outputfile, index=False, header=args.header)
-            else: df.to_csv(args.outputfile, mode='a', header=False, index=False)
+            if (args.verbose): print("Writing outfile ", args.outputfile, "chunk ", chunk)
+
+            if chunk == 0:
+                df.to_csv(args.outputfile, index=False, header=args.header)
+            elif args.chunkgather is not None and chunk % args.chunkgather == 0:
+                # Move current output to chunked destination and start new output file
+                final_file = f"{args.chunkoutstem}_{chunk-1}.csv"
+                shutil.move(args.outputfile, final_file)
+                if (args.verbose): print(f"{args.outputfile} moved to {final_file}")
+                df.to_csv(args.outputfile, index=False, header=args.header)
+            else:
+                df.to_csv(args.outputfile, mode='a', header=False, index=False)
 
             #Quit if the chunklimit has been reached.
             if (args.chunklimit is not None) and (chunk+1>=args.chunklimit):
@@ -243,6 +275,12 @@ def main(args):
 
             chunk=chunk+1
 
+    # Move final chunk to destination if using chunked output
+    if args.chunkgather is not None:
+        final_file = f"{args.chunkoutstem}_{chunk-1}.csv"
+        shutil.move(args.outputfile, final_file)
+        if (args.verbose): print(f"{args.outputfile} moved to {final_file}")
+
     # write audit log
     if args.log:
         if (args.verbose): print("Writing logfile", args.log)
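
To see which gathered files the chunk loop in this diff would produce, the move logic can be simulated without pandas or any file I/O. The sketch below is an illustration under stated assumptions: `gathered_suffixes` is a hypothetical helper, `--chunklimit` is not modelled, and the input is assumed to contain at least one chunk. It mirrors the two places where the diff calls `shutil.move`: whenever `chunk % chunkgather == 0` for a non-zero `chunk`, and once more after the loop.

```python
from typing import List


def gathered_suffixes(total_chunks: int, chunkgather: int) -> List[int]:
    """Return the {chunk_no} suffixes of the gathered files, in creation order."""
    suffixes: List[int] = []
    for chunk in range(total_chunks):
        # Mirrors the `elif` branch: chunk 0 always starts the first output file.
        if chunk != 0 and chunk % chunkgather == 0:
            suffixes.append(chunk - 1)
    if total_chunks > 0:
        # Mirrors the move after the loop, where the counter has already advanced
        # past the last chunk, so `chunk - 1` is the last chunk's index.
        suffixes.append(total_chunks - 1)
    return suffixes


stem = "andanotherdir/chunk_2024-01-01_2024-01-31"
for suffix in gathered_suffixes(total_chunks=100, chunkgather=40):
    print(f"{stem}_{suffix}.csv")
```

With `--chunkgather 40` and 100 chunks of input, this logic yields suffixes 39, 79 and 99: each suffix is the index of the last chunk in the gathered file, and the final file holds whatever remained when the input ran out.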
