eisenzopf
diff --git a/README.md b/README.md
Lines changed: 36 additions & 8 deletions
diff --git a/redactomatic.py b/redactomatic.py
Lines changed: 46 additions & 8 deletions
@@ -172,6 +172,11 @@ usage: redactomatic.py
       [--verbose]
       [--no-verbose]
       [--traceback]
+      [-sd, --startdate]
+      [-ed, --enddate]
+      [-co, --chunkoutstem]
+      [-is, --instem]
+      [-os, --outstem]
 ```
 
 ### Command Line Parameters
@@ -200,11 +205,18 @@ usage: redactomatic.py
 | `--rulefile`                               | A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules).  These are globbable.   | *OPTIONAL*       |
 | `--regextest`                              | Test the regular expressions defined in the regex-test rules prior to any other processing.                                              | *OPTIONAL*       |
 | `--testoutputfile`                         | The file to save the regular expression test results in.                                                                                 | *OPTIONAL*       |
-| `--traceback`</br>`--no-traceback  `       | Give traceback information when an exceptin causes the program to halt.                                                                  | no-traceback     |
+| `--traceback`</br>`--no-traceback`         | Give traceback information when an exception causes the program to halt.                                                                 | no-traceback     |
 | `--version`                                | Print the version and exit                                                                                                               | *TERMINAL*       |
 | `--verbose`</br>`--no-verbose`             | Print the status of processing steps to standard output.                                                                                 | verbose          |
+| `-sd`,`--startdate`                        | Start date for date-based processing                                                                                                     | *OPTIONAL*       |
+| `-ed`,`--enddate`                          | End date for date-based processing                                                                                                       | *OPTIONAL*       |
+| `-co`,`--chunkoutstem`                     | Base name for chunked output files when using date-based processing                                                                      | *OPTIONAL*       |
+| `-is`,`--instem`                           | Base name for input files when using date-based processing                                                                               | *OPTIONAL*       |
+| `-os`,`--outstem`                          | Base name for output files when using date-based processing                                                                              | *OPTIONAL*       |
+| `--chunkgather`                            | The number of chunks to gather into each chunked output file, using `--chunkoutstem` as the base name.                                   | *OPTIONAL*       |
 
-(\*) ** *The command must either specifiy the --header option or give the --column and --idcolumn options.* **
+
+(\*) ** *The command must either specify the --header option or give the --column and --idcolumn options.* **
 
 ### Example 1: Redact a text file with no header
 
@@ -254,6 +266,23 @@ python3 redactomatic.py --column 4 --idcolumn 1 --modality text --inputfile ./da
 python3 redactomatic.py --column 4 --idcolumn 1 --modality text --inputfile ./data/output.csv --outputfile output2.csv --anonymize --no-redact
 ```
 
+### Example 6: Using date-based filenames and gathered chunks
+
+To support date-based file naming, Redactomatic can build the input and output file names from input and output file stems plus start-date and end-date suffixes.  It can also split the final output into gathered chunks if you want a number of smaller output files rather than one large one.
+
+```sh
+python3 redactomatic.py --column 4 --idcolumn 1 --modality text --instem mydir/input -sd '2024-01-01' -ed '2024-01-31' --outstem myotherdir/output --chunkoutstem andanotherdir/chunk --chunksize 1000 --chunkgather 40
+```
+
+The command above does the following:
+
+- Reads the input from the file `mydir/input_2024-01-01_2024-01-31.csv`
+- Reads this input file in chunks of 1000 lines
+- Appends these chunks to the output file `myotherdir/output_2024-01-01_2024-01-31.csv` as they are processed
+- Every 40 chunks, moves the output file to `andanotherdir/chunk_2024-01-01_2024-01-31_{chunk_no}.csv`, where `{chunk_no}` is the index of the last chunk gathered into that file (39, 79, 119, etc.), and starts a new output file
+
+When the process has completed, any remaining output in `myotherdir/output_2024-01-01_2024-01-31.csv` is moved to a final chunk file in the same way, leaving just the chunked output files.
+
 ## Redaction Levels
 
 You can define the entities that will be redacted (and hence anonymized). By default five levels are defined: '0', '1', '2', '3' and '4'.  Level '0' does no redaction. It can be used in cases where the input is already redacted and redaction tokens need to be translated into standard formats.  Level '1' means that Redactomatic will only use the machine learning NER parser, which captures many entities but is not reliable and does not match addresses, phone numbers, SSNs, and other kinds of numbers that are probably important to recognize. Level '2' is the default level and matches most PII entities. However, it can miss numbers that aren't supported or are formatted in a way that Redactomatic hasn't seen before. Level '3' adds in cardinals (e.g. '123' or 'one two three') so that numbers are always redacted whether they are a recognized type or not.  It also adds in dates, percentages and times as these are also numeric items.  It does not redact ordinals (e.g. 'first', 'second', 'third').  Finally, level '4' contains all the entities from the machine learning NER parser and also adds in ordinals.  It is unlikely that level 4 will be helpful but it is included as the highest level of redaction possible.
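
The date-based naming described in Example 6 can be sketched in isolation. This is a minimal illustration that assumes the same `{stem}_{startdate}_{enddate}` pattern used in the diff; the helper name `build_names` and the `data/...` paths are hypothetical and are not part of redactomatic.

```python
from typing import Optional, Tuple


def build_names(instem: str, outstem: str, startdate: str, enddate: str,
                chunkoutstem: Optional[str] = None) -> Tuple[str, str, Optional[str]]:
    """Return (input file, output file, chunk stem) for a date range.

    Hypothetical helper: it only mirrors the f-string pattern shown in the diff.
    """
    suffix = f"{startdate}_{enddate}"
    inputfile = f"{instem}_{suffix}.csv"
    outputfile = f"{outstem}_{suffix}.csv"
    # The chunk stem gets no extension here; a chunk number and ".csv" are added later.
    chunkstem = f"{chunkoutstem}_{suffix}" if chunkoutstem is not None else None
    return inputfile, outputfile, chunkstem


names = build_names("data/in", "data/out", "2024-01-01", "2024-01-31", "data/chunk")
print(names)
```

The dates are treated as opaque strings, as in the diff, so any suffix format works as long as the files on disk use the same one.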
@@ -485,7 +514,7 @@ level:
 
 **Example level definition (YAML)**
 
-Any number of level definitions may be set.  The defalt configuration files contain three level keys '1', '2', '3' and '4', an extract of which is shown above.     The ` --level` option uses the entity list that is found in the relevant matching section.  Level keys do not need to be numeric.  You can add as many levels as you want.
+Any number of level definitions may be set.  The default configuration files contain five level keys '0', '1', '2', '3' and '4', an extract of which is shown above.  The `--level` option uses the entity list that is found in the relevant matching section.  Level keys do not need to be numeric.  You can add as many levels as you want.
 
 To define a custom level you can add a custom configuration file with its own level entry as shown below:
 
@@ -1227,7 +1256,8 @@ There are no known issues.
 | 1.19    | - Added default option to compile a single regex for a whole phrase list to make it more efficient to RedactorPhraseDict and RedactorPhraseList</br>- Added combine-sets parameter to support turning this off if required</br>- Added complete prematch and postmatch support for RedactorPhraseDict and RedactorPhraseList</br>- Added add-wordbreak parameter to RedactorPhraseDict and RedactorPhraseList</br>- Documented all of the above changes in README                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | Nov 2022     |
 | 1.20    | - Added --verbose and --no-verbose command line options</br>- Changed entity restoration error from an exception that stops execution to a warning that restoration failed.             | 16 Dec 2022  |
 | 1.21    | - Add RedactorTokenMap to refactor token-map processing </br>- Added anonymization and redaction order debug to redactomatic </br>- Correct redactomatic bug that did not correctly track split names for anonymization </br>- Add protect_zones to Spacy redactor </br>- Refactor regex_utils and add search() </br>- Added indexed redaction labels to split spacy names.</br>- Refactor insertion of indexed labels to share common code </br>- add verbosity flag to test-redactomatic.sh   | 14 Feb 2024  |
-| 1.22    | - Remove FAC, GPE, LANGUAGE, NORP, PRODUCT, EVENT, LAUGHTER, LAW, ORG, QUANTITY, WORK_OF_ART, ORDINAL from Level 3. </br>- Create level 4 which does what level 3 used to do. </br>- Create level 0 which simply redacts the Token Map and nothing else. </br>- Log Token Mappings in the output log. </br>- Defend existing labels correctly in TokenMap. </br>- Log changes made by the TokenMap. </br>- Make a modest attempt to defend dates and sums of money in the cardinal rule. </br>- TokenMaps are now case sensitive. Fix test case. </br>- Defend ordinals in the cardinal text rule. |
+| 1.22    | - Remove FAC, GPE, LANGUAGE, NORP, PRODUCT, EVENT, LAUGHTER, LAW, ORG, QUANTITY, WORK_OF_ART, ORDINAL from Level 3. </br>- Create level 4 which does what level 3 used to do. </br>- Create level 0 which simply redacts the Token Map and nothing else. </br>- Log Token Mappings in the output log. </br>- Defend existing labels correctly in TokenMap. </br>- Log changes made by the TokenMap. </br>- Make a modest attempt to defend dates and sums of money in the cardinal rule. </br>- TokenMaps are now case sensitive. Fix test case. </br>- Defend ordinals in the cardinal text rule. | 21 March 2024 |
+| 1.23    | - Add --startdate, --enddate, --chunkoutstem, --instem, --outstem and --chunkgather options. | 10 June 2025 |
 
 ## License
 
@@ -1243,17 +1273,15 @@ Copyright (C) 2020, Jonathan Eisenzopf
 Copyright (c) 2020, Open source contributors.
 All rights reserved.
 
-
 ## Contributors
 
-Thanks to [@kavdev](https://github.com/kavdev) for reviewing the code and submitting bug fixes. </br>
-Thanks to [@wmjg-alt](https://github.com/wmjg-alt) for adding context to anonymization functions. </br>
+Thanks to [@kavdev](https://github.com/kavdev) for reviewing the code and submitting bug fixes.
+Thanks to [@wmjg-alt](https://github.com/wmjg-alt) for adding context to anonymization functions.
 Thanks to [@davidattwater](https://github.com/davidattwater) for refactoring the code to use a generic rules base and maintaining the code.
 
 ## Contribution
 
 Unless you explicitly state otherwise, any contribution intentionally submitted
 for inclusion in the work by you, as defined in the MIT license, shall
 be licensed as above, without any additional terms or conditions.
-
 Please see the [Contribution Guidelines](CONTRIBUTING.md).
@@ -10,9 +10,10 @@
 import os
 import traceback
 import redact
+import shutil
 
 def __version__():
-    return "1.22"
+    return "1.23"
 
 TOKENMAP_RULENAME="_TOKEN_MAP_"
 
@@ -44,7 +45,7 @@ def config_args(): # add --anonymize
     parser.add_argument('--uppercase', required=False, action='store_true', help='converts all letters to uppercase')
     parser.add_argument('--level', default=2, required=False, help='The redaction level. Choose 0, 1, 2, 3 or 4, or any custom level. Default is 2')
     parser.add_argument('--seed', type=int, required=False, default=None, help='a seed value for anonymization random selection; default is None i.e. truly random.')
-    parser.add_argument('--rulefile', nargs="*", required=False, default=[], help='A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules).  These are globbable.')
+    parser.add_argument('--rulefile', nargs="*", required=False, default=[], help='A list of filenames defining custom rules in YML or JSON. Add to or override default rules (see --defaultrules). These are globbable.')
     parser.add_argument('--regextest', required=False, default=False, action='store_true', help='Test the regular expressions defined in the regex-test rules prior to any other processing.')
     parser.add_argument('--testoutputfile', required=False, help='The file to save test results in.')
     parser.add_argument('--chunksize', required=False, default=100000, type=int, help='The number of lines to read before processing a chunk.(default = 100000)' )
@@ -55,6 +56,12 @@ def config_args(): # add --anonymize
     parser.add_argument('--traceback', action='store_true', default=False, help='Give traceback information when an error is thrown (default=False)')
     parser.add_argument('-v','--verbose', action='store_true', default=True, help='Print progress of redaction to standard output as it occurs. Does not affect stderr. (default=True)')
     parser.add_argument('--no-verbose',   dest='verbose', action='store_false', help='Turn off --verbose')
+    parser.add_argument('-sd','--startdate', type=str, default=None, help='Start date for date-based processing (default=None)')
+    parser.add_argument('-ed','--enddate', type=str, default=None, help='End date for date-based processing (default=None)')
+    parser.add_argument('-co','--chunkoutstem', type=str, default=None, help='Base name for chunked output files when using date-based processing (default=None)')
+    parser.add_argument('-is','--instem', type=str, default=None, help='Base name for input files when using date-based processing (default=None)')
+    parser.add_argument('-os','--outstem', type=str, default=None, help='Base name for output files when using date-based processing (default=None)')
+    parser.add_argument('--chunkgather', type=int, default=None, help='Number of chunks to process before moving to final destination. If not specified, no chunking is performed.')
 
     #version
     parser.add_argument('--version', action='version', help='Print the version', version=f'redactomatic {__version__()}')
@@ -77,8 +84,23 @@ def config_args(): # add --anonymize
         if (not _args.header):
             if (not _args.column): _err_list.append("ERROR: The --column option is required when --header is False.")
             if (not _args.idcolumn): _err_list.append("ERROR: The --idcolumn option is required when --header is False.")
-        if (not _args.inputfile): _err_list.append("ERROR: The --inputfile option is required.")
-        if (not _args.outputfile): _err_list.append("ERROR: The --outputfile option is required.")
+        
+        # Check date-based processing requirements
+        if (_args.startdate is not None or _args.enddate is not None):
+            if _args.startdate is None: _err_list.append("ERROR: The --startdate option is required when using date-based processing.")
+            if _args.enddate is None: _err_list.append("ERROR: The --enddate option is required when using date-based processing.")
+            if _args.chunkgather is not None and _args.chunkoutstem is None: _err_list.append("ERROR: The --chunkoutstem option is required when using chunked output.")
+            if _args.instem is None: _err_list.append("ERROR: The --instem option is required when using date-based processing.")
+            if _args.outstem is None: _err_list.append("ERROR: The --outstem option is required when using date-based processing.")
+            
+            # Construct date-based filenames
+            _args.inputfile = [f"{_args.instem}_{_args.startdate}_{_args.enddate}.csv"]
+            _args.outputfile = f"{_args.outstem}_{_args.startdate}_{_args.enddate}.csv"
+            # Only build the chunk stem when one was supplied, to avoid a literal "None_..." prefix
+            if _args.chunkoutstem is not None:
+                _args.chunkoutstem = f"{_args.chunkoutstem}_{_args.startdate}_{_args.enddate}"
+        else:
+            if (not _args.inputfile): _err_list.append("ERROR: The --inputfile option is required when not using date-based processing.")
+            if (not _args.outputfile): _err_list.append("ERROR: The --outputfile option is required when not using date-based processing.")
+        
         if (not _args.modality): _err_list.append("ERROR: The --modality option is required.")
     if _err_list:
         parser.error("\n".join(_err_list))
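
The validation rules in the hunk above can be exercised on their own. The sketch below restates them as a standalone function so the interaction between the options is easier to see; `date_arg_errors` is a hypothetical name, the messages are shortened, and the real code reports errors through `parser.error` instead of returning a list.

```python
from argparse import Namespace
from typing import List


def date_arg_errors(args: Namespace) -> List[str]:
    """Collect error messages using the same conditions as the diff (illustrative only)."""
    errors = []
    # Supplying either date switches the tool into date-based mode.
    date_mode = args.startdate is not None or args.enddate is not None
    if date_mode:
        for name in ("startdate", "enddate", "instem", "outstem"):
            if getattr(args, name) is None:
                errors.append(f"--{name} is required when using date-based processing")
        if args.chunkgather is not None and args.chunkoutstem is None:
            errors.append("--chunkoutstem is required when using chunked output")
    else:
        for name in ("inputfile", "outputfile"):
            if not getattr(args, name):
                errors.append(f"--{name} is required when not using date-based processing")
    return errors


# Only a start date and an output stem are given, so two options are reported missing.
partial = Namespace(startdate="2024-01-01", enddate=None, instem=None, outstem="out",
                    chunkoutstem=None, chunkgather=None, inputfile=None, outputfile=None)
errors = date_arg_errors(partial)
print(errors)
```

Note that, as in the diff, the `--chunkoutstem` check only runs in date-based mode.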
@@ -229,12 +251,22 @@ def main(args):
 
         for file in args.inputfile:
             if (args.verbose): print("Loading datafile " + file + "...")
-            df_iter = pd.read_csv(file,chunksize=args.chunksize,header=(0 if args.header else None),dtype=str, keep_default_na=False)
+            df_iter = pd.read_csv(file, chunksize=args.chunksize, header=(0 if args.header else None), dtype=str, keep_default_na=False)
+            
             for df in df_iter:
                 redactomatic.process(df)
-                if (args.verbose): print("Writing outfile ",args.outputfile, "chunk ",chunk)
-                if chunk==0: df.to_csv(args.outputfile, index=False, header=args.header)
-                else: df.to_csv(args.outputfile, mode='a', header=False, index=False)
+                if (args.verbose): print("Writing outfile ", args.outputfile, "chunk ", chunk)
+                
+                if chunk == 0: 
+                    df.to_csv(args.outputfile, index=False, header=args.header)
+                elif args.chunkgather is not None and chunk % args.chunkgather == 0:
+                    # Move current output to chunked destination and start new output file
+                    final_file = f"{args.chunkoutstem}_{chunk-1}.csv"
+                    shutil.move(args.outputfile, final_file)
+                    if (args.verbose): print(f"{args.outputfile} moved to {final_file}")
+                    df.to_csv(args.outputfile, index=False, header=args.header)
+                else: 
+                    df.to_csv(args.outputfile, mode='a', header=False, index=False)
 
                 #Quit if the chunklimit has been reached.
                 if (args.chunklimit is not None) and (chunk+1>=args.chunklimit):
@@ -243,6 +275,12 @@ def main(args):
 
                 chunk=chunk+1
 
+            # Move final chunk to destination if using chunked output.
+            # Skip the move when no output was written (for example, an input with no rows).
+            if args.chunkgather is not None and os.path.exists(args.outputfile):
+                final_file = f"{args.chunkoutstem}_{chunk-1}.csv"
+                shutil.move(args.outputfile, final_file)
+                if (args.verbose): print(f"{args.outputfile} moved to {final_file}")
+
         # write audit log
         if args.log:
             if (args.verbose): print("Writing logfile", args.log)
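
To see which `{chunk_no}` suffixes the gather logic in `main` produces, the loop can be simulated without any file I/O. This is a sketch under stated assumptions: `gathered_suffixes` is a hypothetical helper, it models a single input file, it ignores `--chunklimit`, and it returns an empty list for an empty input.

```python
from typing import List, Optional


def gathered_suffixes(total_chunks: int, chunkgather: Optional[int]) -> List[int]:
    """Return the {chunk_no} suffixes of the gathered files, in creation order."""
    suffixes: List[int] = []
    if chunkgather is None or total_chunks == 0:
        return suffixes
    for chunk in range(total_chunks):
        # Mirrors the `elif` branch: the output file is moved just before
        # chunk 40, 80, ... is written, and is named after the previous chunk.
        if chunk != 0 and chunk % chunkgather == 0:
            suffixes.append(chunk - 1)
    # Mirrors the move after the loop: the remainder is named after the last chunk.
    suffixes.append(total_chunks - 1)
    return suffixes


result = gathered_suffixes(100, 40)
print(result)  # [39, 79, 99]
```

With 100 chunks and `--chunkgather 40` the last file holds only 20 chunks, so its suffix is 99 rather than 119. A suffix of 0 appears only when the whole input fits in a single chunk.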