README.md: 24 additions & 9 deletions
@@ -28,7 +28,7 @@ As Redactomatic processes each row of an input CSV file, it replaces each recogn

 ### Anonymization

-If the optional command-line `--anonymize` switch is included, Redactomatic will replace all entity type tags with a randomized value.
+If the optional command-line `--anonymize` switch is included, Redactomatic will replace all entity type tags with a randomized value. By default, subsequent occurrences of the same entities with the same index number in a given conversation will be assigned the same value. This helps the anonymized conversation retain continuity.

 Anonymization functions are supplied and you can add specific ones if required. By default alpha-numerical entity tags are anonymized using a random number/text generator based on patterns (regex).

@@ -73,7 +73,7 @@ entities:
 model-class: anonymize.AnonRestoreEntityText
 ```

-If you are interested in how this is implemented, notice how the definition of the *\_IGNORE\_* rule uses the redactor `redact.RedactorPhraseList` to first redact the words to be ignored, and then uses the anonymizer `anonymize.AnonRestoreEntityText` to restore the text again. In addition to this entityalso needs to be added to the *always-anonymize* section of the configuration to ensure that it is anonymized even when the `--anonymize` option is not set. Each of these steps is explained in detail later in this document.
+If you are interested in how this is implemented, notice how the definition of the *\_IGNORE\_* rule uses the redactor `redact.RedactorPhraseList` to first redact the words to be ignored, and then uses the anonymizer `anonymize.AnonRestoreEntityText` to restore the text again. In addition to this, the entity also needs to be added to the *always-anonymize* section of the configuration to ensure that it is anonymized even when the `--anonymize` option is not set. Each of these steps is explained in detail later in this document.

 ## Installation

@@ -361,7 +361,7 @@ always-anonymize:

 The `always-anonymize` section lists entities that are anonymized even if the `--anonymize` flag is not set. This allows entities that catch text to be ignored to restore that text afterwards, even if anonymization is not performed.

-This section is optional and if it is omitted then the rule shown above is implemented by default. This is to provide backwards compatibility. It is recommended that this section is included for clarity. If this section is defined then it overrides the default. This means that you should explicitly include the *_IGNORE_* entity if you wish to use this entity to protect and restore text.
+This section is optional and if it is omitted then the rule shown above is implemented by default. This is to provide backwards compatibility. It is recommended that this section is included for clarity. If this section is defined then it overrides the default. This means that you should explicitly include the *\_IGNORE\_* entity if you wish to use this entity to protect and restore text.

 The always-anonymize section can be used to anonymize entities with any kind of anonymizer defined. This feature can be used for things other than restoring ignored text. You can also have multiple entities in this section if desired.

@@ -408,7 +408,7 @@ You may be wondering why this is neccessary and how the redacted text can contai

 * The label was generated by special redactor processes such as `_SPACY_` that add multiple label types and you want to share anonymizer rules for these labels.

-If Redactomatic is expecting to anonymize and entity and does not find an entry for it in the anon-map, then it will assume that the entity maps to itself. However, if an entity is defined in the key of the anon-map, then only the entities found in the map will use the anonymizer defined for that entity.
+If Redactomatic is expecting to anonymize an entity and does not find an entry for it in the anon-map, then it will assume that the entity maps to itself. However, if an entity is defined in the key of the anon-map, then only the entities found in the map will use the anonymizer defined for that entity.

 ```
 anon-map:
@@ -607,15 +607,15 @@ Redactomatic has three built-in Redactor classes. New redactors can be added by
 redactor:
   model-class: redact.RedactorRegex
   text:
-    regex: '\d{1,3}\.com'
+    regex: ['\d{1,3}\.com']
     regex-id: my-rule-id
     group: my-named-group
     flags: [ ASCII, IGNORECASE, ... ]
   voice:
     ...
 ```

-The `redact.RedactorRegex` class uses a regular expression to match the entity. The regular expression can be specified via a `regex` inline pattern, or be a shared rule with the `regex-id` key in the `regex` section. The whole regex pattern must match part or all of the phrase. Then the matching part of the phrase will be redacted with the redaction label (e.g. [MYDOMAIN-23]). It is possible to redact only part of the matching area of the phrase by specifying the `group` parameter. This can be an integer group number or a named group (using PCRE naming).
+The `redact.RedactorRegex` class uses a regular expression to match the entity. The regular expression can be specified via a set of `regex` inline patterns, or be a shared rule with the `regex-id` key in the `regex` section. The whole regex pattern must match part or all of the phrase. Then the matching part of the phrase will be redacted with the redaction label (e.g. [MYDOMAIN-23]). It is possible to redact only part of the matching area of the phrase by specifying the `group` parameter. This can be an integer group number or a named group (using PCRE naming).
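
To illustrate the `group` behaviour described above, here is a minimal Python sketch built on the standard `re` module. It is not Redactomatic's implementation: the function name `redact_group` and its signature are hypothetical, and the label is passed in as a fixed string rather than generated with an index. The idea is that the whole pattern must match, but only the span of the chosen group is replaced.

```python
import re


def redact_group(text, pattern, label, group=0, flags=re.IGNORECASE):
    """Replace the chosen group of every match with `label`.

    Hypothetical helper for illustration only. `group` may be an integer
    or a group name; 0 means the whole match.
    """
    regex = re.compile(pattern, flags)
    pieces, pos = [], 0
    for match in regex.finditer(text):
        start, end = match.span(group)
        if start == -1:
            # The group exists in the pattern but took no part in this match.
            continue
        pieces.append(text[pos:start])  # keep text before the group
        pieces.append(label)            # substitute the redaction label
        pos = end
    pieces.append(text[pos:])           # keep the tail of the text
    return "".join(pieces)


# The whole pattern must match "visit 123.com", but only the named group
# "domain" is redacted.
print(redact_group("visit 123.com today",
                   r"visit (?P<domain>\d{1,3}\.com)",
                   "[MYDOMAIN-23]",
                   group="domain"))
# → visit [MYDOMAIN-23] today
```

Python spells named groups as `(?P<name>...)`; other regex engines may use a different syntax.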

 Flags for the regular expression match can be specified via the `flags` value. This is a list of items as given below. By default [ IGNORECASE ] is used.

@@ -631,7 +631,7 @@ Flags for the regular expression match can be specified via the `flags`value. T

 - LOCALE, L

-It is possible to specify more than one regular expression for the redactor. If a list of regular expressions is specified then the redactor will attempt to match the given text against each of the patterns in turn. The matching is done in the order that the list is defined and any matching text is redacted once it is found. Matching text does not stop any subsequent patterns from also being matched on the text. For example, if a pair of patterns is specified then a given text may match one of the patterns in one part of the text and the other pattern in another part of the same text. The two matching sections cannot overlap.
+Redactomatic expects to be given a list of regular expressions for the redactor. If a list of regular expressions is specified then the redactor will attempt to match the given text against each of the patterns in turn. The matching is done in the order that the list is defined and any matching text is redacted once it is found. Matching text does not stop any subsequent patterns from also being matched on the text. For example, if a pair of patterns is specified then a given text may match one of the patterns in one part of the text and the other pattern in another part of the same text. The two matching sections cannot overlap.
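
The ordering and no-overlap rules in the paragraph above can be sketched in Python as follows. This is one possible reading, not Redactomatic's code: it assumes that "cannot overlap" means a later pattern is skipped wherever it would touch text already claimed by an earlier pattern. The function name `redact_with_patterns` is hypothetical, and a single fixed label is used instead of indexed labels.

```python
import re


def redact_with_patterns(text, patterns, label, flags=re.IGNORECASE):
    """Apply each pattern in list order and redact non-overlapping matches.

    Hypothetical helper for illustration only.
    """
    claimed = []  # (start, end) spans already redacted, in input coordinates
    for pattern in patterns:
        for match in re.finditer(pattern, text, flags):
            start, end = match.span()
            if start == end:
                continue  # ignore empty matches
            # Skip a match that overlaps a span claimed by an earlier pattern.
            if any(start < e and s < end for s, e in claimed):
                continue
            claimed.append((start, end))

    # Rebuild the text, replacing every claimed span with the label.
    pieces, pos = [], 0
    for start, end in sorted(claimed):
        pieces.append(text[pos:start])
        pieces.append(label)
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


patterns = [r"\d{1,3}\.com", r"[a-z]+\.org"]
print(redact_with_patterns("mail 12.com or shop.org", patterns, "[MYDOMAIN]"))
# → mail [MYDOMAIN] or [MYDOMAIN]
```

Both patterns match, each in a different part of the same text, which is the case the paragraph describes.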

 #### redact.RedactorPhraseList

@@ -716,22 +716,37 @@ The `redact.RedactorSpacy` class implements the redaction of text using the Spac

 Redactomatic currently has several built-in anonymizer classes. There are four generic anonymizers and a number of custom anonymizers for specific entities.

+### Turning off persistence
+
+The *persist* rule is universal to all anonymizer classes. This rule will also be inherited by any custom anonymizers that you add.
+
+```
+...
+anonymizer:
+  model-class: ..all-model-classes..
+  persist: False
+```
+
+By default the *persist* rule has the value *True* and does not need to be defined. With this default, replacements of entities with the same index number are anonymized with the same value. For example, the redaction label [Name-99] will always be anonymized with the first random value assigned to it in any given conversation, regardless of how many times it occurs.
+
+If the persist value is set to False then this entity will always be given a new random value even if the same index is repeated through the conversation. This can be helpful when redacting things that are too generic to reliably be the same entity. For example, imagine a redaction rule that redacts all isolated digits. It may not be desirable to anonymize all redacted digits with the same anonymized digit.
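
The difference between the two *persist* settings can be sketched with a small per-conversation cache. This is an illustrative Python sketch only: the class `PersistentAnonymizer`, its method names, and the way a conversation is identified are assumptions, not Redactomatic's actual classes.

```python
import itertools


class PersistentAnonymizer:
    """Hypothetical wrapper showing what a persist flag could control."""

    def __init__(self, generate, persist=True):
        self.generate = generate  # callable that returns a new random value
        self.persist = persist
        self._seen = {}           # (conversation_id, label) -> value

    def anonymize(self, conversation_id, label):
        if not self.persist:
            # persist: False -> always draw a new value
            return self.generate()
        key = (conversation_id, label)
        if key not in self._seen:
            # First occurrence in this conversation: remember the value
            self._seen[key] = self.generate()
        return self._seen[key]


# A counter stands in for a random generator so the output is predictable.
counter = itertools.count(1)
make_value = lambda: f"Person-{next(counter)}"

sticky = PersistentAnonymizer(make_value)
print(sticky.anonymize("conv-1", "[Name-99]"))  # → Person-1
print(sticky.anonymize("conv-1", "[Name-99]"))  # → Person-1
print(sticky.anonymize("conv-2", "[Name-99]"))  # → Person-2

fresh = PersistentAnonymizer(make_value, persist=False)
print(fresh.anonymize("conv-1", "[Digit-1]"))   # → Person-3
print(fresh.anonymize("conv-1", "[Digit-1]"))   # → Person-4
```

Keying the cache on both the conversation and the label keeps values stable within a conversation without leaking them into the next one.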
+
 #### anonymize.AnonRegex

 ```
 ...
 anonymizer:
   model-class: anonymize.AnonRegex
   text:
-    regex: '[a-z]{0,16}\.com'
+    regex: ['[a-z]{0,16}\.com']
     regex-id: my-rule-id
     limit: 10
     flags: [ IGNORECASE ]
   voice:
     ...
 ```

-The `anonymize.AnonRegex` class is used to generate random text strings using a regular expression as a generative grammar. The regular expressions can be expressed inline via the `regex` parameter or by reference rules in the `regex` section using the `regex-id` parameter. The `limit` parameter defines the maximum number of repeats that a repeating pattern will be permitted to follow before terminating. This prevents infinite loops and can be used to limit computationally costly patterns. This is set to 10 by default. For more details see [xeger PyPI](https://pypi.org/project/xeger/). The `flags` parameter behaves as described for `redact.RedactorRegex`. This class does not support PCRE.
+The `anonymize.AnonRegex` class is used to generate random text strings using a regular expression as a generative grammar. The regular expressions can be expressed inline via the `regex` parameter or by reference rules in the `regex` section using the `regex-id` parameter. Both are shown in the example above, but define only one of these in each anonymizer. The `limit` parameter defines the maximum number of repeats that a repeating pattern will be permitted to follow before terminating. This prevents infinite loops and can be used to limit computationally costly patterns. This is set to 10 by default. For more details see [xeger PyPI](https://pypi.org/project/xeger/). The `flags` parameter behaves as described for `redact.RedactorRegex`. This class does not support PCRE.