README.md (+48 -24)
@@ -155,30 +155,31 @@ cargo run --release -- -l en -d ../texts/ extract-file >> file.en.txt
The following rules can be configured per language. Add a `<language>.toml` file in the `rules` directory to enable a new locale. Note that the `replacements` get applied before any other rules are checked.
-| allowed_symbols_regex | Regex of allowed symbols or letters. Each character gets matched against this pattern. | String Array | not used
-| broken_whitespace | Array of broken whitespaces. This could for example disallow two spaces following each other | String Array | all types of whitespaces allowed
-| disallowed_symbols | Use `allowed_symbols_regex` instead. Array of disallowed symbols or letters. Only used when allowed_symbols_regex is not set or is an empty String. | String Array | all symbols allowed
-| disallowed_words | Array of disallowed words. Prefer the blocklist approach when possible. | String Array | all words allowed
-| even_symbols | Symbols that always need an even count | Char Array | []
-| matching_symbols | Symbols that map to another | Array of matching configurations: each configuration is an Array of two values: `["match", "match"]`. See example below. | []
-| max_word_count | Maximum number of words in a sentence | integer | 14
-| may_end_with_colon | If a sentence can end with a : or not | boolean | false
-| min_characters | Minimum of character occurrences | integer | 0
-| max_characters | Maximum of character occurrences | integer | MAX
-| min_trimmed_length | Minimum length of string after trimming | integer | 3
-| min_word_count | Minimum number of words in a sentence | integer | 1
-| needs_letter_start | If a sentence needs to start with a letter | boolean | true
-| needs_punctuation_end | If a sentence needs to end with a punctuation | boolean | false
-| needs_uppercase_start | If a sentence needs to start with an uppercase | boolean | false
-| other_patterns | Regex to disallow anything else | Rust Regex Array | all other patterns allowed
-| quote_start_with_letter | If a quote needs to start with a letter | boolean | true
-| remove_brackets_list | Removes (possibly nested) user defined brackets and content inside them `(anything [else])` from the sentence before replacements and checking other rules | Array of matching brackets: each configuration is an Array of two values: `["opening_bracket", "closing_bracket"]`. See example below. | []
-| replacements | Replaces abbreviations or other words according to configuration. This happens before any other rules are checked. | Array of replacement configurations: each configuration is an Array of two values: `["search", "replacement"]`. See example below. | nothing gets replaced
-| segmenter | Segmenter to use for this language. See below for more information. | "python" | using `rust-punkt` by default
-| stem_separator_regex | If given, splits words at the given characters to reach the stem words to check them again against the blacklist, e.g. prevents "Rust's" to pass if "Rust" is in the blacklist. | Simple regex of separators, e.g. for apostrophe `stem_separator_regex = "[']"` | ""
+| allowed_symbols_regex | Regex of allowed symbols or letters. Each character gets matched against this pattern. | String Array | not used
+| broken_whitespace | Array of broken whitespaces. This could for example disallow two spaces following each other | String Array | all types of whitespaces allowed
+| disallowed_symbols | Use `allowed_symbols_regex` instead. Array of disallowed symbols or letters. Only used when allowed_symbols_regex is not set or is an empty String. | String Array | all symbols allowed
+| disallowed_words | Array of disallowed words. Prefer the blocklist approach when possible. | String Array | all words allowed
+| even_symbols | Symbols that always need an even count | Char Array | []
+| matching_symbols | Symbols that map to another | Array of matching configurations: each configuration is an Array of two values: `["match", "match"]`. See example below. | []
+| max_word_count | Maximum number of words in a sentence | integer | 14
+| may_end_with_colon | If a sentence can end with a : or not | boolean | false
+| min_characters | Minimum of character occurrences | integer | 0
+| max_characters | Maximum of character occurrences | integer | MAX
+| min_trimmed_length | Minimum length of string after trimming | integer | 3
+| min_word_count | Minimum number of words in a sentence | integer | 1
+| needs_letter_start | If a sentence needs to start with a letter | boolean | true
+| needs_punctuation_end | If a sentence needs to end with a punctuation | boolean | false
+| needs_uppercase_start | If a sentence needs to start with an uppercase | boolean | false
+| other_patterns | Regex to disallow anything else | Rust Regex Array | all other patterns allowed
+| quote_start_with_letter | If a quote needs to start with a letter | boolean | true
+| remove_brackets_list | Removes (possibly nested) user defined brackets and content inside them `(anything [else])` from the sentence before replacements and checking other rules | Array of matching brackets: each configuration is an Array of two values: `["opening_bracket", "closing_bracket"]`. See example below. | []
+| replacements | Replaces abbreviations or other words according to configuration. This happens before any other rules are checked. | Array of replacement configurations: each configuration is an Array of two values: `["search", "replacement"]`. See example below. | nothing gets replaced
+| regex_replacement_list | Finds matches of a regex and makes replacements within the matched text. This happens before any other rules are checked. | Array of configurations: each configuration is an Array of three values: `["regex", "search", "replacement"]`. See example below. | nothing gets replaced
+| segmenter | Segmenter to use for this language. See below for more information. | "python" | using `rust-punkt` by default
+| stem_separator_regex | If given, splits words at the given characters to reach the stem words to check them again against the blacklist, e.g. prevents "Rust's" from passing if "Rust" is in the blacklist. | Simple regex of separators, e.g. for apostrophe `stem_separator_regex = "[']"` | ""
### Example for `matching_symbols`
@@ -239,6 +240,29 @@ Input: I am foo test a test
Output: I am hi a hi
```

+### Example for `regex_replacement_list`
+
+```
+regex_replacement_list = [
+  # Split glued sentences
+  ["\\ [a-z]{3,}\\.[A-Z][a-z]{2,}\\ ", ".", ". "],
+
+  # Split long sentences
+  ["\\b(?:\\S+\\s+){15,}\\S+[.!?]", ", but ", ". But "],
+]
+```
+
+The first rule finds two sentences glued together without a space after the period and inserts the missing space.
+The second rule finds a long sentence (more than 15 words) and splits it into two shorter ones at `, but `.
+
+```
+Input: A sentence.Glued to another.
+Output: A sentence. Glued to another.
+
+Input: A first part of a long sentence that would be rejected, but in fact it could be used.
+Output: A first part of a long sentence that would be rejected. But in fact it could be used.
+```
+
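The behaviour described for `regex_replacement_list` can be sketched as follows. This is a Python illustration only, not the project's Rust implementation: the helper name `apply_regex_replacements` is hypothetical, and the sketch assumes that `search` is treated as a plain string and replaced only inside the text matched by `regex`. The patterns are written as raw strings, which is what the TOML values above become once the doubled backslashes are unescaped.

```python
import re

# Rules mirror the README example; each entry is (regex, search, replacement).
RULES = [
    # Split glued sentences: "word.Word" surrounded by spaces.
    (r"\ [a-z]{3,}\.[A-Z][a-z]{2,}\ ", ".", ". "),
    # Split long sentences: 16 or more whitespace-separated words.
    (r"\b(?:\S+\s+){15,}\S+[.!?]", ", but ", ". But "),
]


def apply_regex_replacements(sentence, rules=RULES):
    """Hypothetical helper: replace `search` only inside regex matches."""
    for pattern, search, replacement in rules:
        sentence = re.sub(
            pattern,
            # Rewrite just the matched span; text outside it is untouched.
            lambda m, s=search, r=replacement: m.group(0).replace(s, r),
            sentence,
        )
    return sentence


print(apply_regex_replacements("A sentence.Glued to another."))
# prints "A sentence. Glued to another."
```

Restricting the replacement to the matched span is what keeps the second rule from touching short sentences: `, but ` is rewritten only when the surrounding sentence is long enough to match the regex.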
## Using disallowed words
In order to increase the quality of the final output, you might want to consider filtering out some words that are complex, too long or non-native.