docs/source/bricks.rst (+12 -1)
@@ -250,6 +250,12 @@ for consideration as narrative text. The function performs the following checks
   ``cap_threshold=1.0``. You can also set the threshold by using the
   ``UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD`` environment variable. The environment variable
   takes precedence over the kwarg.
+* If the text contains too many non-alpha characters, it is
+  not narrative text.
+  The default is to expect a minimum of 75% alpha characters
+  (not counting spaces). You can change the minimum value with the
+  ``non_alpha_ratio`` kwarg or the ``UNSTRUCTURED_NARRATIVE_TEXT_NON_ALPHA_RATIO`` environment variable.
+  The environment variable takes precedence over the kwarg.
 * The cap ratio test does not apply to text that is all uppercase.
 
 
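The bullet added in this hunk describes a ratio check with an environment-variable override. The following is a minimal sketch of that behavior, not the library's implementation. The helper names `alpha_ratio` and `passes_alpha_check` and the parameter name `min_alpha_ratio` are hypothetical. Only the environment variable name and the 75% default come from the text above.

```python
import os

# Environment variable name taken from the documentation text.
ENV_VAR = "UNSTRUCTURED_NARRATIVE_TEXT_NON_ALPHA_RATIO"


def alpha_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def passes_alpha_check(text: str, min_alpha_ratio: float = 0.75) -> bool:
    """Hypothetical check: True if the text has enough alphabetic characters.

    If the environment variable is set, it overrides the keyword argument,
    mirroring the precedence rule described in the docs.
    """
    threshold = float(os.environ.get(ENV_VAR, min_alpha_ratio))
    return alpha_ratio(text) >= threshold
```

Under these assumptions, a string such as `"1234 5678 ab"` has an alpha ratio of 0.2 and fails the default check, while setting the environment variable to a lower value lets it pass regardless of the keyword argument.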
@@ -280,9 +286,14 @@ for consideration as a title. The function performs the following checks:
 
 * Empty text cannot be a title
 * Text that is all numeric cannot be a title.
-* If a title contains too many words it is not a title. The default max length is ``15``. You can change the max length with
+* If a title contains too many words it is not a title. The default max length is ``12``. You can change the max length with
   the ``title_max_word_length`` kwarg or the ``UNSTRUCTURED_TITLE_MAX_WORD_LENGTH`` environment variable. The environment
   variable takes precedence over the kwarg.
+* If the text contains too many non-alpha characters, it is not a
+  title. The default is to expect a minimum of 75% alpha characters
+  (not counting spaces). You can change the minimum value with the
+  ``non_alpha_ratio`` kwarg or the ``UNSTRUCTURED_TITLE_NON_ALPHA_RATIO`` environment variable.
+  The environment variable takes precedence over the kwarg.
 * Narrative text must contain at least one English word (if ``language`` is set to "en")
 * If a title contains more than one sentence that exceeds a certain length, it cannot be a title. Sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.
 * If a segment of text ends in a comma, it is not considered a potential title. This is to avoid salutations like "To My Dearest Friends," getting flagged as titles.
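Several of the title checks listed above are simple enough to sketch together. This is an illustrative sketch only, not the library's code. The function name `could_be_title` is hypothetical, and it covers just the empty, all-numeric, word-count, and trailing-comma rules. The default of 12 words and the `UNSTRUCTURED_TITLE_MAX_WORD_LENGTH` variable come from the text.

```python
import os


def could_be_title(text: str, title_max_word_length: int = 12) -> bool:
    """Hypothetical subset of the title checks described in the docs."""
    # The environment variable, when set, takes precedence over the kwarg.
    limit = int(
        os.environ.get("UNSTRUCTURED_TITLE_MAX_WORD_LENGTH", title_max_word_length)
    )
    stripped = text.strip()
    if not stripped:
        return False  # empty text cannot be a title
    if stripped.isnumeric():
        return False  # all-numeric text cannot be a title
    if len(stripped.split()) > limit:
        return False  # too many words
    if stripped.endswith(","):
        return False  # salutations such as "To My Dearest Friends,"
    return True
```

With the variable unset, `could_be_title("To My Dearest Friends,")` returns `False` because of the trailing comma, and a 13-word string is rejected by the default limit of 12.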