Description
Hi, and thanks for your great library.
I made a CLI program to run it on my own texts.
I'm trying to add a subclass that lets me feed it sentences that don't begin with a capital letter and might begin with stars, bullets, etc. I made a subclass (modeled on your NewlineText) to modify the regexes in split_into_sentences(), changing the lookahead that requires an initial capital letter after a sentence end (splitters.py, line 45) to read r"\s+(?=[-•\w‘’“”'*\|/~\",])", and I added a few more punctuation marks to the preceding regexes (hyphen, ellipsis/triple periods).
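To make the idea concrete, here is a minimal sketch of that kind of loosened splitting, using only Python's re module. It is a simplified stand-in, not the library's split_into_sentences() and not the exact subclass code: the names SPLIT_PAT and split_loose are hypothetical, and it assumes that any whitespace after end punctuation marks a boundary, so it does none of the abbreviation checking that splitters.py does.

```python
import re

# Hypothetical pattern: end punctuation (including the ellipsis character and
# runs such as "..."), optional closing quotes or brackets, then whitespace
# that is followed by any permitted sentence opener (bullet, star, lowercase
# letter, quote, etc.) instead of a mandatory capital letter.
SPLIT_PAT = re.compile(r"([.?!…]+[’”'\")\]]*)\s+(?=[-•\w‘’“”'*|/~\",])")


def split_loose(text):
    """Split text into sentences without requiring an initial capital."""
    sentences = []
    start = 0
    for match in SPLIT_PAT.finditer(text):
        # Keep the end punctuation with the sentence it closes.
        sentences.append(text[start:match.end(1)].strip())
        # The next sentence starts after the separating whitespace.
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "* first item here. • second one begins lowercase... and a third? yes."
print(split_loose(sample))
```

In a subclass modeled on NewlineText, a function like this would presumably be what the overridden sentence_split method returns, so that the model is built from these sentences rather than from the default splitter's output.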
It works if I manually generate a corpus and Markov model from one of my texts, but not if I run my program using the subclass: one "sentence" will have a period in the middle of it and will continue printing text after it.
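One way to narrow this down might be to check whether the strings that go into the model already contain a sentence break in the middle, which would suggest the modified splitter is not the one being used on that code path. A minimal sketch, assuming a break looks like end punctuation followed by whitespace and more text; INTERIOR_END and has_interior_break are hypothetical names, not part of the library:

```python
import re

# Hypothetical check: end punctuation, then whitespace, then more text.
INTERIOR_END = re.compile(r"[.?!…]\s+\S")


def has_interior_break(sentence):
    """Return True if a supposed single sentence still contains a break."""
    return INTERIOR_END.search(sentence) is not None


# Example: run the check over whatever list of sentences the splitter produced.
candidates = ["one thing. another thing", "just one thing."]
flagged = [s for s in candidates if has_interior_break(s)]
print(flagged)
```

If sentences are flagged in the corpus itself, the problem is likely in the splitting step; if the corpus is clean and only the generated output is flagged, the cause is more likely elsewhere.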
So I wanted to ask: is there anything in the way sentences are made from the Markov model that would affect these modified regexes or disregard them? And is there a better way to go about modifying sentence endings than messing with split_into_sentences()?
[Sorry if it's obvious in the code. I'm very much a novice with programming.]