I'm trying to adapt TransformerSum to a non-English custom dataset and am currently quite confused about this code in extractive.py:
TransformerSum/src/extractive.py
Lines 1093 to 1107 in 15bd11d
```python
if tokenized:
    src_txt = [
        " ".join([token.text for token in sentence if str(token) != "."]) + "."
        for sentence in input_sentences
    ]
else:
    nlp = English()
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer)
    src_txt = [
        " ".join([token.text for token in nlp(sentence) if str(token) != "."])
        + "."
        for sentence in input_sentences
    ]
```
- Why separate the words with spaces, when the resulting string is then tokenized using the tokenizer from the transformers library? I assume those tokenizers are not usually trained on pre-tokenized text, and neither are the pretrained models?
- Why remove the space before "." characters, but not anywhere else?
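To make the first question concrete: joining word tokens with spaces produces a string that differs from the natural text, so the subword tokenizer sees different input (how much this changes the resulting subword splits depends on the specific tokenizer). A tiny invented example:

```python
# Invented example: the same sentence as natural text vs. rebuilt from
# word tokens joined with spaces, as the snippet above does.
natural = "It's a test."
tokens = ["It", "'s", "a", "test"]   # typical word-tokenizer output
rebuilt = " ".join(tokens) + "."     # what the snippet produces

print(rebuilt)             # It 's a test.
assert natural != rebuilt  # the subword tokenizer sees a different string
```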
Thanks for any explanations.