
Why tokenize twice? #73

@saeub

I'm trying to adapt TransformerSum to a non-English custom dataset, and I'm currently quite confused by this code in `extractive.py`:

```python
if tokenized:
    src_txt = [
        " ".join([token.text for token in sentence if str(token) != "."]) + "."
        for sentence in input_sentences
    ]
else:
    nlp = English()
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer)
    src_txt = [
        " ".join([token.text for token in nlp(sentence) if str(token) != "."])
        + "."
        for sentence in input_sentences
    ]
```

- Why join the words with spaces when the resulting string is then tokenized again with a tokenizer from the `transformers` library? I assume those tokenizers are not usually trained on pre-tokenized text, and neither are the pretrained models? (See the sketch below.)
- Why remove the space before "." characters, but nowhere else?
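
To illustrate the first point: space-sensitive tokenizers (e.g. the byte-level BPE used by RoBERTa or GPT-2) can split the space-joined string differently from the original text. A minimal sketch, using `roberta-base` as an arbitrary example model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

original = "It's a test."
space_joined = "It 's a test."  # spaCy-style tokens joined with spaces

# Byte-level BPE encodes the leading space into the token itself,
# so the two strings typically yield different subword sequences.
print(tokenizer.tokenize(original))
print(tokenizer.tokenize(space_joined))
```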

Thanks for any explanations.
