This repository was archived by the owner on Sep 18, 2024. It is now read-only.

char-level Tokenizer.sequences_to_texts() inserts additional spaces #346

@XiYuan68

Description

  • Check that you are up-to-date with the master branch of keras-preprocessing. You can update with:
    pip install git+git://github.com/keras-team/keras-preprocessing.git --upgrade --no-deps

  • Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

Describe the problem.

from tensorflow.keras.preprocessing.text import Tokenizer

text = ['abc def']
tokenizer = Tokenizer(char_level=True, split='')
tokenizer.fit_on_texts(text)
sequence = tokenizer.texts_to_sequences(text)
text_after = tokenizer.sequences_to_texts(sequence)

print(text_after)
# output: ['a b c   d e f']

Notice that text_after and text differ: additional spaces are inserted between the characters.

Describe the expected behavior.

text_after should be the same as text.

I believe this line is where the problem is. Replacing:

vect = ' '.join(vect)

with

vect = self.split.join(vect)

will fix the bug in my minimal case above.
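
Until that change lands, a minimal workaround sketch is shown below. The helper name sequences_to_texts_with_split is hypothetical, and unlike the library's sequences_to_texts() it does not handle num_words or oov_token; it simply rebuilds each text from tokenizer.index_word and joins the tokens with the tokenizer's own split string:

from tensorflow.keras.preprocessing.text import Tokenizer

def sequences_to_texts_with_split(tokenizer, sequences):
    # Join the recovered tokens with tokenizer.split ('' in this char-level case)
    # instead of the hard-coded ' ' that causes the extra spaces.
    return [
        tokenizer.split.join(
            tokenizer.index_word[i] for i in seq if i in tokenizer.index_word
        )
        for seq in sequences
    ]

text = ['abc def']
tokenizer = Tokenizer(char_level=True, split='')
tokenizer.fit_on_texts(text)
sequence = tokenizer.texts_to_sequences(text)

print(sequences_to_texts_with_split(tokenizer, sequence))
# expected output: ['abc def']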

Labels: text (Related to text)