Question about preproc.py, quote sub is different to fasttext get-wikimedia.sh

I think I found a bug in the `preproc.py` script for string normalization here: https://gist.github.com/bittlingmayer/7139a6a75ba0dbbc3a06325394ae3a13#file-ft_wiki_preproc-py-L17

![image](https://user-images.githubusercontent.com/480395/195101975-f2dbc0d1-e116-4746-8f59-c27018ac9033.png)

The Python script appears to replace each double quote with a space, dropping the quote entirely, whereas the original sed rule surrounds the double quote with spaces and keeps it as its own token. Is this an error?
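To make the difference concrete, here is a minimal sketch of the two behaviors as I understand them. The function names are hypothetical, and both substitutions are written from my reading of the two scripts rather than copied from either one, so treat this as an illustration only.

```python
import re


def quote_to_space(text):
    # Behavior I believe preproc.py has: the quote is replaced by a space,
    # so it disappears from the token stream. (Assumption, not copied code.)
    return re.sub(r'"', ' ', text)


def quote_padded(text):
    # Behavior I believe the sed rule has: the quote is kept and padded
    # with spaces, so it becomes a separate token. (Assumption as well.)
    return re.sub(r'"', ' " ', text)


sample = 'he said "hi"'
print(quote_to_space(sample).split())  # quote tokens are gone
print(quote_padded(sample).split())    # quote tokens are kept
```

If the goal is to reproduce the tokenization the published vectors were trained on, the padded variant is the one that keeps `"` in the vocabulary.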

Unrelated (but since I have your attention ;) ), do you know whether the same preprocessing rules were used for the Common Crawl models? I'm asking specifically about the `cc.en.300.bin` fastText model.


