@achalddave

Some datasets (e.g., YFCC) have newlines in captions, which causes pyarrow's csv module to error by default. This PR allows passing --newlines-in-captions True to img2dataset, which in turn tells pyarrow to allow newlines in CSV values.

The YFCC-15M descriptions can have newlines in the caption, which
causes pyarrow's csv module to error by default. This commit allows
passing --newlines-in-captions True to img2dataset, which will tell
pyarrow to allow newlines in CSV values.
@achalddave achalddave force-pushed the newlines-in-captions branch from 7820b66 to 0e15d4a Compare March 8, 2023 22:49
@rom1504
Owner

rom1504 commented Apr 23, 2023

Could you add an example of a dataset for which this is needed, please?

@achalddave
Author

I needed this for YFCC 100M - did you want that in the README/in the repo somewhere?

@rom1504
Owner

rom1504 commented May 28, 2023

Yes, if you could add it in https://github.com/rom1504/img2dataset/tree/main/dataset_examples that would be great.

@ldfandian
Contributor

ldfandian commented Jul 3, 2023

I also need this. I have a crawler that gives me many raw web image-text pairs with newlines in the text title.
Looking forward to this being merged, @achalddave.

@rom1504
Owner

rom1504 commented Jul 15, 2023

Could you please rebase on head and resolve the conflicts?
