[WIP] Add BEST I corpus reader #446

Closed
wannaphong wants to merge 2 commits into dev from add-best-corpus-reader

@wannaphong
Member

@wannaphong wannaphong commented Jun 28, 2020

I wrote a BEST I corpus reader. It can read the BEST I corpus and return either a list of words or CoNLL-format output for NER.

BEST I Corpus: nectec.or.th/corpus/index.php?league=pm
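The reader's code is not shown in this thread, so the following is only a minimal sketch of what "a list of words or CoNLL for NER" could look like. It assumes a BEST-style line separates words with `|` and wraps named entities in `<NE>...</NE>` (with `<AB>...</AB>` for abbreviations); the function names `parse_best_line`, `to_words`, and `to_conll` are hypothetical and are not the PR's API.

```python
import re

# Assumption: words are delimited by "|" and entity spans are wrapped in
# <NE>...</NE>; <AB> tags are stripped and treated as ordinary words.
TAG_RE = re.compile(r"</?(?:NE|AB)>")


def parse_best_line(line):
    """Return (word, tag) pairs for one line, tagged B-NE / I-NE / O."""
    pairs = []
    inside_ne = False    # True while between <NE> and </NE>
    first_in_ne = False  # True until the first word of the entity is emitted
    for token in line.strip().split("|"):
        if not token:
            continue  # empty piece, e.g. after a trailing "|"
        if token.startswith("<NE>"):
            inside_ne, first_in_ne = True, True
        closes_ne = token.endswith("</NE>")
        word = TAG_RE.sub("", token)
        if word:
            if inside_ne:
                tag = "B-NE" if first_in_ne else "I-NE"
                first_in_ne = False
            else:
                tag = "O"
            pairs.append((word, tag))
        if closes_ne:
            inside_ne = False
    return pairs


def to_words(line):
    """Return only the words of a line, with all tags removed."""
    return [word for word, _ in parse_best_line(line)]


def to_conll(line):
    """Return one 'word<TAB>tag' row per word, CoNLL style."""
    return "\n".join(f"{word}\t{tag}" for word, tag in parse_best_line(line))


if __name__ == "__main__":
    sample = "<NE>สมชาย</NE>|ไป|ตลาด|"
    print(to_words(sample))
    print(to_conll(sample))
```

Keeping the parsing in one function and deriving both output shapes from it means the word list and the CoNLL rows cannot drift apart.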

TODO

  • Add Test

  • Add docs

@pep8speaks

pep8speaks commented Jun 28, 2020

Hello @wannaphong! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 79:1: E402 module level import not at top of file

Line 49:80: E501 line too long (97 > 79 characters)
Line 66:1: W293 blank line contains whitespace

Comment last updated at 2020-12-29 12:15:43 UTC

@wannaphong wannaphong added this to the 2.3 milestone Jun 28, 2020
@coveralls

coveralls commented Jun 28, 2020

Coverage increased (+0.9%) to 95.871% when pulling ead62c8 on add-best-corpus-reader into e69a90e on dev.

@p16i
Contributor

p16i commented Jun 28, 2020

How would you bypass the login? I thought about doing this but I'm not sure how I should overcome the login.

Is it possible to download the data and host it elsewhere?

I've seen a couple of frameworks do something similar: when you download a dataset that has a license, the framework automatically prompts you to consent to that license.

@bact, any suggestions on this?

@bact
Member

bact commented Jun 29, 2020

> How would you bypass the login? I thought about doing this but I'm not sure how I should overcome the login.
>
> Is it possible to download the data and host it elsewhere?
>
> I've seen a couple of frameworks do something similar: when you download a dataset that has a license, the framework automatically prompts you to consent to that license.
>
> @bact, any suggestions on this?

We can do that as well.

This may be related to #385.

I'm still thinking. This is what I currently have in mind, still not certain:

  • For installing an external dataset (one that does not come from the PyThaiNLP project and requires the user to accept an additional license agreement), we should have either an interactive prompt or a configuration file that states acceptance for each dataset the user wishes to use.

    • At installation time, the configuration file is consulted. If an acceptance is found, the dataset is installed without a prompt.
    • If no acceptance is found in the config file, a prompt may be shown. We should also think about the unattended case, where no human is present: there we should raise an exception or return an error exit code.
  • For installing any dataset that comes from the PyThaiNLP project, we may install it without a prompt, given that the license agreement was already accepted when the user downloaded and installed PyThaiNLP.
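A minimal sketch of the flow outlined in the list above, under stated assumptions: the config file is JSON with an `accepted_licenses` mapping, and the names `ensure_license_accepted`, `accepted_in_config`, and `LicenseNotAcceptedError` are hypothetical. None of this is existing PyThaiNLP API.

```python
import json
import sys
from pathlib import Path


class LicenseNotAcceptedError(RuntimeError):
    """Raised when a license is not accepted and cannot be confirmed."""


def accepted_in_config(dataset, config_path):
    """Return True if the config file records acceptance for `dataset`."""
    path = Path(config_path)
    if not path.exists():
        return False
    with path.open(encoding="utf-8") as f:
        config = json.load(f)
    # Assumed layout: {"accepted_licenses": {"<dataset name>": true}}
    return bool(config.get("accepted_licenses", {}).get(dataset, False))


def ensure_license_accepted(dataset, config_path, external=True,
                            interactive=None, ask=input):
    """Return True if installation may proceed, otherwise raise."""
    # Datasets from the project itself are installed without a prompt.
    if not external:
        return True
    # An acceptance recorded in the config file skips the prompt.
    if accepted_in_config(dataset, config_path):
        return True
    if interactive is None:
        interactive = sys.stdin is not None and sys.stdin.isatty()
    # Unattended run: fail loudly instead of waiting for input.
    if not interactive:
        raise LicenseNotAcceptedError(
            f"License for '{dataset}' has not been accepted in {config_path}"
        )
    answer = ask(f"Accept the license for '{dataset}'? [y/N] ")
    if answer.strip().lower() not in ("y", "yes"):
        raise LicenseNotAcceptedError(f"License for '{dataset}' was declined")
    return True
```

A command-line entry point could catch `LicenseNotAcceptedError` and exit with a non-zero status, which covers the "error exit code" option for scripted installs.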

@p16i
Contributor

p16i commented Jun 29, 2020

I think prompting is quite intuitive. If the user says yes, we can also create a file that records their agreement. If this file already exists, the prompt can be bypassed; for example, the user can create the file manually.
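The idea of recording the answer in a file could be sketched as follows. The marker-file name and the helpers `marker_path` and `confirm_license` are hypothetical choices for illustration, not an agreed design.

```python
from pathlib import Path


def marker_path(dataset, base_dir):
    """Location of the (hypothetical) marker file for one dataset."""
    return Path(base_dir) / f"{dataset}.license-accepted"


def confirm_license(dataset, base_dir, ask=input):
    """Return True if the license is accepted, prompting only when needed."""
    marker = marker_path(dataset, base_dir)
    # An existing marker, written earlier or created by hand, skips the prompt.
    if marker.exists():
        return True
    answer = ask(f"Accept the license for '{dataset}'? [y/N] ")
    if answer.strip().lower() in ("y", "yes"):
        # Remember the answer so the next call does not prompt again.
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.write_text("accepted\n", encoding="utf-8")
        return True
    return False
```

Because only the existence of the file is checked, a user preparing an unattended environment can create the marker ahead of time, for instance with `touch`.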

@p16i
Contributor

p16i commented Jun 29, 2020

One quick question: should we implement this functionality in the main repo?
It seems to me that a separate repo might be better in terms of releasing.

@wannaphong
Member Author

> One quick question: should we implement this functionality in the main repo?
> It seems to me that a separate repo might be better in terms of releasing.

I agree. It might be better in terms of both release and management.

@p16i
Contributor

p16i commented Jun 30, 2020

I've actually tried to implement something similar, which also includes government API wrappers. Here is the project.

https://github.com/codeforthailand/databuri

I can transfer it to pythainlp. We might consider removing the API wrappers and implementing only the dataset-related parts.

What do you think? I actually like the name, which is why I mention it here. We can also do everything from scratch.

@bact @wannaphong what do you think?

@wannaphong
Member Author

> I've actually tried to implement something similar, which also includes government API wrappers. Here is the project.
>
> https://github.com/codeforthailand/databuri
>
> I can transfer it to pythainlp. We might consider removing the API wrappers and implementing only the dataset-related parts.
>
> What do you think? I actually like the name, which is why I mention it here. We can also do everything from scratch.
>
> @bact @wannaphong what do you think?

This project is very interesting. I think we can continue to develop this project.

@bact @cstorm125 @korakot what do you think?

@bact
Member

bact commented Jul 2, 2020

One difference is that one accesses data over the network (an internet connection is required), while the other gets it from a local file.

The nature of the data is probably different as well. One is more structured? The other is more like text?

I think in general it's good to provide a library that makes it easier to access the data, to increase the use of existing data. But I'm not sure about the design yet.

@cstorm125
Member

Adding BEST for word tokenization to huggingface/datasets: huggingface/datasets#1385

@wannaphong
Member Author

> Adding BEST for word tokenization to huggingface/datasets: huggingface/datasets#1385

I think I will close this pull request, because you can use 🤗 Datasets to read Thai corpora.

@cstorm125
Member

cstorm125 commented Dec 29, 2020

> Adding BEST for word tokenization to huggingface/datasets: huggingface/datasets#1385
>
> I think I will close this pull request, because you can use 🤗 Datasets to read Thai corpora.

🤗 Datasets is a new hope for big success in Thai NLP.

@wannaphong wannaphong closed this Dec 29, 2020
@wannaphong wannaphong reopened this Dec 29, 2020
@bact bact added the enhancement label Jan 7, 2021
@wannaphong wannaphong modified the milestones: 2.3, Future Feb 16, 2021
@wannaphong wannaphong closed this Mar 13, 2021
@wannaphong wannaphong deleted the add-best-corpus-reader branch January 21, 2025 17:11