[WIP] Add BEST I corpus reader#446
|
Hello @wannaphong! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-12-29 12:15:43 UTC |
|
How would you bypass the login? I thought about doing this, but I'm not sure how to get past the login. Is it possible to download the data and host it elsewhere? I've seen a couple of frameworks do something similar: when you download a licensed dataset, the framework automatically prompts you to consent to the data's license. @bact any suggestions on this? |
We can do that as well. This may be related to #385. I'm still thinking about it. This is what I currently have in mind, though I'm still not certain:
|
|
I think prompting is quite intuitive. If the user says yes, we can also create a file that records their agreement. If this file already exists, the prompt can be bypassed, so the user can also create the file manually. |
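To make the idea concrete, here is a minimal sketch of that flow. The function name `has_consent`, the marker file, and the prompt wording are all hypothetical; nothing here is an existing PyThaiNLP API.

```python
from pathlib import Path


def has_consent(marker: Path, ask=input) -> bool:
    """Return True if the user has agreed to the dataset license.

    `marker` is a hypothetical file that records the agreement.
    `ask` defaults to input() and can be swapped out for testing.
    """
    # An existing marker file means the user already agreed
    # (or created the file manually), so skip the prompt.
    if marker.exists():
        return True

    answer = ask("Do you accept the dataset license? [y/N] ")
    if answer.strip().lower() in ("y", "yes"):
        # Record the agreement so the next call does not prompt again.
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.write_text("accepted\n", encoding="utf-8")
        return True
    return False
```

A caller would check `has_consent(some_path)` before downloading and stop if it returns `False`. Where the marker file should live is left open here.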
|
One quick question: should we implement this functionality in this main repo? |
I agree. It might be better in terms of both release and management. |
|
I've actually tried to implement something similar, which also includes government API wrappers. Here is the project: https://github.com/codeforthailand/databuri. I can transfer it to pythainlp. We might consider removing the API wrappers and implementing only the dataset-related parts. I actually like the name, which is why I mention it here. We can also do everything from scratch. @bact @wannaphong what do you think? |
This project is very interesting. I think we can continue to develop this project. @bact @cstorm125 @korakot what do you think? |
|
One difference is that one accesses data over the network (an internet connection is required) while the other reads it from a local file. The nature of the data is probably different as well: one is more structured, the other more like text? In general, I think it's good to provide a library that makes the data easier to access, to increase the use of existing data. But I'm not sure about the design yet. |
|
Adding BEST for word tokenization to huggingface/datasets huggingface/datasets#1385 |
I think I will close this pull request, because you can use 🤗Datasets to read Thai corpora.
🤗Datasets is a new hope for big success for Thai NLP. |
I wrote a BEST I corpus reader. It can read the BEST I corpus and returns either a list of words or CoNLL format for NER.
BEST I Corpus: nectec.or.th/corpus/index.php?league=pm
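As a rough illustration of what such a reader does (this is not the code in this PR), here is a sketch that turns one line into words and CoNLL-style tags. It assumes that words are delimited by `|` and named entities are wrapped in `<NE>...</NE>`. That markup, the function name `read_line`, and the `B-NE`/`O` tag names are assumptions made for this example.

```python
import re

# Assumed markup: "|" ends each word, <NE>...</NE> wraps a named entity.
NE_PATTERN = re.compile(r"<NE>(.*?)</NE>")


def read_line(line):
    """Return (word, tag) pairs for one line of segmented text."""
    pairs = []
    for token in line.strip().split("|"):
        if not token:
            # Skip the empty piece after a trailing "|".
            continue
        match = NE_PATTERN.fullmatch(token)
        if match:
            pairs.append((match.group(1), "B-NE"))
        else:
            pairs.append((token, "O"))
    return pairs


def to_conll(pairs):
    """Format (word, tag) pairs as tab-separated CoNLL-style lines."""
    return "\n".join(f"{word}\t{tag}" for word, tag in pairs)
```

The plain word list is then `[word for word, _ in pairs]`.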
TODO
Add Test
Add docs