Add PhayaThaiBERT engine with new features [WIP] #873
Merged
Changes from 3 commits
25 commits
1341a31  Merge pull request #2 from pavaris-pm/improve-pos-tag-transformers  (pavaris-pm)
38f71b5  Merge branch 'PyThaiNLP:dev' into dev  (pavaris-pm)
41d79c2  add phayathaibert core engine  (pavaris-pm)
cb9e27a  add data augmentation engine  (pavaris-pm)
473af52  update engine properties  (pavaris-pm)
0c3efd0  updae augmentation properties  (pavaris-pm)
245e99e  Merge branch 'PyThaiNLP:dev' into dev  (pavaris-pm)
dd2b834  change license  (pavaris-pm)
d1b9c99  add er engine  (pavaris-pm)
cbb7c8e  Update __init__.py  (bact)
b71ebda  Merge branch 'PyThaiNLP:dev' into dev  (pavaris-pm)
348dc1f  add documentation and credit model builder  (pavaris-pm)
c7b6900  Merge branch 'dev' into dev  (pavaris-pm)
a55168a  update pep8  (pavaris-pm)
536f493  resolve conflict  (pavaris-pm)
76b49c3  update pep8  (pavaris-pm)
22daf2d  update pep8  (pavaris-pm)
84de5c4  Update core.py: sort imports, remove duplicated lines  (bact)
a2fd4d3  Update phayathaibert.py: sort imports, remove duplicated lines  (bact)
7e24d3f  Reexport NamedEntityTagger  (bact)
826cfed  Fix minor types  (bact)
72e2bd5  Update __init__.py  (bact)
dec62c1  Use MAX_NUM_AUGS constant for max num_augs limit  (bact)
9999f90  Merge branch 'PyThaiNLP:dev' into dev  (pavaris-pm)
e7ef6ce  Update phayathaibert.py  (bact)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| # -*- coding: utf-8 -*- | ||
| # Copyright (C) 2016-2023 PyThaiNLP Project | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| __all__ = [ | ||
| "PartOfSpeechTagger", | ||
| "segment", | ||
| ] | ||
| | ||
| from pythainlp.phayathaibert.core import PartOfSpeechTagger, segment |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| # -*- coding: utf-8 -*- | ||
| # Copyright (C) 2016-2023 PyThaiNLP Project | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| from typing import List, Tuple, Union | ||
| import re | ||
| import warnings | ||
| from transformers import ( | ||
| CamembertTokenizer, | ||
| ) | ||
| | ||
| | ||
| _model_name = "clicknext/phayathaibert" | ||
| _tokenizer = CamembertTokenizer.from_pretrained(_model_name) | ||
| | ||
| | ||
| class PartOfSpeechTagger: | ||
| def __init__(self, model: str="lunarlist/pos_thai_phayathai") -> None: | ||
| # Load model directly | ||
| from transformers import ( | ||
| AutoTokenizer, | ||
| AutoModelForTokenClassification, | ||
| ) | ||
| self.tokenizer = AutoTokenizer.from_pretrained(model) | ||
| self.model = AutoModelForTokenClassification.from_pretrained(model) | ||
| | ||
| def get_tag(self, sentence: str, strategy: str='simple')->List[List[Tuple[str, str]]]: | ||
| from transformers import TokenClassificationPipeline | ||
| pipeline = TokenClassificationPipeline( | ||
| model=self.model, | ||
| tokenizer=self.tokenizer, | ||
| aggregation_strategy=strategy, | ||
| ) | ||
| outputs = pipeline(sentence) | ||
| word_tags = [[(tag['word'], tag['entity_group']) for tag in outputs]] | ||
| return word_tags | ||
| | ||
| def segment(sentence: str)->List[str]: | ||
| if not sentence or not isinstance(sentence, str): | ||
| return [] | ||
| | ||
| return _tokenizer.tokenize(sentence) |
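
The two entry points in the diff above can be exercised without downloading a model. The sketch below is a minimal illustration, not the PR's code. `sample_outputs` is hand-written, hypothetical data shaped like the aggregated output of a Hugging Face token-classification pipeline (dicts carrying `word` and `entity_group` keys), and the tag labels in it are assumed. `to_word_tags` and `segment_with` are hypothetical helper names that mirror the list comprehension in `get_tag` and the guard clause in `segment`.

```python
from typing import Callable, Dict, List, Tuple, Union

# Hypothetical pipeline output: one dict per aggregated span.
# Only "word" and "entity_group" are read below; "score" is shown for context.
sample_outputs: List[Dict[str, Union[str, float]]] = [
    {"entity_group": "NOUN", "word": "แมว", "score": 0.99},
    {"entity_group": "VERB", "word": "กิน", "score": 0.98},
    {"entity_group": "NOUN", "word": "ปลา", "score": 0.97},
]


def to_word_tags(
    outputs: List[Dict[str, Union[str, float]]],
) -> List[List[Tuple[str, str]]]:
    """Mirror get_tag: pair each word with its tag, wrapped in an outer list."""
    return [[(str(tag["word"]), str(tag["entity_group"])) for tag in outputs]]


def segment_with(tokenize: Callable[[str], List[str]], sentence: object) -> List[str]:
    """Mirror segment: return [] for empty or non-string input, else tokenize."""
    if not sentence or not isinstance(sentence, str):
        return []
    return tokenize(sentence)


word_tags = to_word_tags(sample_outputs)
print(word_tags)

# str.split stands in for the real subword tokenizer (an assumption for the demo).
print(segment_with(str.split, "hello world"))
print(segment_with(str.split, ""))
print(segment_with(str.split, None))
```

Note that `get_tag` wraps its result in an outer list, so a single sentence yields a list containing one list of `(word, tag)` tuples, and that `segment` returns an empty list rather than raising when given an empty or non-string argument.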