
Commit 932da3e

Author: minghui.qmh
Commit message: init commit
1 parent e7b2f6d commit 932da3e

File tree

2 files changed: +192 -0 lines


.pre-commit-config.yaml

+44 lines
```yaml
repos:
  - repo: https://gitlab.com/pycqa/flake8.git
    rev: 3.8.3
    hooks:
      - id: flake8
        additional_dependencies: [
          'flake8-docstrings==1.5.0'
        ]
  - repo: https://github.com/asottile/seed-isort-config
    rev: v2.2.0
    hooks:
      - id: seed-isort-config
  - repo: https://github.com/timothycrosley/isort
    rev: 4.3.21
    hooks:
      - id: isort
  - repo: https://github.com/pre-commit/mirrors-yapf
    rev: v0.30.0
    hooks:
      - id: yapf
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.1.0
    hooks:
      - id: trailing-whitespace
        args: ["--no-markdown-linebreak-ext"]
      - id: check-yaml
      - id: end-of-file-fixer
      - id: requirements-txt-fixer
      - id: double-quote-string-fixer
      - id: check-merge-conflict
      - id: mixed-line-ending
        args: ["--fix=lf"]
  - repo: https://github.com/myint/docformatter
    rev: v1.3.1
    hooks:
      - id: docformatter
        args: ["--in-place", "--wrap-descriptions", "0", "--wrap-summaries", "0"]
  - repo: https://github.com/executablebooks/mdformat
    rev: 0.7.1
    hooks:
      - id: mdformat
        additional_dependencies: [
          'mdformat-tables==0.4.0'
        ]
```
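These hooks only take effect once `pre-commit` itself is installed and registered in the local clone. A minimal setup sketch, written as a Python script for consistency with the rest of this commit (the `pre-commit install` / `pre-commit run --all-files` commands are the standard CLI; running from the repository root is an assumption):

```python
# Minimal sketch: install pre-commit, register the git hook defined by this config,
# and run all hooks once over the whole repository (run from the repo root).
import subprocess
import sys

subprocess.run([sys.executable, "-m", "pip", "install", "pre-commit"], check=True)
subprocess.run(["pre-commit", "install"], check=True)             # register the git pre-commit hook
subprocess.run(["pre-commit", "run", "--all-files"], check=True)  # one-off run on every tracked file
```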

README.md

+148 lines
# EasyNLP: An Easy-to-use NLP Toolkit

<p align="center">
    <br>
    <img src="https://cdn.nlark.com/yuque/0/2022/png/2480469/1649297935073-2fce0ec9-ec8c-490f-bc25-a8cf50d9918f.png" width="200"/>
    <br>
</p>

<p align="center"> <b> EasyNLP is designed to make it easy to develop NLP applications. </b> </p>

<p align="center">
    <a href="https://www.yuque.com/easyx/easynlp/iobg30">
        <img src="https://cdn.nlark.com/yuque/0/2020/svg/2480469/1600310258840-bfe6302e-d934-409d-917c-8eab455675c1.svg" height="24">
    </a>
    <a href="https://dsw-dev.data.aliyun.com/#/?fileUrl=https://raw.githubusercontent.com/alibaba/EasyTransfer/master/examples/easytransfer-quick_start.ipynb&fileName=easytransfer-quick_start.ipynb">
        <img src="https://cdn.nlark.com/yuque/0/2020/svg/2480469/1600310258886-ad896af5-b7da-4ca6-8369-4b14c23cb7a3.svg" height="24">
    </a>
</p>

EasyNLP is an easy-to-use NLP development and application toolkit in PyTorch, first released inside Alibaba in 2021. It is built with scalable distributed training strategies and supports a comprehensive suite of NLP algorithms for various NLP applications. EasyNLP integrates knowledge distillation and few-shot learning for landing large pre-trained models, and provides a unified framework for model training, inference, and deployment in real-world applications. It has powered more than 10 business units (BUs) and more than 20 business scenarios within the Alibaba Group. It is seamlessly integrated with Platform of AI (PAI) products, including PAI-DSW for development, PAI-DLC for cloud-native training, PAI-EAS for serving, and PAI-Designer for zero-code model training.

# Main Features

- **Easy to use and highly customizable:** In addition to providing easy-to-use, concise commands to call cutting-edge models, EasyNLP provides customizable modules such as AppZoo and ModelZoo to make it easy to build NLP applications. It is equipped with the PAI PyTorch distributed training framework TorchAccelerator to speed up distributed training.
- **Compatible with open-source libraries:** EasyNLP has APIs to support training models from Huggingface/Transformers with the PAI distributed framework. It also supports the pre-trained models in the EasyTransfer ModelZoo.
- **Knowledge-injected pre-training:** The PAI team has conducted extensive research on knowledge-injected pre-training and built a knowledge-injected model that won first place in the CCF knowledge pre-training competition. EasyNLP integrates these cutting-edge knowledge pre-trained models, including DKPLM and KGBERT.
- **Landing large pre-trained models:** EasyNLP provides few-shot learning capabilities, allowing users to fine-tune large models with only a few samples and still achieve good results. At the same time, it provides knowledge distillation functions to quickly distill large models into small, efficient ones to facilitate online deployment.
- **Seamless integration with PAI products:** It is seamlessly integrated with [Platform of AI (PAI)](https://www.aliyun.com/product/bigdata/product/learn) products, including PAI-DSW for development, PAI-DLC for cloud-native training, PAI-EAS for serving, and PAI-Designer for zero-code model training.
# Installation

You can either install EasyNLP from pip:

```bash
$ pip install pai-easynlp  # to be released
```

or set it up from source:

```bash
$ git clone https://github.com/alibaba/EasyNLP.git
$ cd EasyNLP
$ python setup.py install
```

This repo is tested on Python 3.6 and PyTorch >= 1.8.
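After installation, a quick sanity check is to confirm the PyTorch version and that the package imports; a minimal sketch (the `easynlp.core` import path follows the Quick Start example below):

```python
# Minimal post-install sanity check.
import torch

major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (1, 8), f"EasyNLP is tested with PyTorch >= 1.8, found {torch.__version__}"

from easynlp.core import Trainer  # should import without error after installation
print("PyTorch", torch.__version__, "- EasyNLP import OK")
```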
# Quick Start

Now let's show how to use just a few lines of code to build a text classification model based on BERT.

```python
from easynlp.core import Trainer
from easynlp.appzoo import ClassificationDataset, SequenceClassification
from easynlp.utils import initialize_easynlp

# Parse command-line arguments (data paths, input schema, hyper-parameters, ...).
args = initialize_easynlp()

# Build the training dataset from a TSV file according to the declared input schema.
train_dataset = ClassificationDataset(
    pretrained_model_name_or_path=args.pretrained_model_name_or_path,
    data_file=args.tables,
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    is_training=True)

# Initialize a BERT-based sequence classification model and train it.
model = SequenceClassification(pretrained_model_name_or_path=args.pretrained_model_name_or_path)
Trainer(model=model, train_dataset=train_dataset).train()
```
Then save the snippet above as `main.py` and run it:

```bash
python main.py \
  --mode train \
  --tables=train_toy.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./tmp/ \
  --epoch_num=1 \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-tiny-uncased'
```
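The command above expects `train_toy.tsv` to follow the declared `--input_schema` (columns `label`, `sid1`, `sid2`, `sent1`, `sent2`). A minimal sketch that writes such a toy file, assuming tab-separated values with no header row (the example rows are invented):

```python
# Write a tiny tab-separated training file matching the declared input schema:
# label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1
rows = [
    ("0", "id-001", "id-101", "the plot is dull and predictable", "unused"),
    ("1", "id-002", "id-102", "a genuinely moving and well-acted story", "unused"),
]
with open("train_toy.tsv", "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")
```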
You can also use AppZoo Command Line Tools to quickly train an App model. Take text classification on the SST-2 dataset as an example: first download [train.tsv](http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/classification/train.tsv) and [dev.tsv](http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/classification/dev.tsv).
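If you prefer to fetch the two files programmatically, here is a minimal sketch using plain `urllib` and the URLs above (saving into the current working directory is an assumption):

```python
# Download the SST-2 style train/dev files referenced in this README.
import urllib.request

base = "http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/classification/"
for name in ("train.tsv", "dev.tsv"):
    urllib.request.urlretrieve(base + name, name)
    print("downloaded", name)
```

With both files in place, start training: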
```bash
$ easynlp \
  --mode=train \
  --worker_gpu=1 \
  --tables=train.tsv,dev.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./classification_model \
  --epoch_num=1 \
  --sequence_length=128 \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-small-uncased'
```
And then predict:

```bash
$ easynlp \
  --mode=predict \
  --tables=dev.tsv \
  --outputs=dev.pred.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --output_schema=predictions,probabilities,logits,output \
  --append_cols=label \
  --first_sequence=sent1 \
  --checkpoint_path=./classification_model \
  --app_name=text_classify
```
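The output file `dev.pred.tsv` contains the columns declared by `--output_schema`, with the `label` column appended via `--append_cols`. A minimal sketch for inspecting it, assuming tab-separated output with the columns in that order and no header row:

```python
# Peek at the first few prediction rows written by the command above.
import csv

columns = ["predictions", "probabilities", "logits", "output", "label"]
with open("dev.pred.tsv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.reader(f, delimiter="\t")):
        print(dict(zip(columns, row)))
        if i >= 4:
            break
```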
To learn more about the usage of AppZoo, please refer to our [documentation](https://www.yuque.com/easyx/easynlp/psm6fr).

# Tutorials

- [AppZoo - Text Vectorization](https://www.yuque.com/easyx/easynlp/ts4czl)
- [AppZoo - Text Classification/Matching](https://www.yuque.com/easyx/easynlp/vgbopy)
- [AppZoo - Sequence Labeling](https://www.yuque.com/easyx/easynlp/vgbopy)
- [AppZoo - GEEP Text Classification](https://www.yuque.com/easyx/easynlp/vgbopy)
- [Basic Pre-training Practice](https://www.yuque.com/easyx/easynlp/vgbopy)
- [Knowledge Pre-training Practice](https://www.yuque.com/easyx/easynlp/vgbopy)
- [Knowledge Distillation Practice](https://www.yuque.com/easyx/easynlp/vgbopy)
- [Few-shot Learning Practice](https://www.yuque.com/easyx/easynlp/vgbopy)
- API docs: [http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/easynlp/easynlp_docs/html/index.html](http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/easynlp/easynlp_docs/html/index.html)
# Contact Us

Scan the following QR code to join the DingTalk discussion group. Group discussions are mostly in Chinese, but English is also welcome.

<img src="https://cdn.nlark.com/yuque/0/2020/png/2480469/1600310258842-d7121051-32f1-494b-a7a5-a35ede74b6c4.png#align=left&display=inline&height=352&margin=%5Bobject%20Object%5D&name=image.png&originHeight=1178&originWidth=1016&size=312154&status=done&style=none&width=304" width="300"/>

# Reference

- EasyTransfer: https://github.com/alibaba/EasyTransfer
- DKPLM: https://arxiv.org/abs/2112.01047
- MetaKD: https://arxiv.org/abs/2012.01266
- CP-Tuning: https://arxiv.org/abs/2204.00166
- FashionBERT: https://arxiv.org/abs/2005.09801
