Environment: Python 3, TensorFlow 1.2
Modify the input and output paths in CreateData.py (input is the path to the file train.txt), then run
python CreateData.py
Split the resulting output file into a training set and a validation set at a 90%:10% ratio, rename them train_data and test_data, and put them in the data_path folder.
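A minimal sketch of this split (assuming, as in this project's data format, one "character tag" pair per line with blank lines separating sentences; the file names are the ones from the step above):

```python
# Split CreateData.py's output 90%/10% into train_data and test_data.
# Assumes CoNLL-style lines ("char tag") with blank lines between sentences.
import random

with open('output', encoding='utf-8') as f:
    sentences = f.read().strip().split('\n\n')  # one block per sentence

random.seed(0)                                  # reproducible split
random.shuffle(sentences)
cut = int(len(sentences) * 0.9)

with open('data_path/train_data', 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(sentences[:cut]) + '\n')
with open('data_path/test_data', 'w', encoding='utf-8') as f:
    f.write('\n\n'.join(sentences[cut:]) + '\n')
```

Then start training: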
python main.py --mode=train
You can also override the hyperparameters defined in main.py, e.g.
python main.py --mode=train --epoch=30
In main.py, the branch
elif args.mode == 'demo':
contains two alternative usage modes.
Run
python main.py --mode=demo --demo_model=1543146557
For now, the two usage modes are switched by commenting: to use the latter, comment out the former and uncomment the latter. The former lets you type in a sentence directly for testing and prints the output; the latter runs prediction over a whole file and writes an output file. When using the latter, remember to edit the input and output paths inside it.
Finally, replace every 0 in the resulting output file with O.
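A one-off sketch of that replacement (the output path is an assumption; this follows the step above verbatim, so if your text column can itself contain the digit 0, restrict the replacement to the tag column instead):

```python
# Replace every "0" in the prediction output with the tag "O".
path = 'output.txt'   # assumed path; point it at your actual output file
with open(path, encoding='utf-8') as f:
    text = f.read()
with open(path, 'w', encoding='utf-8') as f:
    f.write(text.replace('0', 'O'))
```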
Final result: F1 score = 0.885836
Original project: https://github.com/Determined22/zh-NER-TF
Below is the original project's README.md.
This repository includes the code for building a very simple character-based BiLSTM-CRF sequence labelling model for the Chinese Named Entity Recognition task. Its goal is to recognize three types of named entities: PERSON, LOCATION and ORGANIZATION.
This code works on Python 3 & TensorFlow 1.2, and the repository https://github.com/guillaumegenthial/sequence_tagging gave me much help.
This model is similar to the models proposed in papers [1] and [2]. Its structure consists of the three layers described below.
For one Chinese sentence, each character in the sentence has (or will be assigned) a tag from the set {O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG}.
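For example (a standard BIO illustration, not output from this project):

```python
# "我在北京" ("I am in Beijing"): 北京 (Beijing) is a LOCATION entity.
sentence = ['我', '在', '北',     '京']
tags     = ['O',  'O',  'B-LOC', 'I-LOC']
```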
The first layer, the look-up layer, transforms each character from a one-hot vector into a character embedding. In this code the embedding matrix is initialized randomly, which is admittedly simple; language knowledge could be added later. For example, do tokenization and use pre-trained word-level embeddings, so that every character in a token is initialized with that token's word embedding. In addition, character embeddings can be obtained by combining low-level features (see section 4.1 of paper [2] and section 3.3 of paper [3] for details).
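A minimal TF 1.2 sketch of this look-up layer (the vocabulary size and embedding dimension are placeholders, not the project's settings):

```python
import tensorflow as tf

vocab_size, embedding_dim = 5000, 300   # placeholder sizes

# character ids, shape [batch_size, max_seq_len]
char_ids = tf.placeholder(tf.int32, shape=[None, None], name='char_ids')

# randomly initialized embedding matrix, trained jointly with the model
embeddings = tf.Variable(
    tf.random_uniform([vocab_size, embedding_dim], -0.25, 0.25),
    name='char_embeddings')

# look-up layer: one-hot ids -> dense character embeddings
char_embeddings = tf.nn.embedding_lookup(embeddings, char_ids)
```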
The second layer, the BiLSTM layer, can efficiently use both past and future input information and extracts features automatically.
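Continuing the sketch in TF 1.2 terms (`char_embeddings` comes from the look-up block above; the hidden size is a placeholder):

```python
hidden_size = 300                       # placeholder size
# true length of each sequence, so padded positions are ignored
sequence_lengths = tf.placeholder(tf.int32, shape=[None])

cell_fw = tf.contrib.rnn.LSTMCell(hidden_size)
cell_bw = tf.contrib.rnn.LSTMCell(hidden_size)
(output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, char_embeddings,
    sequence_length=sequence_lengths, dtype=tf.float32)

# concatenate past (forward) and future (backward) features
bilstm_output = tf.concat([output_fw, output_bw], axis=-1)
```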
The third layer, the CRF layer, assigns a tag to each character in the sentence. If we used a Softmax layer for labelling, we might get ungrammatical tag sequences, because Softmax labels each position independently: we know that 'I-LOC' cannot follow 'B-PER', but Softmax doesn't. Unlike Softmax, the CRF layer can use sentence-level tag information and model the transition behavior between every pair of tags.
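And the CRF layer on top, again as a sketch (`bilstm_output` and `sequence_lengths` come from the previous block):

```python
num_tags = 7                            # {O, B/I-PER, B/I-LOC, B/I-ORG}
labels = tf.placeholder(tf.int32, shape=[None, None])

# project BiLSTM features to per-tag unary scores
logits = tf.layers.dense(bilstm_output, num_tags)

# transition_params models the transition behavior between tag pairs
log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    logits, labels, sequence_lengths)
loss = -tf.reduce_mean(log_likelihood)
```

At inference time, `tf.contrib.crf.viterbi_decode` recovers the highest-scoring tag path from the scores and `transition_params`.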
|  | #sentence | #PER | #LOC | #ORG |
|---|---|---|---|---|
| train | 46364 | 17615 | 36517 | 20571 |
| test | 4365 | 1973 | 2877 | 1331 |
It looks like a portion of the MSRA corpus.
python main.py --mode=train
python main.py --mode=test --demo_model=1521112368
Please set the --demo_model parameter to the model you want to test; 1521112368 is a model trained by me.
An official evaluation tool: here (click 'Instructions')
My test performance:
| P | R | F | F (PER) | F (LOC) | F (ORG) |
|---|---|---|---|---|---|
| 0.8945 | 0.8752 | 0.8847 | 0.8688 | 0.9118 | 0.8515 |
python main.py --mode=demo --demo_model=1521112368
You can input one Chinese sentence and the model will return the recognition result.
[1] Bidirectional LSTM-CRF Models for Sequence Tagging
[2] Neural Architectures for Named Entity Recognition
[3] Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition

