Data Format

For POS tagging and dependency parsing, our data follows the CoNLL-X format. Here is an example:
1	No	_	RB	RB	_	7	discourse	_	_
2	,	_	,	,	_	7	punct	_	_
3	it	_	PR	PRP	_	7	nsubj	_	_
4	was	_	VB	VBD	_	7	cop	_	_
5	n't	_	RB	RB	_	7	neg	_	_
6	Black	_	NN	NNP	_	7	nn	_	_
7	Monday	_	NN	NNP	_	0	root	_	_
8	.	_	.	.	_	7	punct	_	_
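
For reference, the ten tab-separated CoNLL-X columns are ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD and PDEPREL. Below is a minimal sketch of reading such a file into per-sentence token lists. The function name `read_conllx` and the dict layout are hypothetical, chosen only for illustration, and are not this repository's reader.

```python
# Column names follow the CoNLL-X shared task description.
CONLLX_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(lines):
    """Yield one sentence at a time as a list of per-token dicts.

    Hypothetical helper: `lines` is any iterable of strings, e.g. an open file.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLX_FIELDS):
            raise ValueError(
                "expected 10 tab-separated columns, got %d" % len(columns))
        token = dict(zip(CONLLX_FIELDS, columns))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 means the token is the root
        sentence.append(token)
    if sentence:
        yield sentence


sample = [
    "6\tBlack\t_\tNN\tNNP\t_\t7\tnn\t_\t_",
    "7\tMonday\t_\tNN\tNNP\t_\t0\troot\t_\t_",
    "",
]
for sent in read_conllx(sample):
    for tok in sent:
        print(tok["id"], tok["form"], tok["postag"], tok["head"], tok["deprel"])
```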

For NER, our data format is similar to the one used in the CoNLL 2003 shared task, with one small difference. Here is an example:
1 EU NNP I-NP I-ORG
2 rejects VBZ I-VP O
3 German JJ I-NP I-MISC
4 call NN I-NP O
5 to TO I-VP O
6 boycott VB I-VP O
7 British JJ I-NP I-MISC
8 lamb NN I-NP O
9 . . O O

1 Peter NNP I-NP I-PER
2 Blackburn NNP I-NP I-PER
3 BRUSSELS NNP I-NP I-LOC
4 1996-08-22 CD I-NP O
...
where we add a column at the beginning to store the index of each word.
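
If your copy of the data lacks the index column, something like the following could add it. This is only a sketch: `add_index_column` is a hypothetical helper, it assumes sentences are separated by blank lines, and it treats every non-blank line (including any `-DOCSTART-` line) as an ordinary token.

```python
def add_index_column(lines):
    """Prefix each token line with its 1-based position in the sentence."""
    index = 0
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: sentence boundary, so restart the numbering.
            index = 0
            yield ""
            continue
        index += 1
        yield "%d %s" % (index, line)


for out_line in add_index_column(["EU NNP I-NP I-ORG", "rejects VBZ I-VP O"]):
    print(out_line)
```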

The original CoNLL-03 data can be downloaded here:
https://github.com/glample/tagger/tree/master/dataset

Make sure to convert the original tagging schema to the standard BIO (or the more advanced BIOES).
Here is the code I used to convert it to BIO:

```python
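def transform(ifile, ofile):
	"""Rewrite the last column of ifile so every chunk starts with a B- tag."""
	with open(ifile, 'r') as reader, open(ofile, 'w') as writer:
		prev = 'O'  # original (unconverted) tag of the previous token
		for line in reader:
			line = line.strip()
			if len(line) == 0:
				# A blank line ends the sentence, so reset the previous tag.
				prev = 'O'
				writer.write('\n')
				continue

			tokens = line.split()
			label = tokens[-1]
			# An I- tag opens a new chunk when it follows 'O' or a chunk of
			# a different type, so rewrite it as B-. Tags that are already
			# B- or that continue the current chunk are left unchanged.
			if label != 'O' and label != prev:
				if prev == 'O' or label[2:] != prev[2:]:
					label = 'B-' + label[2:]
			writer.write(" ".join(tokens[:-1]) + " " + label)
			writer.write('\n')
			prev = tokens[-1]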
```
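
For BIOES, a further pass over the BIO tags of each sentence is needed. Here is a minimal sketch, assuming the input is already valid BIO; `bio_to_bioes` is a hypothetical name and this is not code from the repository.

```python
def bio_to_bioes(tags):
    """Convert the BIO tags of one sentence to BIOES."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, entity = tag[0], tag[2:]
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The chunk continues only if the next tag is I- of the same type.
        continues = next_tag == "I-" + entity
        if prefix == "B":
            # A chunk of length one becomes S-, otherwise B- is kept.
            bioes.append(tag if continues else "S-" + entity)
        else:
            # The last token of a longer chunk becomes E-.
            bioes.append(tag if continues else "E-" + entity)
    return bioes


print(bio_to_bioes(["B-ORG", "O", "B-MISC", "O", "B-PER", "I-PER"]))
# → ['S-ORG', 'O', 'S-MISC', 'O', 'B-PER', 'E-PER']
```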
