Commit 06d3e09

update README.md
1 parent d7118f8 commit 06d3e09

1 file changed: +36 -17 lines changed

README.md (+36 -17)
@@ -1,6 +1,6 @@
 # Kyoto University Web Document Leads Corpus
 
-### Overview
+## Overview
 
 This is a Japanese text corpus that consists of lead three sentences
 of web documents with various linguistic annotations. By collecting
@@ -17,7 +17,7 @@ analyses of the morphological analyzer JUMAN and the dependency, case
 structure and anaphora analyzer KNP. The discourse annotations were
 given by two types of annotators; experts and crowd workers.
 
-### Notes
+## Notes
 
 This corpus consists of linguistically annotated Web documents that
 have been made publicly available on the Web at some time. The corpus
@@ -32,7 +32,7 @@ the addition of source information or deletion of these documents, we will
 update the corpus and newly release it. In this case, please delete
 the downloaded old version and replace it with the new version.
 
-### Notes on annotation guidelines
+## Notes on annotation guidelines
 
 The annotation guidelines for this corpus are written in the manuals
 found in the "doc" directory. The guidelines for morphology and
@@ -42,17 +42,25 @@ rel_guideline.pdf, and those for discourse relations are described in
 disc_guideline.pdf. The guidelines for named entities are available on
 the IREX website (<http://nlp.cs.nyu.edu/irex/>).
 
-### Distributed files
+## Distributed files
 
-* `knp/`: the corpus annotated with morphology, named entities, dependencies, predicate-argument structures, and coreferences
+* `knp/`: the corpus annotated with morphology, named entities, dependencies, predicate-argument structures, and
+coreferences
 * `disc/`: the corpus annotated with discourse relations
 * `org/`: the raw corpus
 * `doc/`: annotation guidelines
 * `id/`: document id files providing train/test split
 
-Note that the encoding of the corpus data is UTF-8.
+## Statistics
 
-### Format of the corpus annotated with annotations of morphology, named entities, dependencies, predicate-argument structures, and coreferences
+| | # of documents | # of sentences | # of morphemes | # of named entities | # of predicates | # of coreferring mentions |
+|-------|---------------:|---------------:|---------------:|--------------------:|----------------:|--------------------------:|
+| train | 3,915 | 11,745 | 194,490 | 6,267 | 51,702 | 16,079 |
+| dev | 512 | 1,536 | 22,625 | 974 | 6,139 | 1,641 |
+| test | 700 | 2,100 | 35,869 | 1,122 | 9,549 | 3,074 |
+| total | 5,127 | 15,381 | 252,984 | 8,363 | 67,390 | 20,794 |
+
+## Format of the corpus annotated with annotations of morphology, named entities, dependencies, predicate-argument structures, and coreferences
 
 Annotations of this corpus are given in the following format.
 
@@ -117,7 +125,7 @@ respectively. If a basic phrase has multiple tags of the same type, a
 "?." The details of these attributes are described in the annotation
 guidelines (rel_guideline.pdf).
 
-### Format of the corpus annotated with discourse relations
+## Format of the corpus annotated with discourse relations
 
 In this corpus, a clause pair is given a discourse type and its votes as follows.
 
@@ -142,17 +150,28 @@ by experts, the discourse direction is annotated; if it is reverse order,
 methods and discourse relations are described in [Kawahara et al., 2014]
 and the annotation guidelines (disc_guideline.pdf).
 
-### References
+## References
 
-* Masatsugu Hangyo, Daisuke Kawahara and Sadao Kurohashi. Building a Diverse Document Leads Corpus Annotated with Semantic Relations, In Proceedings of the 26th Pacific Asia Conference on Language Information and Computing, pp.535-544, 2012. <http://www.aclweb.org/anthology/Y/Y12/Y12-1058.pdf>
-* 萩行正嗣, 河原大輔, 黒橋禎夫. 多様な文書の書き始めに対する意味関係タグ付きコーパスの構築とその分析, 自然言語処理, Vol.21, No.2, pp.213-248, 2014. <https://doi.org/10.5715/jnlp.21.213>
-* Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi and Manabu Sassano. Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing, In Proceedings of the 25th International Conference on Computational Linguistics, pp.269-278, 2014. <http://www.aclweb.org/anthology/C/C14/C14-1027.pdf>
-* 岸本裕大, 村脇有吾, 河原大輔, 黒橋禎夫. 日本語談話関係解析:タスク設計・談話標識の自動認識・ コーパスアノテーション, 自然言語処理, Vol.27, No.4, pp.889-931, 2020. <https://doi.org/10.5715/jnlp.27.889>
+* Masatsugu Hangyo, Daisuke Kawahara and Sadao Kurohashi. Building a Diverse Document Leads Corpus Annotated with
+Semantic Relations, In Proceedings of the 26th Pacific Asia Conference on Language Information and Computing,
+pp.535-544, 2012. <http://www.aclweb.org/anthology/Y/Y12/Y12-1058.pdf>
+* 萩行正嗣, 河原大輔, 黒橋禎夫. 多様な文書の書き始めに対する意味関係タグ付きコーパスの構築とその分析, 自然言語処理,
+Vol.21, No.2, pp.213-248, 2014. <https://doi.org/10.5715/jnlp.21.213>
+* Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi and Manabu Sassano. Rapid
+Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing, In Proceedings of the 25th
+International Conference on Computational Linguistics, pp.269-278,
+2014. <http://www.aclweb.org/anthology/C/C14/C14-1027.pdf>
+* 岸本裕大, 村脇有吾, 河原大輔, 黒橋禎夫. 日本語談話関係解析:タスク設計・談話標識の自動認識・ コーパスアノテーション,
+自然言語処理, Vol.27, No.4, pp.889-931, 2020. <https://doi.org/10.5715/jnlp.27.889>
 
-### Acknowledgment
+## Acknowledgment
 
-The creation of this corpus was supported by JSPS KAKENHI Grant Number 24300053 and JST CREST "Advanced Core Technologies for Big Data Integration." The discourse annotations were acquired by crowdsourcing under the support of Yahoo! Japan Corporation. We deeply appreciate their support.
+The creation of this corpus was supported by JSPS KAKENHI Grant Number 24300053 and JST CREST "Advanced Core
+Technologies for Big Data Integration." The discourse annotations were acquired by crowdsourcing under the support of
+Yahoo! Japan Corporation. We deeply appreciate their support.
 
-### Contact
+## Contact
 
-If you have any questions or problems with this corpus, please send an email to nl-resource at nlp.ist.i.kyoto-u.ac.jp. If you have a request to add source information or to delete a document in the corpus, please send an email to this mail address.
+If you have any questions or problems with this corpus, please send an email to nl-resource at nlp.ist.i.kyoto-u.ac.jp.
+If you have a request to add source information or to delete a document in the corpus, please send an email to this mail
+address.
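
The Statistics table added in this commit can be sanity-checked with a few lines of Python. The sketch below is not part of the repository: the figures are copied by hand from the diff, and the names `SPLITS`, `TOTAL`, and `column_sums` are hypothetical. It checks that the train, dev, and test rows add up to the total row, and that each split has exactly three sentences per document, which is consistent with the README's description of "lead three sentences".

```python
# Figures copied by hand from the Statistics table in this commit's README diff.
# Column order: documents, sentences, morphemes, named entities,
# predicates, coreferring mentions.
SPLITS = {
    "train": (3915, 11745, 194490, 6267, 51702, 16079),
    "dev": (512, 1536, 22625, 974, 6139, 1641),
    "test": (700, 2100, 35869, 1122, 9549, 3074),
}
TOTAL = (5127, 15381, 252984, 8363, 67390, 20794)


def column_sums(splits):
    """Sum each column across all splits."""
    return tuple(sum(column) for column in zip(*splits.values()))


if __name__ == "__main__":
    # The per-split rows should add up to the "total" row.
    print("totals match:", column_sums(SPLITS) == TOTAL)
    # Each document should contribute exactly three sentences.
    for name, row in SPLITS.items():
        documents, sentences = row[0], row[1]
        print(name, "three sentences per document:", sentences == 3 * documents)
```

If a later release changes the counts, only the two constants need updating.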
