| 2010-08-23(Mon) | Changed converted format for ease of use. Instead of XML format, we adopted KyotoCorpus- and CabochaOutput-style formats. Added tags of anaphoric relations of determiners and pronouns. Added tags of noun classes of event nouns. |
| 2010-08-23(Mon) | Added tags of anaphoric relations of determiners and pronouns. |
| 2006-11-20(Mon) | Fixed crossing elements in XML. |
| 2006-10-17(Tue) | Added instructions for converting raw data into XML format to the README. |
| 2006-10-06(Fri) | First beta release of the corpus tagged with predicate-argument and coreference relations. |
Fill out the form below and press the submit button; the download starts
immediately.
We annotated the same portion of the Mainichi Shimbun Newspaper that is used for the Kyoto Text Corpus. It contains all articles from 1 January 1995 to 17 January 1995 (ca. 20,000 sentences) and all editorial articles from January to December (ca. 20,000 sentences). We annotated the corpus with predicate-argument relations (surface case: nominative, accusative, and dative), event nouns and their relations (surface case: nominative, accusative, and dative), and coreference information.
References
- Ryu Iida, Mamoru Komachi, Kentaro Inui and Yuji Matsumoto. Annotating a Japanese Text Corpus with Predicate-Argument and Coreference Relations. ACL Workshop "Linguistic Annotation Workshop", pp. 132-139, 2007.
Instructions for converting raw data into KyotoCorpus- or CabochaOutput-style format
We distribute only the tag information and its offsets. To obtain KyotoCorpus- or CabochaOutput-formatted data, you need:
- CD-ROM of Mainichi Shimbun Newspaper of 1995
- Perl 5.8.6 or higher
- Kyoto University Text Corpus 4.0
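Before starting, it may help to confirm that the installed Perl meets the stated minimum. The following is a minimal sketch, not part of the distributed tools; the helper name `perl_ok` is hypothetical. It relies on Perl's built-in `$]` variable, which holds the interpreter version as a number (5.008006 for 5.8.6).

```shell
#!/bin/sh
# perl_ok: succeed if the perl on PATH is version 5.8.6 or higher.
# (Hypothetical helper for illustration; not shipped with the corpus.)
perl_ok() {
    # $] is the numeric Perl version, e.g. 5.008006 for 5.8.6.
    perl -e 'exit($] >= 5.008006 ? 0 : 1)'
}

if perl_ok; then
    echo "perl: OK"
else
    echo "perl: version 5.8.6 or higher is required" >&2
fi
```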
The instructions are as follows:
- Download KyotoCorpus4.0.tar.gz from here
- Mount a CD-ROM of Mainichi Shimbun Newspaper of 1995
- Run the following commands:
% tar xvfz KyotoCorpus4.0.tar.gz
% KyotoCorpus4.0/auto_conv -d /mnt/cdrom
(Replace /mnt/cdrom with the mount point of the CD-ROM.)
- To create a KyotoCorpus-style data set, run the following commands:
% tar xvfz NTC_1.5.tgz
% NTC_1.5/auto_conv -k -d KyotoCorpus4.0/dat/syn/
(2,927 files will be generated in NTC_1.5/data/ntc/knp/)
- To create a CabochaOutput-style data set, run the following commands:
% tar xvfz NTC_1.5.tgz
% NTC_1.5/auto_conv -c -d KyotoCorpus4.0/dat/syn/
(2,927 files will be generated in NTC_1.5/data/ntc/ipa/)
You will need a UNIX system to convert the NAIST Text Corpus into the above formats.
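After conversion, the stated file counts can be used as a quick sanity check. The sketch below is an illustration only: `count_files` and `check_output` are hypothetical helper names, and it assumes that the generated files sit directly inside the output directory rather than in subdirectories.

```shell
#!/bin/sh
# count_files DIR: print the number of regular files directly inside DIR.
# Assumption: converted files are not nested in subdirectories.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# check_output DIR EXPECTED: succeed only if DIR holds exactly EXPECTED files.
check_output() {
    n=$(count_files "$1")
    if [ "$n" -eq "$2" ]; then
        echo "$1: $n files, as expected"
    else
        echo "$1: found $n files, expected $2" >&2
        return 1
    fi
}

# Usage, after running auto_conv with -k or -c respectively:
#   check_output NTC_1.5/data/ntc/knp 2927
#   check_output NTC_1.5/data/ntc/ipa 2927
```

A mismatch usually suggests that the conversion stopped early or that the path given to -d was wrong; rerunning auto_conv and checking its messages is the natural next step.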
Please send any questions, suggestions and comments to ryu-i@cl.cs.titech.ac.jp.