Lang-8 Learner Corpora

We created corpora of language learners' texts from Lang-8, a language exchange social networking service.

We use these corpora solely for research and educational purposes (ask us for details). If you would like to use them in a commercial product, please contact the Lang-8 support desk directly for further information.

Lang-8 Learner Corpora list of URLs

This list contains the URLs of learners' blog entries as of December 2010. It covers 334,379 multilingual entries written by 59,455 active users. We used the Japanese portion of the corpus for our IJCNLP 2011 paper.

Lang-8 Learner Corpora (raw format)

The corpora contain all 80 languages supported by Lang-8. The top 20 languages, counted by number of entries, are:

580549 total
237843 English
185991 Japanese
 45289 Unknown
 28154 Mandarin
 21779 Korean
 12606 Spanish
 12392 French
 11111 German
  4069 Russian
  4052 Traditional Chinese
  3339 Italian
  1135 Portuguese(Brazil)
   944 Swedish
   906 Turkish
   892 Indonesian
   803 Thai
   737 Arabic
   712 Finnish
   655 Vietnamese
   588 Dutch
   574 Afrikaans
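
The counts above can be turned into per-language shares with a few lines of Python. This is only a sketch: the function names `parse_counts` and `shares` are hypothetical, and the embedded sample repeats just the first rows of the list rather than reading any distributed file.

```python
# First rows of the list above, copied verbatim (count, then language name).
COUNTS = """\
580549 total
237843 English
185991 Japanese
 45289 Unknown
"""


def parse_counts(text):
    """Map each label to its entry count, one 'count label' pair per line."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first run of whitespace only, so labels such as
        # "Traditional Chinese" stay intact.
        number, name = line.split(None, 1)
        counts[name.strip()] = int(number)
    return counts


def shares(counts):
    """Return each language's fraction of the 'total' row."""
    total = counts["total"]
    return {name: n / total for name, n in counts.items() if name != "total"}


if __name__ == "__main__":
    for name, share in shares(parse_counts(COUNTS)).items():
        print(f"{name}: {share:.1%}")
```

Splitting on the first whitespace run also tolerates the right-aligned counts used in the list.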

Lang-8 Corpus of Learner English

This corpus contains English learners' texts extracted from Lang-8. It has 100,051 English entries written by 29,012 active users. We also include the automatic tense/aspect annotation used in our ACL 2012 paper.

Lang-8 Corpus of Romanized Learner Japanese

This corpus contains romanized Japanese learner texts extracted from Lang-8. It has 5,000 entries extracted from 350,000 Japanese sentences.

Lang-8 Corpus of Learner Japanese

This corpus contains sentence-aligned Japanese learner texts extracted from Lang-8. It has 2,246,059 parallel sentences. Deletion ([sline]...[/sline]) and color tags ([f-blue]...[/f-blue], [f-red]...[/f-red]) are processed as described in the IJCNLP 2011 paper.
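
For readers who want to handle the same tags in the raw data, the following Python sketch shows one plausible treatment. It is an assumption, not the procedure from the IJCNLP 2011 paper: it drops text marked as deleted with [sline]...[/sline] and keeps the text inside the color tags while removing the tags themselves. The function name `strip_correction_tags` is hypothetical.

```python
import re

# Deleted spans: the tags and everything between them (non-greedy match).
SLINE = re.compile(r"\[sline\].*?\[/sline\]", re.DOTALL)
# Color tags: only the opening and closing markers, not the enclosed text.
COLOR = re.compile(r"\[/?f-(?:blue|red)\]")


def strip_correction_tags(text):
    """Assumed handling: drop deleted spans, unwrap colored spans."""
    text = SLINE.sub("", text)   # remove deleted text together with its tags
    text = COLOR.sub("", text)   # keep colored text, drop the tag markers
    return text


if __name__ == "__main__":
    sample = "I [sline]goed[/sline][f-red]went[/f-red] home"
    print(strip_correction_tags(sample))
```

Only the three tags named above are covered. Any other markup in the raw data would pass through unchanged.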



Last Update: 2013-05-20

Tomoya Mizumoto
Nara Institute of Science and Technology, Japan