dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Hmeljak Sangawa, Kristina |
dc.contributor.author | Kawamura, Yoshiko |
dc.date.accessioned | 2015-08-05T12:53:31Z |
dc.date.available | 2015-08-05T12:53:31Z |
dc.date.issued | 2008-11-14 |
dc.identifier.uri | http://hdl.handle.net/11356/1047 |
dc.description | The corpus contains over 300 million words, with words and sentences annotated for difficulty level. Words are assigned difficulty levels according to the Japanese Language Proficiency Test Content Specifications (2004). The difficulty level of each sentence is computed using various heuristics based on the difficulty levels of its words, sentence length, etc. We distinguish 5 difficulty levels, from L0 (very difficult) to L4 (very easy). The corpus was collected from the Web using WaCkY tools, then part-of-speech tagged and lemmatised with Chasen. The Japanese Chasen tags have also been converted to English-based tags. The corpora are made available in vertical format. The structural attributes are <text> and <s> (sentence). Each text gives its @url and @domain. Sentences have the @level attribute, which describes their difficulty level. The positional attributes are: 1. token, as it appears in the text; 2. lemma of the word; 3. Chasen tag, translated to English; 4. original Chasen tag in Japanese; 5. difficulty level of the word. The complete corpus is also split into sub-corpora of sentences with the same difficulty level. |
dc.language.iso | jpn |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/jaslo/index-en.html#jpwac |
dc.subject | difficulty level |
dc.subject | teaching corpus |
dc.subject | TEI |
dc.title | Japanese web corpus with difficulty levels jpWaC-L 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | Japan Foundation Japanese-Language Education Fellowship (Kristina Hmeljak Sangawa - March to July 2005) Other |
sponsor | ARRS/JSPS Slovenia-Japan bilateral project (Tomaž Erjavec - November 2008) Other |
size.info | 409030315 tokens |
files.count | 6 |
files.size | 1712910197 |
featuredService.kontext | Search jpWaC-L|https://www.clarin.si/kontext/first_form?corpname=jpwac_jp |
featuredService.kontext | Search jpWaC-L0|https://www.clarin.si/kontext/first_form?corpname=jpwac_l0 |
featuredService.kontext | Search jpWaC-L1|https://www.clarin.si/kontext/first_form?corpname=jpwac_l1 |
featuredService.kontext | Search jpWaC-L2|https://www.clarin.si/kontext/first_form?corpname=jpwac_l2 |
featuredService.kontext | Search jpWaC-L3|https://www.clarin.si/kontext/first_form?corpname=jpwac_l3 |
featuredService.kontext | Search jpWaC-L4|https://www.clarin.si/kontext/first_form?corpname=jpwac_l4 |
featuredService.noske | Search jpWaC-L|https://www.clarin.si/ske/#dashboard?corpname=jpwac_jp |
featuredService.noske | Search jpWaC-L0|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l0 |
featuredService.noske | Search jpWaC-L1|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l1 |
featuredService.noske | Search jpWaC-L2|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l2 |
featuredService.noske | Search jpWaC-L3|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l3 |
featuredService.noske | Search jpWaC-L4|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l4 |
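The vertical format outlined in dc.description (one token per line, with <text> and <s> as structural attributes) could be read roughly as in the sketch below. Several details are assumptions not stated in this record: that the five positional attributes are tab-separated, that structural tags carry their attributes as name="value" pairs, and that the files are UTF-8 encoded. The names Token, Sentence, read_sentences and read_gzip are hypothetical helpers introduced for illustration only.

```python
import gzip
import re
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str    # 1. token, as it appears in the text
    lemma: str   # 2. lemma of the word
    tag_en: str  # 3. Chasen tag, translated to English
    tag_jp: str  # 4. original Chasen tag in Japanese
    level: str   # 5. difficulty level of the word


class Sentence(NamedTuple):
    url: str
    domain: str
    level: str
    tokens: List[Token]


# Assumption: attributes are written as name="value" inside the tag.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one Sentence per <s>...</s> block found in the input lines."""
    url = domain = level = ""
    tokens = None  # None means we are currently outside a sentence
    for raw in lines:
        line = raw.rstrip("\n")
        if "\t" in line:
            # Assumption: only token lines contain tabs.
            cols = line.split("\t")
            if tokens is not None and len(cols) == 5:
                tokens.append(Token(*cols))
        elif line.startswith("<text"):
            attrs = dict(ATTR.findall(line))
            url, domain = attrs.get("url", ""), attrs.get("domain", "")
        elif line.startswith("<s") and line[2:3] in (" ", ">"):
            level = dict(ATTR.findall(line)).get("level", "")
            tokens = []
        elif line.startswith("</s>"):
            if tokens is not None:
                yield Sentence(url, domain, level, tokens)
            tokens = None


def read_gzip(path: str) -> Iterator[Sentence]:
    """Stream sentences from a .vert.gz file (UTF-8 is an assumption)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from read_sentences(handle)
```

Streaming line by line keeps memory use flat, which matters for the 1.39 GB complete corpus; for example, `for sent in read_gzip("jpWaC-L4.vert.gz")` would iterate over the smallest sub-corpus once it has been downloaded.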
Files in this item
- Name
- jpWaC-L4.vert.gz
- Size
- 1.24 MB
- Format
- application/gzip
- Description
- L4 sentences (very easy)
- MD5
- 50843dff6fcb5068d45703312de081c4

- Name
- jpWaC-L3.vert.gz
- Size
- 4.19 MB
- Format
- application/gzip
- Description
- L3 sentences (easy)
- MD5
- ed5f6d5ac497f9bccbb777d9aa9da16b

- Name
- jpWaC-L2.vert.gz
- Size
- 19.22 MB
- Format
- application/gzip
- Description
- L2 sentences (intermediate)
- MD5
- 23ea8a0e7710a5c63b3ce16a4b420fdd

- Name
- jpWaC-L1.vert.gz
- Size
- 8 MB
- Format
- application/gzip
- Description
- L1 sentences (difficult)
- MD5
- 8e959ec1bffbd26ca0b0fec29c31d222

- Name
- jpWaC-L0.vert.gz
- Size
- 178.92 MB
- Format
- application/gzip
- Description
- L0 sentences (very difficult)
- MD5
- 7ba968fa47702c36bd91d80c36fd5e4b

- Name
- jpWaC-L.vert.gz
- Size
- 1.39 GB
- Format
- application/gzip
- Description
- Complete Web corpus
- MD5
- 08bb1469bc3a21a2a34115c884392d70
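
The MD5 values listed above can be used to check downloaded files for corruption. The following is a minimal sketch using only the Python standard library; the checksums are copied from this record, while the function names md5_of and verify and the idea of keeping all six files in one local directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "jpWaC-L4.vert.gz": "50843dff6fcb5068d45703312de081c4",
    "jpWaC-L3.vert.gz": "ed5f6d5ac497f9bccbb777d9aa9da16b",
    "jpWaC-L2.vert.gz": "23ea8a0e7710a5c63b3ce16a4b420fdd",
    "jpWaC-L1.vert.gz": "8e959ec1bffbd26ca0b0fec29c31d222",
    "jpWaC-L0.vert.gz": "7ba968fa47702c36bd91d80c36fd5e4b",
    "jpWaC-L.vert.gz": "08bb1469bc3a21a2a34115c884392d70",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify(".")` in the download directory returns a dictionary of file names to booleans; a False value means the file is missing or its checksum differs from the one published here.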