
 
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Hmeljak Sangawa, Kristina
dc.contributor.author Kawamura, Yoshiko
dc.date.accessioned 2015-08-05T12:53:31Z
dc.date.available 2015-08-05T12:53:31Z
dc.date.issued 2008-11-14
dc.identifier.uri http://hdl.handle.net/11356/1047
dc.description The corpus contains over 300 million words, with words and sentences annotated for difficulty level. Words are assigned difficulty levels according to the Japanese Language Proficiency Test Content Specifications (2004). The difficulty level of each sentence is computed using various heuristics based on the difficulty levels of its words, sentence length, etc. We distinguish 5 difficulty levels, from L0 (very difficult) to L4 (very easy). The corpus was collected from the Web using WaCkY tools, then part-of-speech tagged and lemmatised with Chasen. The Japanese Chasen tags have also been converted to English-based tags. The corpora are made available in vertical format. The structural attributes are <text> and <s> (sentence). Each text gives its @url and @domain. Sentences have the @level attribute, which gives their difficulty level. The positional attributes are: (1) token, as it appears in the text; (2) lemma of the word; (3) Chasen tag, translated to English; (4) original Chasen tag in Japanese; (5) difficulty level of the word. The complete corpus is also split into sub-corpora of sentences with the same difficulty level.
dc.language.iso jpn
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://nl.ijs.si/jaslo/index-en.html#jpwac
dc.subject difficulty level
dc.subject teaching corpus
dc.subject TEI
dc.title Japanese web corpus with difficulty levels jpWaC-L 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor Japan Foundation Japanese-Language Education Fellowship (Kristina Hmeljak Sangawa - March to July 2005) Other
sponsor ARRS/JSPS Slovenia-Japan bilateral project (Tomaž Erjavec - November 2008) Other
size.info 409030315 tokens
files.count 6
files.size 1712910197
featuredService.kontext Search jpWaC-L|https://www.clarin.si/kontext/first_form?corpname=jpwac_jp
featuredService.kontext Search jpWaC-L0|https://www.clarin.si/kontext/first_form?corpname=jpwac_l0
featuredService.kontext Search jpWaC-L1|https://www.clarin.si/kontext/first_form?corpname=jpwac_l1
featuredService.kontext Search jpWaC-L2|https://www.clarin.si/kontext/first_form?corpname=jpwac_l2
featuredService.kontext Search jpWaC-L3|https://www.clarin.si/kontext/first_form?corpname=jpwac_l3
featuredService.kontext Search jpWaC-L4|https://www.clarin.si/kontext/first_form?corpname=jpwac_l4
featuredService.noske Search jpWaC-L|https://www.clarin.si/ske/#dashboard?corpname=jpwac_jp
featuredService.noske Search jpWaC-L0|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l0
featuredService.noske Search jpWaC-L1|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l1
featuredService.noske Search jpWaC-L2|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l2
featuredService.noske Search jpWaC-L3|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l3
featuredService.noske Search jpWaC-L4|https://www.clarin.si/ske/#dashboard?corpname=jpwac_l4
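
The vertical format described in dc.description can be read with a short script. The following is a minimal sketch, not the authors' tooling. It assumes that the five positional attributes are tab-separated (the usual convention for vertical files), that structural tags carry double-quoted attributes such as level="...", and that the files are UTF-8 encoded. None of this is stated in the record. The names Token, read_sentences and open_corpus are hypothetical.

```python
import gzip
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """One token line: the five positional attributes listed in dc.description."""
    form: str     # 1. token, as it appears in the text
    lemma: str    # 2. lemma of the word
    tag_en: str   # 3. Chasen tag, translated to English
    tag_ja: str   # 4. original Chasen tag in Japanese
    level: str    # 5. difficulty level of the word


# Assumption: attributes are written as name="value" inside the tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_sentences(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], List[Token]]]:
    """Yield (sentence level, tokens) for every <s> ... </s> block."""
    level: Optional[str] = None
    tokens: List[Token] = []
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "<s>" or line.startswith("<s "):
            level = dict(ATTR_RE.findall(line)).get("level")
            tokens = []
            in_sentence = True
        elif line == "</s>":
            if in_sentence:
                yield level, tokens
            in_sentence = False
        elif line.startswith("<") and line.endswith(">") and "\t" not in line:
            continue  # other structural tags, e.g. <text ...> and </text>
        elif in_sentence and line:
            fields = line.split("\t")
            if len(fields) == 5:  # skip lines that do not match the assumed layout
                tokens.append(Token(*fields))


def open_corpus(path: str) -> Iterator[str]:
    """Stream lines from a gzip-compressed vertical file (UTF-8 assumed)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle
```

A call such as read_sentences(open_corpus("jpWaC-L4.vert.gz")) streams the file sentence by sentence, so even the complete corpus does not need to fit in memory. The level values are kept as strings because the record does not say how they are encoded.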


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name              Size       Format            Description                    MD5
jpWaC-L4.vert.gz  1.24 MB    application/gzip  L4 sentences (very easy)       50843dff6fcb5068d45703312de081c4
jpWaC-L3.vert.gz  4.19 MB    application/gzip  L3 sentences (easy)            ed5f6d5ac497f9bccbb777d9aa9da16b
jpWaC-L2.vert.gz  19.22 MB   application/gzip  L2 sentences (intermediate)    23ea8a0e7710a5c63b3ce16a4b420fdd
jpWaC-L1.vert.gz  8 MB       application/gzip  L1 sentences (difficult)       8e959ec1bffbd26ca0b0fec29c31d222
jpWaC-L0.vert.gz  178.92 MB  application/gzip  L0 sentences (very difficult)  7ba968fa47702c36bd91d80c36fd5e4b
jpWaC-L.vert.gz   1.39 GB    application/gzip  Complete Web corpus            08bb1469bc3a21a2a34115c884392d70
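
The MD5 checksums listed for each file can be used to confirm that a download completed intact. The sketch below is one way to do this with Python's standard hashlib module. The function names md5_of_file and verify_download are hypothetical, and the checksums are copied from the file listing in this record.

```python
import hashlib

# Checksums as published in this record's file listing.
EXPECTED_MD5 = {
    "jpWaC-L4.vert.gz": "50843dff6fcb5068d45703312de081c4",
    "jpWaC-L3.vert.gz": "ed5f6d5ac497f9bccbb777d9aa9da16b",
    "jpWaC-L2.vert.gz": "23ea8a0e7710a5c63b3ce16a4b420fdd",
    "jpWaC-L1.vert.gz": "8e959ec1bffbd26ca0b0fec29c31d222",
    "jpWaC-L0.vert.gz": "7ba968fa47702c36bd91d80c36fd5e4b",
    "jpWaC-L.vert.gz": "08bb1469bc3a21a2a34115c884392d70",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the 1.39 GB file is never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, name: str) -> bool:
    """Compare a local file against the checksum published for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, verify_download("jpWaC-L4.vert.gz", "jpWaC-L4.vert.gz") returns True only if the local copy matches the published checksum. MD5 detects accidental corruption but is not a safeguard against deliberate tampering.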
