dc.contributor.author Ljubešić, Nikola
dc.contributor.author Klubička, Filip
dc.date.accessioned 2016-05-12T16:25:34Z
dc.date.available 2016-05-12T16:25:34Z
dc.date.issued 2016-05-12
dc.identifier.uri http://hdl.handle.net/11356/1064
dc.description The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated at the paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. Paragraphs are shuffled, and each paragraph carries metadata on its URL, domain and language identification (Croatian vs. Serbian). Version 2.0 of this corpus is described in http://www.aclweb.org/anthology/W14-0405. Version 2.1 contains newer and better linguistic annotations.
dc.language.iso hrv
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://nlp.ffzg.hr/resources/corpora/hrwac/
dc.subject web corpus
dc.title Croatian web corpus hrWaC 2.1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nljubesi@gmail.com Jožef Stefan Institute
sponsor Swiss National Science Foundation 160501 ReLDI Other
size.info 1397757548 tokens
size.info 67403219 sentences
size.info 3611090 texts
files.count 15
files.size 9893367868
featuredService.kontext Search|https://www.clarin.si/kontext/first_form?corpname=hrwac
featuredService.noske Search|https://www.clarin.si/noske/run.cgi/corp_info?corpname=hrwac
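
The description above says that each paragraph carries URL, domain and language metadata and that the files are distributed in XML (vertical) format. The following is a minimal sketch of how such a file might be streamed paragraph by paragraph. It rests on several assumptions that this record does not confirm: that paragraphs are wrapped in `<p ...>` tags, that the attribute names are `url`, `domain` and `lang`, that token lines are tab-separated (token, lemma, tag), and that the files are UTF-8 encoded. The sample data and the helper names `iter_paragraphs` and `open_vertical` are hypothetical.

```python
import gzip
import re

# Matches key="value" pairs inside a structural tag such as <p ...>.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_paragraphs(lines):
    """Yield (attributes, tokens) for each <p> element in a vertical file.

    Assumptions (not confirmed by the record): one token per line,
    tab-separated columns, and every line starting with '<' is markup.
    """
    attrs, tokens = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<p"):
            attrs = dict(ATTR_RE.findall(line))
            tokens = []
        elif line.startswith("</p"):
            if attrs is not None:
                yield attrs, tokens
            attrs = None
        elif line.startswith("<") or not line:
            continue  # other structural tags (e.g. <s>) or blank lines
        elif attrs is not None:
            tokens.append(line.split("\t"))


def open_vertical(path):
    """Open a gzip-compressed vertical file as text (UTF-8 is assumed)."""
    return gzip.open(path, "rt", encoding="utf-8")


# Hypothetical sample in the assumed layout, not taken from the corpus.
SAMPLE = (
    '<p url="http://example.hr/a" domain="example.hr" lang="hr">\n'
    "<s>\n"
    "Ovo\tovaj\tPd-nsn\n"
    "je\tbiti\tVar3s\n"
    "</s>\n"
    "</p>\n"
)

paragraphs = list(iter_paragraphs(SAMPLE.splitlines()))
```

For a downloaded batch, the same generator could be fed from `open_vertical("hrWaC2.1.01.xml.gz")`, which reads the file lazily instead of loading it into memory. The attribute names should be checked against the first lines of an actual file before relying on them.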


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name
hrWaC2.1.01.xml.gz
Size
653.91 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
c4a03b997881dc7c8a8d5f892f44a024
Name
hrWaC2.1.02.xml.gz
Size
649.14 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
7158987cecd685e4f2cb280c5bdd16b2
Name
hrWaC2.1.03.xml.gz
Size
646.02 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
fd5ba00acfe24fbb1e5f157cd9769c74
Name
hrWaC2.1.04.xml.gz
Size
645.11 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
4a4da519b35dfa74da01083b5a727b61
Name
hrWaC2.1.05.xml.gz
Size
644.82 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
e375b646d3b78a1a9963f55f69517388
Name
hrWaC2.1.06.xml.gz
Size
644.39 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
075ea103d086aa3c3419b84fc3fb0a82
Name
hrWaC2.1.07.xml.gz
Size
639.35 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
aa7389aa6e3d325de42543aa029d18e5
Name
hrWaC2.1.08.xml.gz
Size
612.43 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
f2d050b40501cb5f1cdefd8c33cde12b
Name
hrWaC2.1.09.xml.gz
Size
602.48 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
8a45a4f4a9b665fc2929b88597f6febc
Name
hrWaC2.1.10.xml.gz
Size
561.95 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
295c08e34eca5ea71bb5e40b984ce946
Name
hrWaC2.1.11.xml.gz
Size
590.42 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
660bb8bb0aa0cb273e3d10b43c58e500
Name
hrWaC2.1.12.xml.gz
Size
644.51 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
c9501c5acb3ad5d9b29fcfd10a37d265
Name
hrWaC2.1.13.xml.gz
Size
637.22 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
89ceac9dffa1322ffc7bbc587461d079
Name
hrWaC2.1.14.xml.gz
Size
618.46 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
c45faa5d3a3b3ae87002b996b3901a09

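Each file in the listing above is published with an MD5 checksum, so a download can be checked for corruption before it is decompressed. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `verify` are hypothetical, and the 1 MiB chunk size is an arbitrary choice made here so that large archives are not read into memory at once.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Compare a file's MD5 digest with the value listed in the record."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Assuming the first batch has been saved under its listed name in the working directory, `verify("hrWaC2.1.01.xml.gz", "c4a03b997881dc7c8a8d5f892f44a024")` should return `True` for an intact download. MD5 is suitable here only as an integrity check against transfer errors, not as protection against deliberate tampering.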