dc.contributor.author Ljubešić, Nikola
dc.contributor.author Klubička, Filip
dc.date.accessioned 2016-05-12T15:14:59Z
dc.date.available 2016-05-12T15:14:59Z
dc.date.issued 2016-05-12
dc.identifier.uri http://hdl.handle.net/11356/1062
dc.description The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was near-deduplicated at the paragraph level, normalised via diacritic restoration, morphosyntactically annotated, and lemmatised. Paragraphs are shuffled, and each paragraph carries metadata on its URL, domain, and language identification (Bosnian vs. Croatian vs. Serbian). Version 1.0 of this corpus is described in http://www.aclweb.org/anthology/W14-0405. Version 1.1 contains newer and better linguistic annotations.
dc.language.iso bos
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://nlp.ffzg.hr/resources/corpora/bswac/
dc.subject web corpus
dc.title Bosnian web corpus bsWaC 1.1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nljubesi@gmail.com Jožef Stefan Institute
sponsor Swiss National Science Foundation 160501 ReLDI Other
size.info 286865790 tokens
size.info 12886124 sentences
size.info 896059 texts
files.count 3
files.size 1988514951
featuredService.kontext Search|https://www.clarin.si/kontext/first_form?corpname=bswac
featuredService.noske Search|https://www.clarin.si/ske/#dashboard?corpname=bswac
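
The size.info figures above could be cross-checked against the downloaded files with a short script. The following is a minimal sketch, not the maintainers' tooling. It assumes the usual vertical layout (one token per line, structural XML tags on lines of their own), UTF-8 encoding, and the structure names `p` (paragraph) and `s` (sentence); none of these details are stated in this record, and the function names are hypothetical. Totals would need to be summed over all three files.

```python
import gzip
from collections import Counter


def count_vertical(lines, structures=("p", "s")):
    """Count tokens and opening structural tags in vertical-format lines.

    Assumption: every line starting with "<" is a structural tag and every
    other non-empty line is one token. The names in `structures` are
    guesses, not taken from the corpus documentation.
    """
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<"):
            for name in structures:
                # Match "<s>" or "<s attr=...>", but not "</s>" or "<span>".
                if line.startswith(f"<{name}>") or line.startswith(f"<{name} "):
                    counts[name] += 1
            continue
        counts["tokens"] += 1
    return counts


def count_file(path, structures=("p", "s")):
    """Stream one gzipped batch without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_vertical(handle, structures)
```

For example, `count_file("bsWaC1.1.01.xml.gz")` would return a `Counter` with a `tokens` entry and one entry per structure name found.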


 Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name: bsWaC1.1.01.xml.gz
Size: 660.72 MB
Format: application/gzip
Description: Batch of 100 million tokens in XML (vertical) format.
MD5: a96819c59df194644c4bb078f70189cf

Name: bsWaC1.1.02.xml.gz
Size: 660.27 MB
Format: application/gzip
Description: Batch of 100 million tokens in XML (vertical) format.
MD5: 565b6fe0e5fd917c8c57e3d00049f55d

Name: bsWaC1.1.03.xml.gz
Size: 575.41 MB
Format: application/gzip
Description: Batch of 100 million tokens in XML (vertical) format.
MD5: 7720dc117126fa6855cd21506606e176
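
The MD5 values listed above can be used to confirm that each download is intact. The following is a minimal sketch using only the Python standard library. The file names and checksums are copied from this record; the function names `md5_of` and `verify` and the assumption that the files sit in a single local directory are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "bsWaC1.1.01.xml.gz": "a96819c59df194644c4bb078f70189cf",
    "bsWaC1.1.02.xml.gz": "565b6fe0e5fd917c8c57e3d00049f55d",
    "bsWaC1.1.03.xml.gz": "7720dc117126fa6855cd21506606e176",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True only if it exists and matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use flat, which matters for archives of several hundred megabytes.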