Show simple item record

 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Esplà-Gomis, Miquel
dc.contributor.author Ortiz Rojas, Sergio
dc.contributor.author Klubička, Filip
dc.contributor.author Toral, Antonio
dc.date.accessioned 2016-03-09T16:47:40Z
dc.date.available 2016-03-09T16:47:40Z
dc.date.issued 2016-03-09
dc.identifier.uri http://hdl.handle.net/11356/1058
dc.description The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-level domain for Croatia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext on the segment level is around 80% and on the word level around 84%.
dc.language.iso hrv
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/FP7/324414
dc.rights CLARIN.SI User Licence for Internet Corpora
dc.rights.uri https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
dc.rights.label ACA
dc.source.uri http://nlp.ffzg.hr/resources/corpora/hrenwac/
dc.subject parallel corpus
dc.subject web corpus
dc.subject multilingual
dc.title Croatian-English parallel corpus hrenWaC 2.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nljubesi@gmail.com Jožef Stefan Institute
sponsor European Union FP7-PEOPLE-2012-IAPP PIAP-GA-2012-324414 Abu-MaTran euFunds info:eu-repo/grantAgreement/EC/FP7/324414
size.info 55083246 words
size.info 1554912 sentences
files.count 1
files.size 195521908


 Files in this item

This item is Academic Use and licensed under: CLARIN.SI User Licence for Internet Corpora (Attribution Required, Noncommercial)
Name hrenwac_v2.0.tmx.tgz
Size 186.46 MB
Format Unknown
MD5 a0e008f53bcfe50beebc08b134b3ed69
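
The record lists both a byte count (files.size) and an MD5 checksum for the archive, so a downloaded copy can be checked against them. The following is a minimal sketch in Python, not part of the original record. The function names md5_of_file and matches_record are hypothetical, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
import os

# Values copied from this record (MD5 and files.size).
EXPECTED_MD5 = "a0e008f53bcfe50beebc08b134b3ed69"
EXPECTED_SIZE = 195521908  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Check a local file against the size and MD5 listed in this record."""
    if os.path.getsize(path) != EXPECTED_SIZE:
        return False
    return md5_of_file(path) == EXPECTED_MD5
```

Assuming the archive was saved under its listed name in the current directory, matches_record("hrenwac_v2.0.tmx.tgz") should return True for an intact download. The size is compared first because it is cheap and catches truncated downloads before the whole file is hashed.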