| dc.contributor.author |
Ljubešić, Nikola |
| dc.contributor.author |
Esplà-Gomis, Miquel |
| dc.contributor.author |
Ortiz Rojas, Sergio |
| dc.contributor.author |
Klubička, Filip |
| dc.contributor.author |
Toral, Antonio |
| dc.date.accessioned |
2016-03-09T16:47:40Z |
| dc.date.available |
2016-03-09T16:47:40Z |
| dc.date.issued |
2016-03-09 |
| dc.identifier.uri |
http://hdl.handle.net/11356/1058 |
| dc.description |
Version 2.0 of the hrenWaC corpus consists of parallel Croatian-English texts crawled from the Croatian .hr top-level domain. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that connects SpiderLing, used for crawling, with Bitextor, used for bitext extraction. The accuracy of the extracted bitext is around 80% at the segment level and around 84% at the word level. |
| dc.language.iso |
hrv |
| dc.language.iso |
eng |
| dc.publisher |
Jožef Stefan Institute |
| dc.relation |
info:eu-repo/grantAgreement/EC/FP7/324414 |
| dc.rights |
CLARIN.SI User Licence for Internet Corpora |
| dc.rights.uri |
https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf |
| dc.rights.label |
ACA |
| dc.source.uri |
http://nlp.ffzg.hr/resources/corpora/hrenwac/ |
| dc.subject |
parallel corpus |
| dc.subject |
web corpus |
| dc.subject |
multilingual |
| dc.title |
Croatian-English parallel corpus hrenWaC 2.0 |
| dc.type |
corpus |
| metashare.ResourceInfo#ContentInfo.mediaType |
text |
| has.files |
yes |
| branding |
CLARIN.SI data & tools |
| contact.person |
Nikola Ljubešić nljubesi@gmail.com Jožef Stefan Institute |
| sponsor |
European Union FP7-PEOPLE-2012-IAPP PIAP-GA-2012-324414 Abu-MaTran euFunds info:eu-repo/grantAgreement/EC/FP7/324414 |
| size.info |
55083246 words |
| size.info |
1554912 sentences |
| files.count |
1 |
| files.size |
195521908 |