dc.contributor.author |
Ljubešić, Nikola |
dc.contributor.author |
Esplà-Gomis, Miquel |
dc.contributor.author |
Ortiz Rojas, Sergio |
dc.contributor.author |
Klubička, Filip |
dc.contributor.author |
Toral, Antonio |
dc.date.accessioned |
2016-03-10T15:21:18Z |
dc.date.available |
2016-03-10T15:21:18Z |
dc.date.issued |
2016-03-10 |
dc.identifier.uri |
http://hdl.handle.net/11356/1061 |
dc.description |
The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from the .si top-level domain (Slovenia). The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing, used for crawling, and Bitextor, used for bitext extraction. The accuracy of the extracted bitext is around 67% on the segment level and around 68% on the word level. |
dc.language.iso |
slv |
dc.language.iso |
eng |
dc.publisher |
Jožef Stefan Institute |
dc.relation |
info:eu-repo/grantAgreement/EC/FP7/324414 |
dc.rights |
CLARIN.SI User Licence for Internet Corpora |
dc.rights.uri |
https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf |
dc.rights.label |
ACA |
dc.subject |
parallel corpus |
dc.subject |
web corpus |
dc.subject |
multilingual |
dc.title |
Slovene-English parallel corpus slenWaC 1.0 |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
has.files |
yes |
branding |
CLARIN.SI data & tools |
contact.person |
Nikola Ljubešić, nljubesi@gmail.com, Jožef Stefan Institute |
sponsor |
European Union, FP7-PEOPLE-2012-IAPP PIAP-GA-2012-324414, Abu-MaTran, euFunds, info:eu-repo/grantAgreement/EC/FP7/324414 |
size.info |
27924210 words |
size.info |
718315 sentences |
files.count |
1 |
files.size |
99032558 |