dc.contributor.author Ljubešić, Nikola
dc.contributor.author Esplà-Gomis, Miquel
dc.contributor.author Ortiz Rojas, Sergio
dc.contributor.author Klubička, Filip
dc.contributor.author Toral, Antonio
dc.date.accessioned 2016-03-10T15:21:18Z
dc.date.available 2016-03-10T15:21:18Z
dc.date.issued 2016-03-10
dc.identifier.uri http://hdl.handle.net/11356/1061
dc.description The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from .si, the top-level domain of Slovenia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing, used for crawling, and Bitextor, used for bitext extraction. The accuracy of the extracted bitext is around 67% at the segment level and around 68% at the word level.
dc.language.iso slv
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/FP7/324414
dc.rights CLARIN.SI User Licence for Internet Corpora
dc.rights.uri https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
dc.rights.label ACA
dc.subject parallel corpus
dc.subject web corpus
dc.subject multilingual
dc.title Slovene-English parallel corpus slenWaC 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić, nljubesi@gmail.com, Jožef Stefan Institute
sponsor European Union FP7-PEOPLE-2012-IAPP PIAP-GA-2012-324414 Abu-MaTran euFunds info:eu-repo/grantAgreement/EC/FP7/324414
size.info 27924210 words
size.info 718315 sentences
files.count 1
files.size 99032558
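
The name of the single file in this item ends in .tmx.tgz, which suggests a gzip-compressed tar archive holding one or more TMX files. Under that assumption, the sketch below streams translation units out of the archive without unpacking it to disk. The function names are hypothetical, the archive layout has not been verified against the actual download, and the language codes used in the corpus are not stated in this record, so the code simply reports whatever codes it finds.

```python
import tarfile
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(tmx_source):
    """Yield one {language_code: segment_text} dict per <tu> element.

    tmx_source may be a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                unit[lang] = "".join(seg.itertext()).strip()
        yield unit
        # Drop the unit's children once it has been consumed to limit memory use.
        elem.clear()


def iter_archive_units(archive_path):
    """Yield translation units from every .tmx member of a .tgz archive.

    Assumption: the archive contains regular files whose names end in .tmx.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".tmx"):
                yield from iter_translation_units(archive.extractfile(member))
```

A call such as `next(iter_archive_units("slenwac_v1.0.tmx.tgz"))` would return the first aligned segment pair, which is a quick way to check which language codes the file actually uses.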


 Files in this item

This item is Academic Use and licensed under:
CLARIN.SI User Licence for Internet Corpora (Attribution Required, Noncommercial)
Name slenwac_v1.0.tmx.tgz
Size 94.44 MB
Format Unknown
MD5 be96214288614fde742d59be52d71cf7
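
The MD5 value above can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library; the function names and the local file path are hypothetical, and the expected digest is copied from this record.

```python
import hashlib

# Digest copied from the MD5 field of this record.
EXPECTED_MD5 = "be96214288614fde742d59be52d71cf7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a ~95 MB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("slenwac_v1.0.tmx.tgz")` should return `True` for an intact copy, assuming the file was saved under that name in the current directory. MD5 is adequate for detecting transfer corruption, though not for guarding against deliberate tampering.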