dc.contributor.author Ljubešić, Nikola
dc.contributor.author Esplà-Gomis, Miquel
dc.contributor.author Ortiz Rojas, Sergio
dc.contributor.author Klubička, Filip
dc.contributor.author Toral, Antonio
dc.date.accessioned 2016-03-09T16:51:44Z
dc.date.available 2016-03-09T16:51:44Z
dc.date.issued 2016-03-09
dc.identifier.uri http://hdl.handle.net/11356/1059
dc.description The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from the .rs top-level domain for Serbia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that combines the output of SpiderLing, used for crawling, with Bitextor, used for bitext extraction. Based on evaluation results for other languages, the accuracy of the extracted bitext is estimated at 74% on the sentence level and 76% on the word level.
dc.language.iso srp
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/FP7/324414
dc.rights CLARIN.SI User Licence for Internet Corpora
dc.rights.uri https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
dc.rights.label ACA
dc.subject parallel corpus
dc.subject web corpus
dc.subject multilingual
dc.title Serbian-English parallel corpus srenWaC 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nljubesi@gmail.com Jožef Stefan Institute
sponsor European Union FP7-PEOPLE-2012-IAPP PIAP-GA-2012-324414 Abu-MaTran euFunds info:eu-repo/grantAgreement/EC/FP7/324414
size.info 23139804 words
size.info 534682 sentences
files.count 1
files.size 74389140


 Files in this item

This item is available for Academic Use and is licensed under:
CLARIN.SI User Licence for Internet Corpora (Attribution Required, Noncommercial)
Name: srenwac_v1.0.tmx.tgz
Size: 70.94 MB
Format: Unknown
Description: TMX as gzipped tar
MD5: 90e15d9587c7dd892b89edc079d35c9d
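
A downloaded copy of the archive can be checked against the MD5 value listed for the file. The following is a minimal sketch in Python, assuming the archive was saved under its listed name in the current working directory; the function names `md5_of_file` and `verify_download` are illustrative and not part of any CLARIN.SI tooling.

```python
import hashlib

# MD5 value as listed in this record for srenwac_v1.0.tmx.tgz
EXPECTED_MD5 = "90e15d9587c7dd892b89edc079d35c9d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("srenwac_v1.0.tmx.tgz")` returns `True` only when the local file matches the listed checksum; the path is an assumption about where the file was saved.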