
 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Esplà-Gomis, Miquel
dc.contributor.author Ortiz Rojas, Sergio
dc.contributor.author Klubička, Filip
dc.contributor.author Toral, Antonio
dc.date.accessioned 2016-03-09T17:05:19Z
dc.date.available 2016-03-09T17:05:19Z
dc.date.issued 2016-03-09
dc.identifier.uri http://hdl.handle.net/11356/1060
dc.description The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that combines the output of SpiderLing, used for crawling, with Bitextor, used for bitext extraction. Based on evaluation results for other languages, the accuracy of the extracted bitext is estimated at 74% at the segment level and 76% at the word level.
dc.language.iso fin
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/FP7/324414
dc.rights CLARIN.SI User Licence for Internet Corpora
dc.rights.uri https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
dc.rights.label ACA
dc.subject parallel corpus
dc.subject web corpus
dc.subject multilingual
dc.title Finnish-English parallel corpus fienWaC 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nljubesi@gmail.com Jožef Stefan Institute
sponsor European Union FP7-PEOPLE-2012-IAPP PIAP-GA-2012-324414 Abu-MaTran euFunds info:eu-repo/grantAgreement/EC/FP7/324414
size.info 77048083 words
size.info 2866574 sentences
files.count 1
files.size 297448076


 Files in this item

This item is Academic Use and licensed under: CLARIN.SI User Licence for Internet Corpora (Attribution Required, Noncommercial)
Name fienwac_v1.0.tmx.tgz
Size 283.67 MB
Format Unknown
MD5 0e702cf28c098fb72a5e9f815170b519
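The size and MD5 values listed for the file can be used to check a download for corruption. The following is a minimal sketch in Python, assuming the archive has been saved locally; the function names (`md5_of_file`, `verify_download`, `bytes_to_mib`) are illustrative and not part of any tool mentioned in this record. The expected values are copied from the files.size and MD5 fields above.

```python
import hashlib
from pathlib import Path

# Values copied from this record (files.size and MD5).
EXPECTED_MD5 = "0e702cf28c098fb72a5e9f815170b519"
EXPECTED_SIZE = 297448076  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Check both the byte size and the MD5 digest of a local file."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()


def bytes_to_mib(num_bytes):
    """Convert a byte count to mebibytes, rounded to two decimals."""
    return round(num_bytes / 2**20, 2)
```

A call such as `verify_download("fienwac_v1.0.tmx.tgz")` would return `True` only if both the size and the checksum match. The displayed size of 283.67 MB corresponds to the byte count divided by 2^20, which is what `bytes_to_mib(297448076)` computes.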
