dc.contributor.author Bañón, Marta
dc.contributor.author Esplà-Gomis, Miquel
dc.contributor.author Forcada, Mikel L.
dc.contributor.author García-Romero, Cristian
dc.contributor.author Kuzman, Taja
dc.contributor.author Ljubešić, Nikola
dc.contributor.author van Noord, Rik
dc.contributor.author Pla Sempere, Leopoldo
dc.contributor.author Ramírez-Sánchez, Gema
dc.contributor.author Rupnik, Peter
dc.contributor.author Suchomel, Vít
dc.contributor.author Toral, Antonio
dc.contributor.author van der Werff, Tobias
dc.contributor.author Zaragoza, Jaume
dc.date.accessioned 2022-04-28T07:37:39Z
dc.date.available 2022-04-28T07:37:39Z
dc.date.issued 2022-04-28
dc.identifier.uri http://hdl.handle.net/11356/1527
dc.description This is a derivative work based on ParaCrawl release 9 English-Dutch (https://paracrawl.eu/). This version of the corpus adds, for each segment pair, a set of probabilities expressing its affinity to a specific Digital Service Infrastructure (DSI); the DSIs covered include Cybersecurity, Electronic Exchange of Social Security Information, E-health, E-justice, Europeana, Online Dispute Resolution, Open Data Portal and Safer Internet. The probabilities were assigned by a pre-trained language model (DeBERTa-v3-large) fine-tuned on a crawled corpus of English DSI-specific texts. More information is available on the corresponding GitHub page: https://github.com/RikVN/DSI. All other information from the original version of the corpus remains unchanged. Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material. (4) Write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains.
dc.language.iso eng
dc.language.iso nld
dc.publisher Jožef Stefan Institute
dc.publisher Prompsit
dc.publisher Rijksuniversiteit Groningen
dc.publisher Universitat d'Alacant
dc.rights CC0-No Rights Reserved
dc.rights.uri https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label PUB
dc.source.uri https://macocu.eu/
dc.subject parallel corpus
dc.subject web corpus
dc.subject multilingual
dc.subject DSI
dc.title DSI-enriched ParaCrawl 9 en-nl corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Miquel Esplà-Gomis mespla@dlsi.ua.es Universitat d’Alacant
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 89135870 sentences
files.count 2
files.size 59632662655
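
The description above says each segment pair carries a set of DSI probabilities, and the corpus is distributed in TMX format. The following is a minimal Python sketch of how such a file could be streamed one translation unit at a time. It assumes only the generic TMX layout (`<tu>` units holding `<tuv xml:lang="...">` variants with a `<seg>` each, plus optional `<prop>` elements). This record does not state under which `<prop type="...">` names the DSI probabilities are stored, so the sketch returns every property as-is; the function name `iter_translation_units` and the property name `example-score` in the demo are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(stream):
    """Yield (segments, props) for every <tu> element in a TMX byte stream.

    segments maps a language code to the segment text.
    props is a list of (type, text) pairs for all <prop> elements found
    anywhere inside the unit; no particular property name is assumed.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            segments[lang] = "".join(seg.itertext()) if seg is not None else ""
        props = [(prop.get("type"), prop.text) for prop in elem.iter("prop")]
        yield segments, props
        # Drop the unit's children to limit memory use; the emptied <tu>
        # element itself stays attached to its parent.
        elem.clear()


if __name__ == "__main__":
    # Tiny in-memory sample with a hypothetical property name.
    sample = (
        b'<tmx version="1.4"><header/><body><tu>'
        b'<prop type="example-score">0.5</prop>'
        b'<tuv xml:lang="en"><seg>Hello</seg></tuv>'
        b'<tuv xml:lang="nl"><seg>Hallo</seg></tuv>'
        b"</tu></body></tmx>"
    )
    for segments, props in iter_translation_units(io.BytesIO(sample)):
        print(segments, props)
```

For a gzip-compressed TMX file, the stream passed in could be the object returned by `gzip.open(path, "rb")` from the standard library. How the two slices listed under "Files in this item" must be combined before decompression is not stated in this record.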


 Files in this item

This item is publicly available and licensed under CC0-No Rights Reserved.
Name ParaCrawl-DSI-en-nl.tmx.gz.0
Size 48.83 GB
Format application/gzip
Description Corpus in TMX format, slice 0
MD5 f56aedfceeb8c5c79951d8767cc0453e

Name ParaCrawl-DSI-en-nl.tmx.gz.1
Size 6.71 GB
Format application/gzip
Description Corpus in TMX format, slice 1
MD5 9a64f61593dc774c74d5dafde7dd9a0e
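
Because the two files are large, it may be worth checking each download against the MD5 value listed for it. A minimal Python sketch follows, assuming both files were saved under their listed names in one directory; the function names `md5_of_file` and `verify_downloads` are illustrative and not part of any tool distributed with the corpus.

```python
import hashlib
from pathlib import Path

# File names and MD5 digests exactly as listed in this record.
EXPECTED_MD5 = {
    "ParaCrawl-DSI-en-nl.tmx.gz.0": "f56aedfceeb8c5c79951d8767cc0453e",
    "ParaCrawl-DSI-en-nl.tmx.gz.1": "9a64f61593dc774c74d5dafde7dd9a0e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = (md5_of_file(path) == expected)
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

Reading in chunks keeps memory use constant regardless of file size, which matters for the 48.83 GB slice.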