dc.contributor.author Bañón, Marta
dc.contributor.author Esplà-Gomis, Miquel
dc.contributor.author Forcada, Mikel L.
dc.contributor.author García-Romero, Cristian
dc.contributor.author Kuzman, Taja
dc.contributor.author Ljubešić, Nikola
dc.contributor.author van Noord, Rik
dc.contributor.author Pla Sempere, Leopoldo
dc.contributor.author Ramírez-Sánchez, Gema
dc.contributor.author Rupnik, Peter
dc.contributor.author Suchomel, Vít
dc.contributor.author Toral, Antonio
dc.contributor.author van der Werff, Tobias
dc.contributor.author Zaragoza, Jaume
dc.date.accessioned 2022-04-28T07:31:59Z
dc.date.available 2022-04-28T07:31:59Z
dc.date.issued 2022-04-25
dc.identifier.uri http://hdl.handle.net/11356/1526
dc.description This is a derivative work based on ParaCrawl release 9 English-Spanish (https://paracrawl.eu/). This version of the corpus includes a set of probabilities corresponding to the affinity of each segment pair to a specific Digital Service Infrastructure (DSI). The DSIs covered are Cybersecurity, Electronic Exchange of Social Security Information, E-health, E-justice, Europeana, Online Dispute Resolution, Open Data Portal and Safer Internet. The model that assigned the probabilities is a fine-tuned pre-trained language model (DeBERTa-v3-large), trained on a crawled corpus of English DSI-specific texts. More information is available on the corresponding GitHub page: https://github.com/RikVN/DSI. The rest of the information in the original version of the corpus remains unchanged.
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
(1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
(2) Clearly identify the copyrighted work claimed to be infringed.
(3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material.
(4) Write to the contact person for this resource, whose email is available in the full item record.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
dc.language.iso spa
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.publisher Prompsit
dc.publisher Rijksuniversiteit Groningen
dc.publisher Universitat d'Alacant
dc.rights CC0-No Rights Reserved
dc.rights.uri https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label PUB
dc.source.uri https://macocu.eu/
dc.subject parallel corpus
dc.subject web corpus
dc.subject multilingual
dc.subject DSI
dc.title DSI-enriched ParaCrawl 9 en-es corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Miquel Esplà-Gomis mespla@dlsi.ua.es Universitat d’Alacant
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 269394967 sentences
files.count 4
files.size 189854986897


 Files in this item

This item is Publicly Available and licensed under: CC0-No Rights Reserved
Name ParaCrawl-DSI-en-es.tmx.gz.0
Size 48.83 GB
Format Unknown
Description Corpus in TMX format, slice 0
MD5 26364d231ae0a89aad8167b07fd5a9dd

Name ParaCrawl-DSI-en-es.tmx.gz.1
Size 48.83 GB
Format Unknown
Description Corpus in TMX format, slice 1
MD5 41308bbe878e5b7125c82c5b33b9f1b8

Name ParaCrawl-DSI-en-es.tmx.gz.2
Size 48.83 GB
Format Unknown
Description Corpus in TMX format, slice 2
MD5 f5ec54e38bc2bb23b7555a58eff37a30

Name ParaCrawl-DSI-en-es.tmx.gz.3
Size 30.33 GB
Format Unknown
Description Corpus in TMX format, slice 3
MD5 c8481441cc591a292859c631b48a4198
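
Since each slice is tens of gigabytes, it may be worth checking a download against the MD5 values listed for the files in this item before using it. The following is a minimal sketch, not part of the original record: the file names and checksums are copied from the listing, while the function names `md5_of_file` and `verify_slices` and the assumption that all slices sit in a single local directory are hypothetical.

```python
import hashlib
from pathlib import Path

# File names and MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "ParaCrawl-DSI-en-es.tmx.gz.0": "26364d231ae0a89aad8167b07fd5a9dd",
    "ParaCrawl-DSI-en-es.tmx.gz.1": "41308bbe878e5b7125c82c5b33b9f1b8",
    "ParaCrawl-DSI-en-es.tmx.gz.2": "f5ec54e38bc2bb23b7555a58eff37a30",
    "ParaCrawl-DSI-en-es.tmx.gz.3": "c8481441cc591a292859c631b48a4198",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_slices(directory):
    """Map each expected slice name to True if it exists and its MD5 matches.

    Assumption: all slices were downloaded into `directory` under their
    original names.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Check the current working directory and report one line per slice.
    for name, ok in verify_slices(".").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

Reading in fixed-size chunks keeps memory use constant regardless of slice size. A slice reported as a mismatch is most likely a truncated or corrupted download and would need to be fetched again.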
