dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Kopp, Matyáš |
dc.contributor.author | Ogrodniczuk, Maciej |
dc.contributor.author | Osenova, Petya |
dc.contributor.author | Rayson, Paul |
dc.contributor.author | Vidler, John |
dc.contributor.author | Agerri, Rodrigo |
dc.contributor.author | Agirrezabal, Manex |
dc.contributor.author | Agnoloni, Tommaso |
dc.contributor.author | Aires, José |
dc.contributor.author | Albini, Monica |
dc.contributor.author | Alkorta, Jon |
dc.contributor.author | Antiba-Cartazo, Iván |
dc.contributor.author | Arrieta, Ekain |
dc.contributor.author | Barcala, Mario |
dc.contributor.author | Bardanca, Daniel |
dc.contributor.author | Barkarson, Starkaður |
dc.contributor.author | Bartolini, Roberto |
dc.contributor.author | Battistoni, Roberto |
dc.contributor.author | Bel, Nuria |
dc.contributor.author | Bonet Ramos, Maria del Mar |
dc.contributor.author | Calzada Pérez, María |
dc.contributor.author | Cardoso, Aida |
dc.contributor.author | Çöltekin, Çağrı |
dc.contributor.author | Coole, Matthew |
dc.contributor.author | Darģis, Roberts |
dc.contributor.author | de Does, Jesse |
dc.contributor.author | de Libano, Ruben |
dc.contributor.author | Depoorter, Griet |
dc.contributor.author | Depuydt, Katrien |
dc.contributor.author | Diwersy, Sascha |
dc.contributor.author | Dodé, Réka |
dc.contributor.author | Fernandez, Kike |
dc.contributor.author | Fernández Rei, Elisa |
dc.contributor.author | Frontini, Francesca |
dc.contributor.author | Garcia, Marcos |
dc.contributor.author | García Díaz, Noelia |
dc.contributor.author | García Louzao, Pedro |
dc.contributor.author | Gavriilidou, Maria |
dc.contributor.author | Gkoumas, Dimitris |
dc.contributor.author | Grigorov, Ilko |
dc.contributor.author | Grigorova, Vladislava |
dc.contributor.author | Haltrup Hansen, Dorte |
dc.contributor.author | Iruskieta, Mikel |
dc.contributor.author | Jarlbrink, Johan |
dc.contributor.author | Jelencsik-Mátyus, Kinga |
dc.contributor.author | Jongejan, Bart |
dc.contributor.author | Kahusk, Neeme |
dc.contributor.author | Kirnbauer, Martin |
dc.contributor.author | Kryvenko, Anna |
dc.contributor.author | Ligeti-Nagy, Noémi |
dc.contributor.author | Luxardo, Giancarlo |
dc.contributor.author | Magariños, Carmen |
dc.contributor.author | Magnusson, Måns |
dc.contributor.author | Marchetti, Carlo |
dc.contributor.author | Marx, Maarten |
dc.contributor.author | Meden, Katja |
dc.contributor.author | Mendes, Amália |
dc.contributor.author | Mochtak, Michal |
dc.contributor.author | Mölder, Martin |
dc.contributor.author | Montemagni, Simonetta |
dc.contributor.author | Navarretta, Costanza |
dc.contributor.author | Nitoń, Bartłomiej |
dc.contributor.author | Norén, Fredrik Mohammadi |
dc.contributor.author | Nwadukwe, Amanda |
dc.contributor.author | Ojsteršek, Mihael |
dc.contributor.author | Pančur, Andrej |
dc.contributor.author | Papavassiliou, Vassilis |
dc.contributor.author | Pereira, Rui |
dc.contributor.author | Pérez Lago, María |
dc.contributor.author | Piperidis, Stelios |
dc.contributor.author | Pirker, Hannes |
dc.contributor.author | Pisani, Marilina |
dc.contributor.author | Pol, Henk van der |
dc.contributor.author | Prokopidis, Prokopis |
dc.contributor.author | Quochi, Valeria |
dc.contributor.author | Regueira, Xosé Luís |
dc.contributor.author | Rii, Andriana |
dc.contributor.author | Rudolf, Michał |
dc.contributor.author | Ruisi, Manuela |
dc.contributor.author | Rupnik, Peter |
dc.contributor.author | Schopper, Daniel |
dc.contributor.author | Simov, Kiril |
dc.contributor.author | Sinikallio, Laura |
dc.contributor.author | Skubic, Jure |
dc.contributor.author | Tamper, Minna |
dc.contributor.author | Tungland, Lars Magne |
dc.contributor.author | Tuominen, Jouni |
dc.contributor.author | van Heusden, Ruben |
dc.contributor.author | Varga, Zsófia |
dc.contributor.author | Vázquez Abuín, Marta |
dc.contributor.author | Venturi, Giulia |
dc.contributor.author | Vidal Miguéns, Adrián |
dc.contributor.author | Vider, Kadri |
dc.contributor.author | Vivel Couso, Ainhoa |
dc.contributor.author | Vladu, Adina Ioana |
dc.contributor.author | Wissik, Tanja |
dc.contributor.author | Yrjänäinen, Väinö |
dc.contributor.author | Zevallos, Rodolfo |
dc.contributor.author | Fišer, Darja |
dc.date.accessioned | 2024-06-04T18:48:58Z |
dc.date.available | 2024-06-04T18:48:58Z |
dc.date.issued | 2024-06-03 |
dc.identifier.uri | http://hdl.handle.net/11356/1910 |
dc.description | ParlaMint-en.ana 4.1 is the English machine translation of the ParlaMint.ana 4.1 (http://hdl.handle.net/11356/1911) set of corpora of parliamentary debates across Europe. The translation is linguistically annotated similarly to the original language corpora (but without UD syntax), and with the addition of USAS semantic tags (https://ucrel.lancs.ac.uk/usas/). Because of the addition of semantic tags the UK corpus (ParlaMint-GB) is also included. The translation to English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done on the sentence level, and includes both speeches and transcriber notes, including headings. Note that corpus metadata is mostly available both in the source language and in English. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation was done with Stanza (https://stanfordnlp.github.io/stanza/) using the conll03 model (4 classes). The annotation of MWEs (phrases) and tokens with USAS tags was done with pyMusas (https://github.com/ucrel/pymusas). Note that the English in the corpora contains typical NMT errors, including factual errors even when high fluency is achieved, and any use of this corpus should take the machine translation limitations into account. The files associated with this entry include the machine translated and linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. The CoNLL-U files include pyMusas USAS tags. 
Also included is the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the (open) issues at the GitHub repository of the project. Compared to the previous version 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. Speeches in the DK corpus are now also marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11; the UA corpus also has improved language marking (uk vs. ru) on segments. |
dc.language.iso | eng |
dc.publisher | CLARIN ERIC |
dc.relation.isreferencedby | https://doi.org/10.1007/s10579-024-09798-w |
dc.relation.replaces | http://hdl.handle.net/11356/1864 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.eu/content/parlamint |
dc.subject | Parla-CLARIN |
dc.subject | parliamentary debates |
dc.subject | COVID-19 |
dc.subject | TEI |
dc.subject | Bulgarian Parliament |
dc.subject | Croatian Parliament |
dc.subject | Polish Parliament |
dc.subject | Slovenian Parliament |
dc.subject | Czech Parliament |
dc.subject | Icelandic Parliament |
dc.subject | Belgian Parliament |
dc.subject | Danish Parliament |
dc.subject | Dutch Parliament |
dc.subject | Turkish Parliament |
dc.subject | Italian Parliament |
dc.subject | Hungarian Parliament |
dc.subject | Latvian Parliament |
dc.subject | French Parliament |
dc.subject | Bosnian Parliament |
dc.subject | Catalonian Parliament |
dc.subject | Galician Parliament |
dc.subject | Greek Parliament |
dc.subject | Norwegian Parliament |
dc.subject | Portuguese Parliament |
dc.subject | Serbian Parliament |
dc.subject | Swedish Parliament |
dc.subject | Ukrainian Parliament |
dc.subject | Austrian Parliament |
dc.subject | Estonian Parliament |
dc.subject | Spanish Parliament |
dc.subject | Finnish Parliament |
dc.subject | Basque Parliament |
dc.subject | British Parliament |
dc.title | Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.1 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://github.com/clarin-eric/ParlaMint/ |
contact.person | Matyáš Kopp kopp@ufal.mff.cuni.cz Charles University |
contact.person | Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other |
sponsor | Austrian Academy of Sciences - ÖAW nationalFunds |
sponsor | European Commission POIR.04.02.00-00C002/19 European Regional Development Fund as a part of the 2014-2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure Other |
sponsor | Dutch Language Institute - - nationalFunds |
sponsor | Ministry of Education, Youth and Sports of the Czech Republic LM2023062 LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities nationalFunds |
sponsor | Department of Nordic Studies and Linguistics (NorS), University of Copenhagen CLARIN-DK CLARIN-DK nationalFunds |
sponsor | Galician Language Institute, University of Santiago de Compostela - - ownFunds |
sponsor | Xunta de Galicia - University of Santiago de Compostela 2021-CP080 Nós: Galician in the society and economy of artificial intelligence (2021-CP080), agreement between Xunta de Galicia and the University of Santiago de Compostela nationalFunds |
sponsor | Hungarian Research Centre for Linguistics - - nationalFunds |
sponsor | National Library of Norway - - nationalFunds |
sponsor | Institute of Computer Science, Polish Academy of Sciences - statutory research nationalFunds |
sponsor | Polish Ministry of Education and Science 2022/WK/09 National contribution to CLARIN ERIC – European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure 2022–2023 (CLARIN Q) nationalFunds |
sponsor | Fundação para a Ciência e a Tecnologia UIDP/00214/2020 - nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | Nederlandse Organisatie voor Wetenschappelijk Onderwijs CISC.CC.016 Access to City Councils using Exploratory Search Systems nationalFunds |
sponsor | Bulgarian Ministry of Education and Science DO1-301/17.12.21 Bulgarian National Interdisciplinary Research e-Infrastructure for Resources and Technologies in favor of the Bulgarian Language and Cultural Heritage, part of the EU infrastructures CLARIN and DARIAH nationalFunds |
sponsor | Institute for Language and Speech Processing / ATHENA RC - - nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
sponsor | The Árni Magnússon Institute for Icelandic Studies - - ownFunds |
sponsor | Slovenian Research Agency (ARRS) P6-0436 Basic national research program 'Digital Humanities' (2022-2027) nationalFunds |
sponsor | ARRS (Slovenian Research Agency) N6-0099 Flemish-Slovenian bilateral basic research project ‘Linguistic landscape of hate speech online’ (2019-2023) nationalFunds |
sponsor | ARRS (Slovenian Research Agency) N6-0288 the MSCA Seal of Excellence postdoctoral project 'The Changing Discursive Semantics of EU Representations' (2022-2024) nationalFunds |
sponsor | Ministry of Science and Innovation of Spain - - nationalFunds |
sponsor | HiTZ - Ixa Group (UPV/EHU) - - Other |
size.info | 8132022 utterances |
size.info | 1364870493 words |
files.count | 31 |
files.size | 57298306705 |
featuredService.noske | search MTed corpora|https://www.clarin.si/ske/#dashboard?corpname=parlamint41_xx_en |
featuredService.noske | search original corpora|https://www.clarin.si/ske/#dashboard?corpname=parlamint40_xx |
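The CoNLL-U files mentioned in the description can be read with a few lines of standard-library Python. The following is a minimal sketch, assuming only the standard ten tab-separated CoNLL-U columns; the sample sentence is invented for illustration and is not taken from the corpus, and the function name `read_conllu` is hypothetical. Where exactly the USAS tags are stored in each token line is not stated in this record, so the sketch simply exposes every column, including MISC, for inspection.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment lines (e.g. sent_id, text) are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented sample (NOT corpus data); syntax columns are left empty ("_"),
# in line with the note that the translation carries no UD syntax.
SAMPLE = "\n".join([
    "# sent_id = example.1",
    "\t".join(["1", "Thank", "thank", "VERB", "VB", "_", "_", "_", "_", "_"]),
    "\t".join(["2", "you", "you", "PRON", "PRP", "_", "_", "_", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", ".", "_", "_", "_", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE.splitlines()))
```

For a real file, the same function can be fed an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to a CoNLL-U file extracted from one of the archives.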
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)



Name | Size | Format | Description | MD5 |
ParlaMint-AT-en.ana.tgz | 2.67 GB | Unknown | Austrian corpus | a58f626ca2cba043052f34a5daea74f7 |
ParlaMint-BA-en.ana.tgz | 826.67 MB | Unknown | Bosnian corpus | 8b9d20dafe102c89800f7e3a5aadb2bb |
ParlaMint-BE-en.ana.tgz | 1.7 GB | Unknown | Belgian corpus | b89acc55cc8703790177d4e1147b6fc1 |
ParlaMint-BG-en.ana.tgz | 1.12 GB | Unknown | Bulgarian corpus | ed2a385aee91e2d0e21fef76d426e237 |
ParlaMint-CZ-en.ana.tgz | 1.47 GB | Unknown | Czech corpus | 531d048ba755a1e5d8c4e615982722a5 |
ParlaMint-DK-en.ana.tgz | 1.6 GB | Unknown | Danish corpus | 68b4e212677a2aedca775ff6a5b7a123 |
ParlaMint-EE-en.ana.tgz | 1.22 GB | Unknown | Estonian corpus | 07e5873214aa745f849c4e714cd73679 |
ParlaMint-ES-en.ana.tgz | 784.8 MB | Unknown | Spanish corpus | 935a2a93a2e8ccabc24e8ece5a700295 |
ParlaMint-ES-CT-en.ana.tgz | 634.32 MB | Unknown | Catalan corpus | c5d8a52b6a98c6233317cfed9b3004d7 |
ParlaMint-ES-GA-en.ana.tgz | 747.34 MB | Unknown | Galician corpus | 2ccfba306f8078867286334a90ad1119 |
ParlaMint-ES-PV-en.ana.tgz | 563.72 MB | Unknown | Basque corpus | 2d2a6f0a4a2a82a1b9678ce348c08ac7 |
ParlaMint-FI-en.ana.tgz | 790.64 MB | Unknown | Finnish corpus | 4d69333bad7bde7f4647e468c2635b75 |
ParlaMint-FR-en.ana.tgz | 1.88 GB | Unknown | French corpus | 46abf342dd8133eb6943ce44c9ef8fbf |
ParlaMint-GB-en.ana.tgz | 4.86 GB | Unknown | British corpus | 757886ea8fd220c473a238de301f4171 |
ParlaMint-GR-en.ana.tgz | 2.14 GB | Unknown | Greek corpus | 7609ed5e86552b11f4fb1a2d87b5c540 |
ParlaMint-HR-en.ana.tgz | 3.86 GB | Unknown | Croatian corpus | 8146dcd9228fd6154493cfd751727afd |
ParlaMint-HU-en.ana.tgz | 1.49 GB | Unknown | Hungarian corpus | 4cc90667be72bd56bc85135c68d7cb78 |
ParlaMint-IS-en.ana.tgz | 1.27 GB | Unknown | Icelandic corpus | 57d899285eb992b6cf1da26b5f90275c |
ParlaMint-IT-en.ana.tgz | 1.33 GB | Unknown | Italian corpus | 60b5c7b51d3a0ff78d3788ee9051d2e9 |
ParlaMint-LV-en.ana.tgz | 491.39 MB | Unknown | Latvian corpus | 5f025014b84cbc77b39c24fa5691146a |
ParlaMint-NL-en.ana.tgz | 2.65 GB | Unknown | Dutch corpus | 46e88465bcbed4d17d693164565f4522 |
ParlaMint-NO-en.ana.tgz | 3.84 GB | Unknown | Norwegian corpus | 8cbca2c1259add2689066344cc96b00b |
ParlaMint-PL-en.ana.tgz | 1.68 GB | Unknown | Polish corpus | 2f4889dd844f0605041bff0ec05f757f |
ParlaMint-PT-en.ana.tgz | 981.91 MB | Unknown | Portuguese corpus | ec0aa88a221fb508b26930cb4e4cb5c3 |
ParlaMint-RS-en.ana.tgz | 3.63 GB | Unknown | Serbian corpus | 8337bb60e6c6e8335b9669a2a58a4a4b |
ParlaMint-SE-en.ana.tgz | 1.36 GB | Unknown | Swedish corpus | 3fb4a80780a806f8f4baaf900db332d7 |
ParlaMint-SI-en.ana.tgz | 3.19 GB | Unknown | Slovenian corpus | 8c3180a55187659fbd75245a08265c7d |
ParlaMint-TR-en.ana.tgz | 2.54 GB | Unknown | Turkish corpus | 1a3b308c7d228ff7410e7cdf7483cdd6 |
ParlaMint-UA-en.ana.tgz | 2.13 GB | Unknown | Ukrainian corpus | b0657d56e1441c2aa1e454c204be2efa |
ParlaMint-4.1.tgz | 18.77 MB | Unknown | https://github.com/clarin-eric/ParlaMint/releases/tag/v4.1 (samples, schemas, scripts) | 91929b37c965a5c6591b1cf2eda271ea |
ParlaMint-4.1-Logs.tgz | 23.36 MB | Unknown | Build log files of the corpora | 4c2f2b7d5394eceab9f7dbf5a217b55a |
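
The MD5 values in the file listing can be used to check that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the function names `md5_of_file` and `verify_md5` are hypothetical helpers introduced for illustration, and the sketch assumes the archive has already been downloaded to local disk. The file is read in chunks because several of the archives are multiple gigabytes in size.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1024 * 1024) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large archives never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Union[str, Path], expected: str) -> bool:
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, assuming the Latvian archive was saved to the current directory, `verify_md5("ParlaMint-LV-en.ana.tgz", "5f025014b84cbc77b39c24fa5691146a")` should return `True` for an intact download. MD5 is suitable here only as an integrity check against accidental corruption, not as protection against deliberate tampering.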