Show simple item record

 
dc.contributor.author Goldin, Gili
dc.contributor.author Howell, Nick
dc.contributor.author Ordan, Noam
dc.contributor.author Rabinovich, Ella
dc.contributor.author Wintner, Shuly
dc.date.accessioned 2025-09-01T13:46:30Z
dc.date.available 2025-09-01T13:46:30Z
dc.date.issued 2025-06-07
dc.identifier.uri http://hdl.handle.net/11356/2032
dc.description The ParlaMint-IL corpus is the Israeli contribution to the ParlaMint collection of comparable parliamentary corpora (https://www.clarin.eu/parlamint), which contain transcriptions of parliamentary debates of European countries and autonomous regions. The Knesset Corpus follows the ParlaMint encoding guidelines and is fully aligned with version 4.1 of the ParlaMint corpora (cf. http://hdl.handle.net/11356/1912 and http://hdl.handle.net/11356/1911). The corpus comprises transcriptions of all plenary and committee protocols of the Israeli parliament (the Knesset), spanning from 1994 to 2024. It includes more than 12 million speeches and over 400 million words, making it the largest corpus in the ParlaMint collection. All transcriptions are provided in Hebrew, the primary language of Knesset proceedings. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription. The corpus includes extensive metadata, most importantly on speakers (name, gender, year of birth, MP and minister status, party affiliation), and on their political parties and parliamentary groups (name, coalition/opposition status, and Wikipedia-sourced left-to-right political orientation). The transcriptions are also marked with the subcorpora they belong to, i.e. "reference" (until 2020-01-30), "covid" (from 2020-01-31), and "war" (from 2022-02-24). The corpus TEI/XML schemas are included in the distribution. The corpus is available in two variants, the "plain-text" version (ParlaMint-IL.tgz, corresponding to http://hdl.handle.net/11356/1912) and the linguistically annotated version (ParlaMint-IL.ana.tgz, corresponding to http://hdl.handle.net/11356/1911). 
The ParlaMint-IL.ana linguistic annotation includes tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech tags, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities. The morphological and syntactic annotations were produced by a Trankit-based model (https://github.com/nlp-uoregon/trankit) fine-tuned on Knesset data. Named Entity Recognition was performed using dicta-bert (https://huggingface.co/dicta-il/dictabert), a Hebrew NER model. The "plain-text" version (ParlaMint-IL.tgz) contains the canonical TEI/XML files; derived plain-text files; and derived TSV metadata files for the speeches. The linguistically annotated version (ParlaMint-IL.ana.tgz) contains the canonical TEI/XML files with linguistic annotations; derived CoNLL-U files along with TSV metadata of the speeches; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers such as CWB, noSketch Engine or KonText. The ParlaMint-IL corpus is based on data and annotations described in: Goldin, Gili; Wintner, Shuly; and Rabinovich, Ella. The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings. Language Resources and Evaluation (2025). https://doi.org/10.1007/s10579-025-09833-4
dc.language.iso heb
dc.publisher University of Haifa
dc.relation.isreferencedby https://doi.org/10.1007/s10579-025-09833-4
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus
dc.subject parliamentary debates
dc.subject TEI
dc.subject Parla-CLARIN
dc.subject Israeli Parliament
dc.title Comparable corpus of parliamentary debates ParlaMint-IL 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Gili Goldin gili.sommer@gmail.com University of Haifa
sponsor The Ministry of Science & Technology, Israel 3-17990 The Knesset Corpus nationalFunds
size.info 12434038 utterances
size.info 409434087 words
files.count 2
files.size 36300716106
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=parlamint10_il
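
The description mentions derived CoNLL-U files in the linguistically annotated variant. As a minimal sketch of how such files might be read, the Python snippet below parses the standard ten-column CoNLL-U layout using only the standard library. The function name `read_conllu` and the two-token English sample are illustrative assumptions, not taken from the corpus; the actual files contain Hebrew text and may carry additional comment lines.

```python
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries.

    Hypothetical helper for illustration; comment lines (starting with
    '#') are skipped and a blank line closes the current sentence.
    """
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # Flush a final sentence that is not followed by a blank line.
        yield sentence


# Made-up sample, NOT corpus data.
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\tNumber=Sing\t1\tvocative\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["id"], token["form"], token["upos"], token["deprel"])
```

For a real file, the same function could be applied to an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the extracted CoNLL-U files.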


Files in this item

Name: ParlaMint-IL.tgz
Size: 2.52 GB
Format: Unknown
Description: "Plain text" corpus
MD5: 1e5afadbdfcd8a7ed25ad2a38d37013d

Name: ParlaMint-IL.ana.tgz
Size: 31.28 GB
Format: Unknown
Description: Linguistically annotated corpus
MD5: dbfc0ca511df1686f9fdb1b2c304483d
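
Since both archives are large and are listed with MD5 checksums, it may be worth checking a download before unpacking it. The following Python sketch computes the MD5 digest of a file in chunks and compares it with the values from this record. The helper names `md5_of_file` and `verify_downloads` are illustrative assumptions, and the code assumes the archives were saved under their original file names.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "ParlaMint-IL.tgz": "1e5afadbdfcd8a7ed25ad2a38d37013d",
    "ParlaMint-IL.ana.tgz": "dbfc0ca511df1686f9fdb1b2c304483d",
}


def md5_of_file(path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory=".") -> Dict[str, bool]:
    """Check every listed archive found in `directory`.

    Hypothetical helper: returns {file name: True/False}; archives
    that are not present in the directory are skipped.
    """
    results: Dict[str, bool] = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads(".").items():
        print(name, "OK" if ok else "MD5 MISMATCH")
```

Reading in fixed-size chunks keeps memory use constant regardless of archive size, which matters for the 31.28 GB annotated variant.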
