Comparable corpus of parliamentary debates ParlaMint-IL 1.0

Name: Comparable corpus of parliamentary debates ParlaMint-IL 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Goldin, Gili; Howell, Nick; Ordan, Noam; Rabinovich, Ella; Wintner, Shuly

dc.contributor.author	Goldin, Gili
dc.contributor.author	Howell, Nick
dc.contributor.author	Ordan, Noam
dc.contributor.author	Rabinovich, Ella
dc.contributor.author	Wintner, Shuly
dc.date.accessioned	2025-09-01T13:46:30Z
dc.date.available	2025-09-01T13:46:30Z
dc.date.issued	2025-06-07
dc.identifier.uri	http://hdl.handle.net/11356/2032
dc.description	The ParlaMint-IL corpus is the Israeli contribution to the ParlaMint collection of comparable parliamentary corpora (https://www.clarin.eu/parlamint), which contain transcriptions of parliamentary debates of European countries and autonomous regions. The Knesset Corpus follows the ParlaMint encoding guidelines and is fully aligned with version 4.1 of the ParlaMint corpora (cf. http://hdl.handle.net/11356/1912 and http://hdl.handle.net/11356/1911). The corpus comprises transcriptions of all plenary and committee protocols of the Israeli parliament (the Knesset), spanning from 1994 to 2024. It includes more than 12 million speeches and over 400 million words, making it the largest corpus in the ParlaMint collection. All transcriptions are provided in Hebrew, the primary language of Knesset proceedings. The transcriptions are organized by day, with information on the term, session and meeting, and contain speeches marked with the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription. The corpus includes extensive metadata, most importantly on speakers (name, gender, year of birth, MP and minister status, party affiliation), and on their political parties and parliamentary groups (name, coalition/opposition status, and Wikipedia-sourced left-to-right political orientation). The transcriptions are also marked with the subcorpora they belong to, i.e. "reference" (until 2020-01-30), "covid" (from 2020-01-31), and "war" (from 2022-02-24). The corpus TEI/XML schemas are included in the distribution. The corpus is available in two variants, the "plain-text" version (ParlaMint-IL.tgz, corresponding to http://hdl.handle.net/11356/1912) and the linguistically annotated version (ParlaMint-IL.ana.tgz, corresponding to http://hdl.handle.net/11356/1911).
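The subcorpus boundaries given above can be illustrated with a minimal Python sketch. The function name subcorpora_for is hypothetical and not part of the distribution. The description gives only start dates for "covid" and "war", so the sketch assumes both labels may apply from 2022-02-24 onward; the TEI files themselves remain the authoritative source for how each transcription is marked.

```python
from datetime import date
from typing import List

# Boundary dates as stated in the corpus description.
COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpora_for(day: date) -> List[str]:
    """Return the subcorpus labels for a sitting date (hypothetical helper).

    Assumption: "covid" and "war" overlap from 2022-02-24 onward, because
    the description lists only start dates for these two subcorpora.
    """
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    if day >= WAR_START:
        labels.append("war")
    return labels


if __name__ == "__main__":
    for d in (date(2020, 1, 30), date(2020, 1, 31), date(2022, 2, 24)):
        print(d.isoformat(), subcorpora_for(d))
```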
The ParlaMint-IL.ana linguistic annotation includes tokenization; sentence segmentation; lemmatization; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities. The morphological and syntactic annotations were produced with a Trankit-based model (https://github.com/nlp-uoregon/trankit) fine-tuned on Knesset data. Named Entity Recognition was performed using dicta-bert (https://huggingface.co/dicta-il/dictabert), a Hebrew NER model. The "plain-text" version (ParlaMint-IL.tgz) contains the canonical TEI/XML files; derived plain-text files; and derived TSV metadata files for the speeches. The linguistically annotated version (ParlaMint-IL.ana.tgz) contains the canonical TEI/XML files with linguistic annotations; derived CoNLL-U files along with TSV metadata of the speeches; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. The ParlaMint-IL corpus is based on data and annotations described in: Goldin, Gili; Wintner, Shuly; and Rabinovich, Ella. The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings. Language Resources and Evaluation (2025). https://doi.org/10.1007/s10579-025-09833-4
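As a rough illustration of how the derived CoNLL-U files might be consumed, the sketch below reads the standard ten tab-separated CoNLL-U columns into dictionaries, one list per sentence. The function name read_conllu and the two-token sample are hypothetical placeholders, not taken from the corpus; the actual files contain Hebrew text and may carry additional comment lines, which this sketch simply skips.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (e.g. sent_id, text) are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Hypothetical sample, invented for illustration only.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=No\n"
    "\n"
)

if __name__ == "__main__":
    for tokens in read_conllu(SAMPLE.splitlines()):
        print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
```

For the real files, the same function could be given an open file handle (opened with encoding="utf-8") instead of the sample lines.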
dc.language.iso	heb
dc.publisher	University of Haifa
dc.relation.isreferencedby	https://doi.org/10.1007/s10579-025-09833-4
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus
dc.subject	parliamentary debates
dc.subject	TEI
dc.subject	Parla-CLARIN
dc.subject	Israeli Parliament
dc.title	Comparable corpus of parliamentary debates ParlaMint-IL 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Gili Goldin gili.sommer@gmail.com University of Haifa
sponsor	The Ministry of Science & Technology, Israel 3-17990 The Knesset Corpus nationalFunds
size.info	12434038 utterances
size.info	409434087 words
files.count	2
files.size	36300716106
featuredService.noske	search|https://www.clarin.si/ske/#dashboard?corpname=parlamint10_il