Corpus-grounded evaluation dataset for grammatical question answering GramQA 1.0

License: https://creativecommons.org/licenses/by/4.0/

Terčon, Luka; Dobrovoljc, Kaja; Klemen, Matej; Arčon, Tjaša; Robnik-Šikonja, Marko


dc.contributor.author	Terčon, Luka
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Klemen, Matej
dc.contributor.author	Arčon, Tjaša
dc.contributor.author	Robnik-Šikonja, Marko
dc.date.accessioned	2026-02-25T18:16:54Z
dc.date.available	2026-02-25T18:16:54Z
dc.date.issued	2026-02-24
dc.identifier.uri	http://hdl.handle.net/11356/2086
dc.description	The Corpus-grounded evaluation dataset for grammatical question answering (GramQA) consists of 13 grammatical questions inspired by WALS, the World Atlas of Language Structures (https://wals.info/). The questions focus on word order variation across different syntactic constructions (e.g., the typical order of subject, object, and verb in a language). For each question, the dataset provides ground truth values for 179 languages based on Universal Dependencies (https://universaldependencies.org/) corpora. These values can be used for cross-linguistic word order comparison and for evaluating model predictions against corpus evidence. The dataset was originally developed as an evaluation benchmark for UD-Agent, an agentic LLM-based grammatical analysis system described in a separate paper (https://arxiv.org/abs/2512.00214), but is released as a standalone resource for broader reuse.

For every question-language pair, the dataset includes (i) the dominant word order pattern (the most frequently attested value in the corpus) and (ii) the full distribution of all attested word order patterns with their relative frequencies. The ground truth values were obtained automatically with a series of Python scripts developed by the authors, which implement rule-based extraction procedures over the test portions of the UD treebanks (v2.16). The scripts are available in a separate GitHub repository (https://github.com/matejklemen/ud_llm/).

Files included:
- udagent_eval_data.jsonl: A JSON Lines file containing 1,899 entries, one per feature-language pair. Only the feature-language pairs for which the Python scripts returned at least one valid result are included. Each entry consists of the WALS feature ID, language information, and the corresponding ground truth value derived from UD data. The ground truth covers both the dominant word order pattern (dubbed the "short answer") and the distribution across all possible orders for the associated feature.
- udagent_eval_metadata.json: A JSON file with information about the included languages, the UD treebanks used to obtain the ground truth values for each language, the particular question associated with each WALS feature, and the set of possible values for each feature.
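The entry structure described above can be illustrated with a minimal Python sketch that parses JSON Lines records, reads the dominant pattern off a distribution, and scores predictions against the short answers. The key names (`feature_id`, `language`, `short_answer`, `distribution`) and the two sample records are assumptions made for illustration; they are not taken from the released files, so the actual keys in udagent_eval_data.jsonl should be checked before reuse.

```python
import json

# Hypothetical key names: verify against the released JSONL file.
FEATURE_KEY = "feature_id"
LANGUAGE_KEY = "language"
ANSWER_KEY = "short_answer"
DISTRIBUTION_KEY = "distribution"


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def dominant_order(distribution):
    """Return the pattern with the highest relative frequency."""
    return max(distribution, key=distribution.get)


def accuracy(entries, predictions):
    """Share of feature-language pairs whose prediction equals the short answer.

    `predictions` maps (feature, language) to a predicted value;
    pairs without a prediction count as incorrect.
    """
    if not entries:
        return 0.0
    correct = sum(
        1
        for entry in entries
        if predictions.get((entry[FEATURE_KEY], entry[LANGUAGE_KEY]))
        == entry[ANSWER_KEY]
    )
    return correct / len(entries)


# Made-up sample records for illustration only (not real dataset values).
SAMPLE = """
{"feature_id": "81A", "language": "Slovenian", "short_answer": "SVO", "distribution": {"SVO": 0.7, "SOV": 0.2, "OVS": 0.1}}
{"feature_id": "81A", "language": "Japanese", "short_answer": "SOV", "distribution": {"SOV": 0.9, "OSV": 0.1}}
"""

entries = parse_jsonl(SAMPLE)
predictions = {("81A", "Slovenian"): "SVO", ("81A", "Japanese"): "SVO"}
score = accuracy(entries, predictions)
```

To apply the sketch to the released data, the sample string would be replaced by the contents of udagent_eval_data.jsonl, read with an explicit UTF-8 encoding, after adjusting the key constants to the file's actual schema.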
dc.language.iso	mul
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation	info:eu-repo/grantAgreement/EC/HE/101186647
dc.relation.isreferencedby	https://doi.org/10.48550/arXiv.2512.00214
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.subject	question answering
dc.subject	evaluation dataset
dc.subject	linguistics
dc.subject	agentic AI
dc.subject	grammatical analysis
dc.subject	universal dependencies
dc.title	Corpus-grounded evaluation dataset for grammatical question answering GramQA 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Luka Terčon luka.tercon@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	European Union EC/HE/101186647 AI4DH - Centre of Excellence in Artificial Intelligence for Digital Humanities euFunds info:eu-repo/grantAgreement/EC/HE/101186647
size.info	1899 entries
files.count	1
files.size	38958