Dataset of Slovene medical texts PoVeJMo-VeMo-Med 1.0

Name: Dataset of Slovene medical texts PoVeJMo-VeMo-Med 1.0
License: https://creativecommons.org/licenses/by/4.0/

Malenšek, Miha; Bajec, Marko

dc.contributor.author	Malenšek, Miha
dc.contributor.author	Bajec, Marko
dc.date.accessioned	2024-11-06T16:16:45Z
dc.date.available	2024-11-06T16:16:45Z
dc.date.issued	2024-09-25
dc.identifier.uri	http://hdl.handle.net/11356/1983
dc.description	PoVeJMo-VeMo-Med is a dataset of Slovene medical texts. Most of it consists of instructions for use for various prescription drugs. The texts were extracted from the Slovene Central Drug Database (Centralna baza zdravil; http://www.cbz.si/), with a minority of documents from the National Institute of Public Health (Nacionalni inštitut za javno zdravje; https://nijz.si/). The documents were converted from PDF files to text format. The dataset can be used to fine-tune large language models for the medical domain. Version 1.0 contains two subversions of the corpus: the original (17,701 texts) and a deduplicated version (5,841 texts), from which duplicate texts have been removed. Please note that this dataset was also the basis for the automatic generation of GaMS-Instruct-MED 1.0, the Slovene instruction-following dataset for large language models (http://hdl.handle.net/11356/1982). For more information on how the two are related, please consult the entry for GaMS-Instruct-MED 1.0.
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.publisher	VITASIS, d.o.o.
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/povejmo/en/project/
dc.subject	specialised corpus
dc.subject	medical texts
dc.subject	large language models
dc.title	Dataset of Slovene medical texts PoVeJMo-VeMo-Med 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Miha Malenšek miha.malensek@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	ARIS (Slovenian Research and Innovation Agency) NOO PoVeJMo research project (Adaptive Natural Language Processing with Large Language Models) nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	17701 texts
files.count	1
files.size	171707396
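
The description mentions a deduplicated subversion but does not say how duplicates were identified. As an illustration only, the following minimal Python sketch removes exact duplicates after collapsing whitespace. The matching criterion, the function names, and the sample strings are all assumptions and do not reflect the authors' actual procedure or the dataset's file layout.

```python
import hashlib
from typing import Iterable, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace so that layout differences
    (for example from PDF conversion) do not affect matching."""
    return " ".join(text.split())


def deduplicate(texts: Iterable[str]) -> List[str]:
    """Return the texts with exact duplicates removed (compared after
    whitespace normalization), keeping the first occurrence of each."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the dataset.
    sample = ["Navodilo za uporabo  A", "Navodilo za uporabo A", "Navodilo za uporabo B"]
    print(len(deduplicate(sample)))
```

Hashing the normalized text keeps memory use modest for a corpus of this size, but it only catches exact matches; near-duplicates would need a fuzzier comparison.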