dc.contributor.author Malenšek, Miha
dc.contributor.author Bajec, Marko
dc.date.accessioned 2024-11-06T16:16:45Z
dc.date.available 2024-11-06T16:16:45Z
dc.date.issued 2024-09-25
dc.identifier.uri http://hdl.handle.net/11356/1983
dc.description PoVeJMo-VeMo-Med is a dataset of Slovene medical texts. The bulk of it consists of instructions for use of different prescribed drugs. The texts were extracted from the Slovene Central Drug Database (Centralna baza zdravil; http://www.cbz.si/), with a smaller share of documents from the National Institute of Public Health (Nacionalni inštitut za javno zdravje; https://nijz.si/). The documents were converted from PDF files to plain text. The dataset can be used to fine-tune large language models for the medical domain. Version 1.0 contains two variants of the corpus: the original (17,701 texts) and a deduplicated variant (5,841 texts) from which duplicate texts have been removed. This dataset was also the basis for the automatic generation of GaMS-Instruct-MED 1.0 (http://hdl.handle.net/11356/1982), a Slovene instruction-following dataset for large language models. For more information on how the two are related, please consult the entry for GaMS-Instruct-MED 1.0.
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.publisher VITASIS, d.o.o.
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/povejmo/en/project/
dc.subject specialised corpus
dc.subject medical texts
dc.subject large language models
dc.title Dataset of Slovene medical texts PoVeJMo-VeMo-Med 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Miha Malenšek miha.malensek@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor ARIS (Slovenian Research and Innovation Agency) NOO PoVeJMo research project (Adaptive Natural Language Processing with Large Language Models) nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 17701 texts
files.count 1
files.size 171707396


 Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name PoVeJMo-VeMo-Med_1.0.zip
Size 163.75 MB
Format application/zip
Description PoVeJMo-VeMo-Med 1.0 (JSON)
MD5 21afb2a3c72b220c325fe9ff388ccb16
 File Preview
  • PoVeJMo-VeMo-Med_1.0
    • PoVeJMo-VeMo-Med_1.0.json (514 MB)
    • PoVeJMo-VeMo-Med_1.0-deduplicated.json (185 MB)
    • 00README.txt (2 kB)
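
The record lists a byte size (files.size 171707396) and an MD5 checksum for the archive, so a downloaded copy can be checked against both before unpacking. The following is a minimal sketch of such a check, not part of the dataset itself; the function names md5_of_file and verify_download are hypothetical, and the local path passed to them is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 and files.size fields).
EXPECTED_MD5 = "21afb2a3c72b220c325fe9ff388ccb16"
EXPECTED_SIZE = 171707396  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the MD5 digest match."""
    path = Path(path)
    size_ok = path.stat().st_size == expected_size
    md5_ok = md5_of_file(path) == expected_md5.lower()
    return size_ok and md5_ok
```

Calling verify_download("PoVeJMo-VeMo-Med_1.0.zip") on the downloaded archive should return True if the transfer completed without corruption. MD5 is suitable here only as an integrity check against accidental damage, not as protection against deliberate tampering.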
