Show simple item record

 
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Vide Ogrin, Petra
dc.contributor.author Lenardič, Jakob
dc.contributor.author Mlinar Strgar, Mojca
dc.contributor.author Frankl, Simona
dc.date.accessioned 2022-10-21T08:03:17Z
dc.date.available 2022-10-21T08:03:17Z
dc.date.issued 2022-06-15
dc.identifier.uri http://hdl.handle.net/11356/1588
dc.description This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised, segmented into sentences, lemmatised, and annotated with named entities. It has also been automatically annotated with PoS tags (MULTEXT-East morphosyntactic descriptions) and with Universal Dependencies PoS tags, morphological features, and dependency parses. Crucially for the envisaged use of the corpus, its 2,041 abbreviations have been manually expanded, with each expansion given in the correct inflected form for its context. The corpus is available in the canonical TEI encoding and as derived plain-text and CoNLL-U files. The plain-text file marks up abbreviations and their expansions as [[...]]((...)). There are two CoNLL-U files: one with the text stream containing the abbreviations and one with the text stream containing the expansions. Note that only the file with expansions has syntactic parses. Both CoNLL-U files mark up the expansions / abbreviations and named entities in IOB format in the last column.
dc.language.iso slv
dc.publisher Slovenian Academy of Sciences and Arts
dc.relation info:eu-repo/grantAgreement/EC/H2020/101004825
dc.relation.isreferencedby https://aclanthology.org/2022.emnlp-main.596/
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject manual annotation
dc.subject biographical lexicon
dc.subject abbreviations
dc.subject named entities
dc.subject tokenisation
dc.subject lemmatisation
dc.title Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri https://www.slovenska-biografija.si/
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
contact.person Petra Vide Ogrin petra.vide@zrc-sazu.si Slovenian Academy of Sciences and Arts
sponsor SAZU - Slovenian Biography ownFunds
sponsor European Union EC/H2020/101004825 InTaVia - In/Tangible European Heritage - Visual Analysis, Curation and Communication euFunds info:eu-repo/grantAgreement/EC/H2020/101004825
size.info 51 entries
size.info 655 sentences
size.info 20932 tokens
files.count 3
files.size 1147457
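
The dc.description field above says that the plain-text file marks abbreviations and their expansions as [[...]]((...)). A minimal sketch of reading that markup follows. It assumes that the abbreviation sits inside the square brackets and the expansion inside the parentheses, which the record does not state explicitly. The function names and the sample sentence are hypothetical and are not taken from the corpus.

```python
import re

# ASSUMPTION: the markup is [[abbreviation]]((expansion)). The order of the
# two parts is inferred from the description and not checked against the file.
PAIR = re.compile(r"\[\[(.+?)\]\]\(\((.+?)\)\)")


def extract_pairs(text):
    """Return a list of (abbreviation, expansion) tuples found in text."""
    return PAIR.findall(text)


def expanded_text(text):
    """Return text with each marked-up pair replaced by its expansion."""
    return PAIR.sub(lambda m: m.group(2), text)


def abbreviated_text(text):
    """Return text with each marked-up pair replaced by its abbreviation."""
    return PAIR.sub(lambda m: m.group(1), text)


# Invented sample line, for illustration only (not from the corpus).
sample = "[[Roj.]]((Rojen)) [[l.]]((leta)) 1900 v Ljubljani."

if __name__ == "__main__":
    print(extract_pairs(sample))
    print(expanded_text(sample))
    print(abbreviated_text(sample))
```

The two substitution helpers mirror the two text streams mentioned in the description (one with abbreviations, one with expansions). If the actual file nests or escapes brackets differently, the pattern would need adjusting.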


 Files in this item

 Download all files in item (1.09 MB)
This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: sbl-51abbr.xml.zip
Size: 531.18 KB
Format: application/zip
Description: Corpus in TEI format
MD5: 7d1a0976cce07f015e53b73e42214b11
File preview: sbl-51abbr.xml (6 MB)

Name: sbl-51abbr.txt.zip
Size: 50.99 KB
Format: application/zip
Description: Corpus in derived plain-text format
MD5: c7e7ec9c79fe9f386d2e5607526e55cd
File preview: sbl-51abbr.txt (132 kB)

Name: sbl-51abbr.conllu.zip
Size: 538.39 KB
Format: application/zip
Description: Corpus in derived CoNLL-U format
MD5: be3de721d21ef398e285b191f2ff5b0b
File preview: sbl-51abbr-abbr.conll (1 MB), sbl-51abbr-expan.conll (1 MB)
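
The MD5 values listed for the three archives can be used to check a download. The sketch below is one possible way to do that with the Python standard library. The file names and checksums are copied from this record, while the helper names (`md5_of`, `verify`) and the assumption that the archives sit in a single local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "sbl-51abbr.xml.zip": "7d1a0976cce07f015e53b73e42214b11",
    "sbl-51abbr.txt.zip": "c7e7ec9c79fe9f386d2e5607526e55cd",
    "sbl-51abbr.conllu.zip": "be3de721d21ef398e285b191f2ff5b0b",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each file name to True (match), False (mismatch) or None (missing).

    ASSUMPTION: all archives were downloaded into the same directory.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.exists() else None
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        status = {True: "OK", False: "MISMATCH", None: "missing"}[ok]
        print(f"{name}: {status}")
```

MD5 is used here only because it is the checksum the record publishes. It detects accidental corruption during download, not deliberate tampering.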
