dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Vide Ogrin, Petra |
dc.contributor.author | Lenardič, Jakob |
dc.contributor.author | Mlinar Strgar, Mojca |
dc.contributor.author | Frankl, Simona |
dc.date.accessioned | 2022-10-21T08:03:17Z |
dc.date.available | 2022-10-21T08:03:17Z |
dc.date.issued | 2022-06-15 |
dc.identifier.uri | http://hdl.handle.net/11356/1588 |
dc.description | This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised, segmented into sentences, lemmatised and annotated with named entities. It has also been automatically annotated with PoS tags (MULTEXT-East morphosyntactic descriptions) and with Universal Dependencies PoS tags, morphological features and dependency parses. Crucially for the envisaged use of the corpus, all 2,041 abbreviations in the corpus have been manually expanded, with each expansion given in the inflected form required by its context. The corpus is available in the canonical TEI encoding and as derived plain-text and CoNLL-U files. The plain-text file marks up abbreviations and their expansions with [[...]]((...)). There are two CoNLL-U files: one with the text stream containing abbreviations and one with the text stream containing expansions. Note that only the file with expansions has syntactic parses. Both CoNLL-U files mark up the expansions / abbreviations and named entities in IOB format in the last column. |
dc.language.iso | slv |
dc.publisher | Slovenian Academy of Sciences and Arts |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/101004825 |
dc.relation.isreferencedby | https://aclanthology.org/2022.emnlp-main.596/ |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.subject | manual annotation |
dc.subject | biographical lexicon |
dc.subject | abbreviations |
dc.subject | named entities |
dc.subject | tokenisation |
dc.subject | lemmatisation |
dc.title | Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://www.slovenska-biografija.si/ |
contact.person | Tomaž Erjavec; tomaz.erjavec@ijs.si; Jožef Stefan Institute |
contact.person | Petra Vide Ogrin; petra.vide@zrc-sazu.si; Slovenian Academy of Sciences and Arts |
sponsor | SAZU - Slovenian Biography ownFunds |
sponsor | European Union EC/H2020/101004825 InTaVia - In/Tangible European Heritage - Visual Analysis, Curation and Communication euFunds info:eu-repo/grantAgreement/EC/H2020/101004825 |
size.info | 51 entries |
size.info | 655 sentences |
size.info | 20932 tokens |
files.count | 3 |
files.size | 1147457 |
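The plain-text markup mentioned in the description can be read with a short regular expression. The following is a minimal sketch, not the corpus authors' tooling. It assumes that the double square brackets hold the abbreviation and the double parentheses hold its expansion, which the record does not state explicitly. The function names and the sample sentence are made up for illustration and are not taken from the corpus.

```python
import re

# Assumed layout: [[abbreviation]]((expansion)). The order of the two parts
# is an assumption; check it against the actual plain-text file.
MARKUP = re.compile(r"\[\[(.+?)\]\]\(\((.+?)\)\)")


def extract_pairs(text):
    """Return a list of (abbreviation, expansion) tuples found in text."""
    return MARKUP.findall(text)


def with_expansions(text):
    """Replace every marked-up span with its expansion."""
    return MARKUP.sub(lambda match: match.group(2), text)


def with_abbreviations(text):
    """Replace every marked-up span with its abbreviation."""
    return MARKUP.sub(lambda match: match.group(1), text)


# Hypothetical sample line, invented for this sketch.
SAMPLE = "Rojen [[l.]]((leta)) 1850 v [[Lj.]]((Ljubljani)) kot sin kmeta"
```

Calling `extract_pairs(SAMPLE)` yields the abbreviation and expansion pairs, while `with_expansions(SAMPLE)` and `with_abbreviations(SAMPLE)` give the two plain text streams that correspond to the two CoNLL-U variants described above.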
Files in this item



Name | Size | Format | Description | MD5 |
sbl-51abbr.xml.zip | 531.18 KB | application/zip | Corpus in TEI format | 7d1a0976cce07f015e53b73e42214b11 |
sbl-51abbr.txt.zip | 50.99 KB | application/zip | Corpus in derived plain-text format | c7e7ec9c79fe9f386d2e5607526e55cd |
sbl-51abbr.conllu.zip | 538.39 KB | application/zip | Corpus in derived CoNLL-U format | be3de721d21ef398e285b191f2ff5b0b |
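
The MD5 values listed for the three files can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library. The file names and checksums are copied from this record, while the function names and the assumption that the archives sit in a single local directory are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "sbl-51abbr.xml.zip": "7d1a0976cce07f015e53b73e42214b11",
    "sbl-51abbr.txt.zip": "c7e7ec9c79fe9f386d2e5607526e55cd",
    "sbl-51abbr.conllu.zip": "be3de721d21ef398e285b191f2ff5b0b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not present in the directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results
```

Calling `check_downloads("downloads")`, where `downloads` is a hypothetical local folder, returns one entry per archive, so a `False` value points at the file that should be fetched again.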