dc.contributor.author Jakopin, Primož
dc.date.accessioned 2017-09-25T10:21:07Z
dc.date.available 2017-09-25T10:21:07Z
dc.date.issued 2017
dc.identifier.uri http://hdl.handle.net/11356/1141
dc.description The Beseda Corpus Lemmatisation Lexicon for the Slovenian language was generated at the Fran Ramovš Institute of Slovenian Language, primarily by inflecting open-class words from the Dictionary of Standard Slovenian (Slovar slovenskega knjižnega jezika), augmented by the wordforms, part-of-speech tags and lemmas used during the PoS tagging and lemmatisation of the Beseda corpus.

The corpus initially (2000) comprised about 1 million words from the following texts:
- Ciril Kosmač Opus - 408,000 words
- Tomo Križnar: O iskanju ljubezni / On Search for Love or Around the World by Bicycle - 132,000 words
- George Orwell: 1984 / 1984 - 91,000 words
- Plato: Država / Republic - 93,000 words
- Sveto pismo Nove zaveze / The Bible - New Testament - 150,000 words
- Gustave Flaubert: Bouvard in Pécuchet / Bouvard and Pécuchet - 86,000 words
- Časopis DELO na internetu (vzorec iz 6.5.1997 - 17.6.1997) / Newspaper DELO on Internet (a sample from 6 May 1997 to 17 June 1997) - 52,000 words

After 2000 the following texts were added, rounding the corpus to 1,300,000 words:
- Marko Uršič: Štirje časi / Four Seasons - 171,000 words
- Državni zbor RS 3. sklica - dobesedni zapisi sej: 29. redna seja, zasedanje 01.10.2003 / National Assembly of the Republic of Slovenia - session transcripts: 29th regular session, meeting of 1 October 2003 - 47,000 words
- Časopis DELO za 3.1.2004 / Newspaper DELO for 3 January 2004 - 75,000 words

The current lexicon was taken from the database of the Institute's online "Determination of Lemmas and PoS Tags for a List of Words" service, available at: http://bos.zrc-sazu.si/dol_lem1.html

Wordform frequencies were compiled from the latest update of the corpus described above (version 138, 1,300,626 words, August 2017) and are therefore approximate.

The lexicon is UTF-8 encoded and has 3,228,128 lines, each with the following 4 tab-separated data fields:
1. wordform
2. lemma (102,346 different lemmas)
3. PoS tag (explained, in Slovenian, at http://bos.zrc-sazu.si/bibliografija/o_oznake.html)
4. approximate corpus frequency; wordform-lemma-PoS entries not found in the corpus have zero frequency
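
The four-field layout described above could be read along the following lines. This is only a minimal sketch: the names `LexiconEntry` and `read_lexicon` are hypothetical, the frequency is assumed to be a plain non-negative integer, lines that do not fit the layout are simply skipped, and the sample rows (including the `TAG1`/`TAG2` placeholders) are made up for illustration and are not taken from the lexicon.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class LexiconEntry(NamedTuple):
    """One lexicon line: wordform, lemma, PoS tag, approximate frequency."""
    wordform: str
    lemma: str
    pos_tag: str
    frequency: int


def read_lexicon(stream: TextIO) -> Iterator[LexiconEntry]:
    """Yield entries from a text stream of tab-separated, 4-field lines."""
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        fields = line.split("\t")
        # Assumption: anything that is not 4 fields ending in an integer
        # (for example a possible header line) is skipped, not reported.
        if len(fields) != 4 or not fields[3].isdecimal():
            continue
        wordform, lemma, pos_tag, freq = fields
        yield LexiconEntry(wordform, lemma, pos_tag, int(freq))


# Made-up sample rows; TAG1 and TAG2 are placeholders, not real PoS tags.
sample = io.StringIO("hiše\thiša\tTAG1\t12\nhiši\thiša\tTAG2\t0\n")
entries = list(read_lexicon(sample))
```

With the unpacked file, the same function would presumably be called on `open("Beseda_Corpus_Lemmatisation_Lexicon.txt", encoding="utf-8")`; iterating lazily avoids holding all 3.2 million lines in memory at once.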
dc.language.iso slv
dc.publisher ZRC SAZU
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri http://bos.zrc-sazu.si/dol_lem1.html
dc.subject morphology
dc.subject inflection
dc.subject word forms
dc.subject lemmatisation
dc.title Beseda Corpus Lemmatisation Lexicon
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType computationalLexicon
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri http://bos.zrc-sazu.si/dol_lem1.html
contact.person Andrej Perdih andrej.perdih@zrc-sazu.si ZRC SAZU
sponsor ARRS P6-0038 The Slovenian Language in Synchronic and Diachronic Development nationalFunds
size.info 3228127 entries
files.count 1
files.size 11076858


Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name Beseda_Corpus_Lemmatisation_Lexicon.zip
Size 10.56 MB
Format application/zip
Description Lexicon in tabular format
MD5 4a3cb9ae3d54c7b3184da91fcac8add1
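
A downloaded copy of the archive can be checked against the MD5 value listed above. The following is a sketch only, using Python's standard `hashlib`; the helper names `md5_of_stream` and `matches_record` are hypothetical, and the local path passed to `matches_record` is whatever location the file was saved to.

```python
import hashlib
from typing import BinaryIO

# MD5 value copied from this record for Beseda_Corpus_Lemmatisation_Lexicon.zip.
EXPECTED_MD5 = "4a3cb9ae3d54c7b3184da91fcac8add1"


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    # Read fixed-size chunks until read() returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str) -> bool:
    """Hypothetical helper: compare a local file's MD5 with EXPECTED_MD5."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == EXPECTED_MD5
```

A mismatch would most likely indicate an incomplete or corrupted download. MD5 is adequate for that kind of integrity check, though it is not a safeguard against deliberate tampering.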
File preview
    • README.TXT (2 kB)
    • Beseda_Corpus_Lemmatisation_Lexicon.txt (92 MB)
