Lemma list of the Beseda Corpus Lemmatisation Lexicon (ELEXIS)

 
CLARIN.SI data & tools
  Authors
Jakopin, Primož
  Item identifier
http://hdl.handle.net/11356/1615
 Project URL
http://bos.zrc-sazu.si/dol_lem1.html
 Demo URL
http://bos.zrc-sazu.si/dol_lem1.html
 Date issued
2020-06-23
 Type
lexicalConceptualResource, text
 Size
134308 entries
 Language(s)
Slovenian
 Description
Lemmatisation dictionary (lexicon of word forms for the Beseda corpus). The Beseda Corpus Lemmatisation Lexicon for Slovenian was generated at the Fran Ramovš Institute of the Slovenian Language, primarily by inflecting open-class words from the Dictionary of Standard Slovenian (Slovar slovenskega knjižnega jezika), augmented with the wordforms, part-of-speech tags and lemmas used during PoS tagging and lemmatisation of the Beseda corpus.

The corpus initially (2000) comprised about 1 million words from the following texts:
  • Ciril Kosmač: Opus - 408,000 words
  • Tomo Križnar: O iskanju ljubezni / On Search for Love or Around the World by Bicycle - 132,000 words
  • George Orwell: 1984 - 91,000 words
  • Plato: Država / Republic - 93,000 words
  • Sveto pismo Nove zaveze / The Bible - New Testament - 150,000 words
  • Gustave Flaubert: Bouvard in Pécuchet / Bouvard and Pécuchet - 86,000 words
  • Newspaper DELO on the Internet (a sample from 6 May 1997 to 17 June 1997) - 52,000 words

After 2000 the following texts were added, rounding the corpus to 1,300,000 words:
  • Marko Uršič: Štirje časi / Four Seasons - 171,000 words
  • National Assembly of the Republic of Slovenia, 3rd term, verbatim session transcripts: 29th regular session, sitting of 1 October 2003 - 47,000 words
  • Newspaper DELO for 3 January 2004 - 75,000 words

The current lexicon was taken from the database of the Institute's online service "Determination of Lemmas and PoS Tags for a List of Words", available at http://bos.zrc-sazu.si/dol_lem1.html. Wordform frequencies were compiled from the latest update of the Beseda corpus (version 138, 1,300,626 words, August 2017) and are therefore approximate. See also: http://hdl.handle.net/11356/1141
 Publisher
ZRC SAZU
 Subject(s)
monolingual, lemma list, pos, part of speech
 Collection(s)
CLARIN.SI ELEXIS
This platform runs under the software developed for the LINDAT/CLARIAH-CZ repository for linguistics, available on GitHub

CLARIN.SI is supported by the Ministry of Education, Science and Sport of the Republic of Slovenia
under the Programme of "Research Infrastructures".