dc.contributor.author Jakopin, Primož
dc.date.accessioned 2017-09-25T08:53:32Z
dc.date.available 2017-09-25T08:53:32Z
dc.date.issued 2017
dc.identifier.uri http://hdl.handle.net/11356/1155
dc.description The Nova beseda Frequency Lexicon was compiled from the Nova beseda text corpus at the Fran Ramovš Institute of Slovenian Language, with hyphen characters unified and leading and trailing non-breaking spaces deleted. Unlike most other Slovenian corpora, Nova beseda texts were pre-processed before inclusion: typos and words with superfluous hyphens originating from faulty line joins were corrected, and passages in foreign (non-Slovenian) languages were marked up and excluded from the lexicon. The corpus contains 318 million tokens, mostly wordforms. It can be searched at http://bos.zrc-sazu.si/a_beseda.html, where wordform search is reached by selecting "word search" in the right-hand "What to do?" column. The corpus structure is also explained on that page. The lexicon is UTF-8 encoded and has 2,251,151 lines, each containing the following two tab-separated data fields: 1. token (Slovenian: pojavnica). The vast majority of tokens are wordforms; numbers and selected multiword units such as URLs, e-mail addresses, place names like New York, car plates and ID numbers are also included. 2. frequency (Slovenian: pogostnost). The sum of all frequencies is 318,170,212.
dc.language.iso slv
dc.publisher ZRC SAZU
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri http://bos.zrc-sazu.si/a_beseda.html
dc.subject word forms
dc.subject lexicon
dc.title Nova Beseda Frequency Lexicon
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType computationalLexicon
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri http://bos.zrc-sazu.si/a_beseda.html
contact.person Andrej Perdih andrej.perdih@zrc-sazu.si ZRC SAZU
sponsor ARRS P6-0038 The Slovenian Language in Synchronic and Diachronic Development nationalFunds
size.info 2251151 entries
files.count 1
files.size 8789043


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name Nova_beseda_Frequency_Lexicon.zip
Size 8.38 MB
Format application/zip
Description Frequency lexicon in .txt format
MD5 4430f39db32be86182553145d0e08a16
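
The MD5 value listed for the archive can be used to check a download for corruption. The following is a minimal sketch, assuming the archive has been saved locally under its listed name; the helper names `md5_of_file` and `verify_download` are illustrative and not part of the record.

```python
import hashlib

# MD5 checksum as listed in this record for Nova_beseda_Frequency_Lexicon.zip.
EXPECTED_MD5 = "4430f39db32be86182553145d0e08a16"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


# Hypothetical usage, assuming the archive sits in the current directory:
# print(verify_download("Nova_beseda_Frequency_Lexicon.zip"))
```

MD5 is suitable here only as an integrity check against accidental corruption, not as protection against deliberate tampering.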
 File Preview
    • README.TXT (1 kB)
    • Nova_beseda_Frequency_Lexicon.txt (29 MB)
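
The description states that each line of the lexicon holds a token and its frequency, separated by a tab, in UTF-8. A reader for that layout might look like the sketch below. It assumes there is no header line and that malformed lines can simply be skipped; neither is confirmed by the record, and the function names `read_lexicon`, `load_lexicon` and `per_million` are illustrative.

```python
import io
from typing import Dict, Iterable


def read_lexicon(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'token<TAB>frequency' lines into a token -> frequency mapping."""
    freqs: Dict[str, int] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Split on the last tab: tokens such as "New York" may contain spaces.
        token, sep, count = line.rpartition("\t")
        if not sep or not count.strip().isdigit():
            continue  # assumption: silently skip empty or malformed lines
        freqs[token] = freqs.get(token, 0) + int(count)
    return freqs


def load_lexicon(path: str) -> Dict[str, int]:
    """Read the lexicon from disk; 'utf-8-sig' also tolerates a leading BOM."""
    with open(path, encoding="utf-8-sig") as handle:
        return read_lexicon(handle)


def per_million(freq: int, total: int) -> float:
    """Convert an absolute frequency to occurrences per million tokens."""
    return freq / total * 1_000_000


# Small in-memory sample with made-up counts, for illustration only.
sample = io.StringIO("in\t100\nje\t80\nNew York\t5\n")
sample_freqs = read_lexicon(sample)
sample_total = sum(sample_freqs.values())
```

For the full file, `load_lexicon("Nova_beseda_Frequency_Lexicon.txt")` should yield a mapping whose values sum to the total given in the description (318,170,212), provided the assumptions above hold.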