MULTEXT-East Lexica
Version 4
http://nl.ijs.si/ME/V4/

This directory contains the following files:

00README.txt        This file

Word-form lexica in MULTEXT format, with conditions on availability:

wfl-bg.txt          Bulgarian             free
wfl-cs.txt          Czech                 free
wfl-en.txt          English               free
wfl-et.txt          Estonian              free
wfl-fr.txt          French                free
wfl-hu.txt          Hungarian             free
wfl-ro.txt          Romanian              free
wfl-sk.txt          Slovak                free
wfl-sl-rozaj.txt    Resian (sl dialect)   free
wfl-sl.txt          Slovene               free
wfl-uk.txt          Ukrainian             free

Separate submission:

wfl-fa.txt          Farsi/Persian         license for research use only
wfl-mk.txt          Macedonian            license for research use only
wfl-pl.txt          Polish                license for research use only
wfl-ru.txt          Russian               license for research use only
wfl-sr.txt          Serbian               license for research use only

The word-form lexica are in MULTEXT format: each entry is on a separate
line and contains (at least) three fields. The first field of the entry
is the word-form, the second the lemma, and the third the morphosyntactic
description (MSD). Some lexica also make use of further columns, e.g.
Persian gives the transliteration to ASCII of the word-form and lemma.

The files are encoded in UTF-8 with TAB (^I) as field separator and
Unix-type end-of-lines (^J). Sort order is UTF-8. When the word-form or
lemma contains spaces, these are substituted by underscores.

The MSDs are defined in the MULTEXT-East morphosyntactic specifications,
http://nl.ijs.si/ME/V4/msd/
Note that the lexica use the definitions for the particular language,
not the common ones.

Responsibility:

Bulgarian:   L. Dimitrova, L. Sinapova, K. Simov, D. Popov,
             Sv. Manova-Vidinska
             Department of Mathematical Linguistics
             Institute of Mathematics and Informatics
             Bulgarian Academy of Sciences

Czech:       V. Petkevic, J. Klimova and V. Schmiedtova
             Institute of Theoretical and Computational Linguistics
             Faculty of Philosophy
             Charles University

Estonian:    H.J. Kaalep, E. Toomsalu
             Department of General Linguistics
             Tartu University

English:     N. Ide, G. Priest-Dorman
             Dept. of Computer Science
             Vassar College

Farsi:       B. QasemiZadeh and S. Rahimi
             Digital Enterprise Research Institute
             Galway, Ireland

Hungarian:   C. Oravecz and L. Tihanyi
             Research Institute for Linguistics
             Hungarian Academy of Sciences

Macedonian:  Aleksandar Petrovski

Polish:      N. Kotsyba(1), I. Derzhanski(2), and A. Radziszewski(3)
             (1) Institute of Interdisciplinary Studies,
                 Warsaw University
             (2) Institute of Mathematics and Informatics,
                 Bulgarian Academy of Sciences
             (3) Institute of Informatics,
                 Wroclaw University of Technology

Resian:      Han Steenwijk
             Dipartimento di Lingue e Letterature Anglo-Germaniche e Slave
             Padova University

Romanian:    S. Bruda, C. Diaconu, L. Diaconu, and D. Tufis
             Center for Research in Machine Learning,
             Natural Language Processing and Conceptual Modelling
             Romanian Academy of Sciences

Serbian:     C. Krstev
             Computer Science Department
             Faculty of Mathematics
             University of Belgrade

Slovak:      R. Garabik
             L. Stur Institute of Linguistics
             Slovak Academy of Sciences

Slovene:     T. Erjavec
             Dept. of Knowledge Technologies,
             Jozef Stefan Institute

Ukrainian:   N. Kotsyba, I. Shevchenko(2), I. Derzhanski(3),
             and A. Mykulyak(1)
             (1) Institute of Interdisciplinary Studies,
                 Warsaw University
             (2) Ukrainian Lingua-Information Fund,
                 National Academy of Sciences of Ukraine
             (3) Institute of Mathematics and Informatics,
                 Bulgarian Academy of Sciences

The MULTEXT-East partners would like to acknowledge the contributors of
the following lexica, which served as the basis of the MULTEXT-East ones:

Czech lexicon:      dr. Jan Hajic and BYLL Software
Hungarian lexicon:  MorphoLogic
Slovene lexicon:    Amebis d.o.o.
Polish lexicon:     Marcin Woliński: Morfeusz morphological analyzer
                    (http://nlp.ipipan.waw.pl/~wolinski/morfeusz/),
                    cf. Marcin Woliński. Morfeusz, a Practical Tool for
                    the Morphological Analysis of Polish. In: Intelligent
                    Information Processing and Web Mining, IIS:IIPWM'06
                    Proceedings, pp. 503-512, Springer, 2006.
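
Example of reading a lexicon file:

The entry format described earlier (one entry per line, TAB-separated
word-form, lemma and MSD, optional further columns, underscores in place
of spaces) can be read with a few lines of Python. This is only a sketch
and not part of the distribution: the names Entry and read_lexicon are
invented for this example, the sample lines are made up rather than
copied from any lexicon, and turning underscores back into spaces
assumes that no word-form or lemma contains a literal underscore.

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class Entry(NamedTuple):
    """One lexicon line (hypothetical container, not an official type)."""
    word_form: str
    lemma: str
    msd: str
    extra: Tuple[str, ...]  # any further columns, e.g. transliterations


def read_lexicon(stream: TextIO) -> Iterator[Entry]:
    """Yield entries from a TAB-separated word-form lexicon stream."""
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 fields")
        word_form, lemma, msd = fields[:3]
        # Assumption: underscores only ever stand for spaces.
        yield Entry(
            word_form.replace("_", " "),
            lemma.replace("_", " "),
            msd,
            tuple(fields[3:]),
        )


# Made-up sample lines, for illustration only.
sample = "walks\twalk\tVmip3s\nNew_York\tNew_York\tNp\n"
for entry in read_lexicon(io.StringIO(sample)):
    print(entry.word_form, entry.lemma, entry.msd, sep=" | ")
```

For a real file the stream would be opened as UTF-8, for instance
open("wfl-en.txt", encoding="utf-8", newline="\n"). The MSD is kept as
an opaque string here; interpreting it requires the language-particular
specifications.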

================================================================================
Tomaz Erjavec, JSI 2010-05-09