dc.description |
The genre-enriched MaCoCu-Genre corpus collection comprises web corpora that have been automatically annotated with genre labels. The genre labels make it possible to create genre-based subcorpora for linguistic analyses or for various end tasks in natural language processing. The MaCoCu-Genre corpora comprise 67 million texts and 28.5 billion words in 13 European languages: Albanian, Bosnian, Bulgarian, Catalan, Croatian, Greek, Icelandic, Macedonian, Montenegrin, Serbian, Slovenian, Turkish, and Ukrainian (see the README file for sizes of individual corpora).
The MaCoCu-Genre corpora are based on the MaCoCu web corpora for Albanian (http://hdl.handle.net/11356/1804), Catalan (http://hdl.handle.net/11356/1837), Greek (http://hdl.handle.net/11356/1839), Icelandic (http://hdl.handle.net/11356/1805), Turkish (http://hdl.handle.net/11356/1802) and Ukrainian (http://hdl.handle.net/11356/1838), and the CLASSLA-web corpora for Bosnian (http://hdl.handle.net/11356/1927), Bulgarian (http://hdl.handle.net/11356/1928), Croatian (http://hdl.handle.net/11356/1929), Macedonian (http://hdl.handle.net/11356/1932), Montenegrin (http://hdl.handle.net/11356/1930), Serbian (http://hdl.handle.net/11356/1931), and Slovenian (http://hdl.handle.net/11356/1882). The CLASSLA-web corpora are a cleaned-up subset of the MaCoCu web corpora. During the creation of the MaCoCu-Genre corpora, the CLASSLA-web post-processing was also applied to the other MaCoCu corpora: removal of paragraphs in a non-target language and removal of short texts (fewer than 75 words).
The X-GENRE classifier (http://hdl.handle.net/11356/1961) was used for automatic annotation with genre labels. The model assigns each text one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion, and Other. Texts classified with a prediction confidence below 0.8 were assigned the label Mix instead (refer to the provided README file for details on the labels). The classifier is based on the multilingual Transformer-based XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base), and was shown to provide high classification performance when evaluated on 9 of the languages included in the MaCoCu-Genre corpora (macro-F1 scores between 0.80 and 0.95). High prediction accuracy is also expected for the remaining four languages (Bosnian, Bulgarian, Montenegrin, and Serbian), as they are closely related to Croatian and Macedonian, for which the model has demonstrated strong performance.
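The confidence rule described above can be illustrated with a minimal Python sketch. It is not the code used to build the corpora: the function name `assign_genre_label` and its parameters are hypothetical, and only the 0.8 threshold and the label names come from the description.

```python
# Illustrative sketch of the thresholding rule; names are hypothetical.
GENRE_LABELS = {
    "Information/Explanation", "News", "Instruction", "Opinion/Argumentation",
    "Forum", "Prose/Lyrical", "Legal", "Promotion", "Other",
}


def assign_genre_label(predicted_label: str, confidence: float,
                       threshold: float = 0.8) -> str:
    """Return the predicted label, or "Mix" if confidence is below threshold."""
    if predicted_label not in GENRE_LABELS:
        raise ValueError(f"unknown genre label: {predicted_label!r}")
    # Predictions with confidence below the threshold are relabelled as Mix.
    return predicted_label if confidence >= threshold else "Mix"


confident = assign_genre_label("News", 0.93)
uncertain = assign_genre_label("News", 0.79)
```

Under this reading, a prediction with confidence of exactly 0.8 keeps its label, since only values below 0.8 are mapped to Mix.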
The MaCoCu-Genre corpora are available in the JSONL format, where each text is accompanied by the following metadata: id (document id from the original web corpus), title, url, domain, tld (top-level domain, e.g., "com"), and genre.
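As a minimal sketch of genre-based subcorpus creation from the JSONL files, the following Python snippet selects records by their genre field. The helper name `iter_by_genre` and the sample records are hypothetical; only the metadata field names (id, title, url, domain, tld, genre) are taken from the description above.

```python
import json
from typing import Iterable, Iterator


def iter_by_genre(lines: Iterable[str], genre: str) -> Iterator[dict]:
    """Yield parsed JSONL records whose "genre" field equals `genre`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("genre") == genre:
            yield record


# Hypothetical sample lines standing in for a corpus file.
sample_lines = [
    json.dumps({"id": "doc-1", "title": "A", "url": "https://example.com/a",
                "domain": "example.com", "tld": "com", "genre": "News"}),
    json.dumps({"id": "doc-2", "title": "B", "url": "https://example.org/b",
                "domain": "example.org", "tld": "org", "genre": "Legal"}),
    json.dumps({"id": "doc-3", "title": "C", "url": "https://example.com/c",
                "domain": "example.com", "tld": "com", "genre": "Mix"}),
]

news_ids = [record["id"] for record in iter_by_genre(sample_lines, "News")]
```

The same function accepts an open file object in place of `sample_lines`, so a corpus file can be streamed line by line without loading it into memory.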
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains. |