
 
dc.contributor.author Kuzman, Taja
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2024-10-07T12:55:35Z
dc.date.available 2024-10-07T12:55:35Z
dc.date.issued 2024-10-07
dc.identifier.uri http://hdl.handle.net/11356/1969
dc.description The genre-enriched MaCoCu-Genre corpus collection comprises web corpora that have been automatically annotated with genre labels. The corpora support genre-based creation of subcorpora for linguistic analyses or for various end tasks in natural language processing. The MaCoCu-Genre corpora comprise 67 million texts and 28.5 billion words in 13 European languages: Albanian, Bosnian, Bulgarian, Catalan, Croatian, Greek, Icelandic, Macedonian, Montenegrin, Serbian, Slovenian, Turkish, and Ukrainian (see the README file for the sizes of the individual corpora).
The MaCoCu-Genre corpora are based on the MaCoCu web corpora for Albanian (http://hdl.handle.net/11356/1804), Catalan (http://hdl.handle.net/11356/1837), Greek (http://hdl.handle.net/11356/1839), Icelandic (http://hdl.handle.net/11356/1805), Turkish (http://hdl.handle.net/11356/1802) and Ukrainian (http://hdl.handle.net/11356/1838), and the CLASSLA-web corpora for Bosnian (http://hdl.handle.net/11356/1927), Bulgarian (http://hdl.handle.net/11356/1928), Croatian (http://hdl.handle.net/11356/1929), Macedonian (http://hdl.handle.net/11356/1932), Montenegrin (http://hdl.handle.net/11356/1930), Serbian (http://hdl.handle.net/11356/1931), and Slovenian (http://hdl.handle.net/11356/1882). The CLASSLA-web corpora are a cleaned-up subset of the MaCoCu web corpora. During the creation of the MaCoCu-Genre corpora, the CLASSLA-web post-processing was applied to the other MaCoCu corpora as well: removal of paragraphs in a non-target language and removal of short texts (fewer than 75 words).
The X-GENRE classifier (http://hdl.handle.net/11356/1961) was used for automatic annotation with genre labels. The model classifies each text into one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion, and Other. Texts classified with a prediction confidence below 0.8 were assigned the label Mix (refer to the provided README file for details on the labels). The classifier is based on the multilingual XLM-RoBERTa Transformer model (https://huggingface.co/FacebookAI/xlm-roberta-base) and was shown to provide high classification performance when evaluated on 9 of the languages included in the MaCoCu-Genre corpora (macro-F1 scores between 0.80 and 0.95). High prediction accuracy is also expected for the remaining four languages (Bosnian, Bulgarian, Montenegrin, and Serbian), as they are closely related to Croatian and Macedonian, for which the model has demonstrated strong performance.
The MaCoCu-Genre corpora are available in the JSONL format, where each text is accompanied by the following metadata: id (document id from the original web corpus), title, url, domain, tld (top-level domain, e.g., "com"), and genre.
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material. (4) Write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
dc.language.iso slv
dc.language.iso hrv
dc.language.iso mkd
dc.language.iso sqi
dc.language.iso ell
dc.language.iso cat
dc.language.iso isl
dc.language.iso ukr
dc.language.iso tur
dc.language.iso bos
dc.language.iso bul
dc.language.iso cnr
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.rights CC0-No Rights Reserved
dc.rights.uri https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label PUB
dc.source.uri https://macocu.eu/
dc.subject web corpus
dc.subject automatic genre identification
dc.subject genre classification
dc.title Genre-enriched web corpora MaCoCu-Genre
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 67413414 texts
size.info 28481836587 words
files.count 14
files.size 108907951725
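
The description above states that the corpora are distributed as gzipped JSONL with the metadata fields id, title, url, domain, tld, and genre. A minimal Python sketch of how such a file could be streamed and narrowed to one genre follows. The helper names (iter_records, filter_by_genre, genre_counts) are hypothetical, and the sketch assumes one JSON object per line, UTF-8 encoding, and genre values spelled as in the description (e.g. "News", "Mix"); the README remains the authority on the exact field layout.

```python
import gzip
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    # Stream line by line so multi-gigabyte corpora never need to fit in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def filter_by_genre(records: Iterable[dict], genre: str) -> Iterator[dict]:
    """Keep only the records whose 'genre' field equals the requested label."""
    # Assumption: the label is stored as a plain string under the key "genre".
    return (record for record in records if record.get("genre") == genre)


def genre_counts(records: Iterable[dict]) -> Counter:
    """Count how many records carry each genre label."""
    return Counter(record.get("genre", "<missing>") for record in records)
```

For example, filter_by_genre(iter_records("MaCoCu-Genre.sl.jsonl.gz"), "News") would lazily yield the Slovenian texts labelled News, provided the file has been downloaded to the working directory.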


 Files in this item

This item is Publicly Available and licensed under: CC0-No Rights Reserved
Name README.md
Size 9.01 KB
Format Unknown
Description README file for the MaCoCu-Genre corpora
MD5 da7ec8e525f2d18fa6e03fa0949f66ed

Name MaCoCu-Genre.bg.jsonl.gz
Size 13.51 GB
Format application/gzip
Description Bulgarian corpus
MD5 b0dbf58e812fbbeecd93f602cf4adde6

Name MaCoCu-Genre.bs.jsonl.gz
Size 1.91 GB
Format application/gzip
Description Bosnian corpus
MD5 025e2adcb84d2d3853b135c2dc758ead

Name MaCoCu-Genre.ca.jsonl.gz
Size 3.98 GB
Format application/gzip
Description Catalan corpus
MD5 7881287c8abe16879c760598f625ad48

Name MaCoCu-Genre.cnr.jsonl.gz
Size 431.93 MB
Format application/gzip
Description Montenegrin corpus
MD5 1b3219ba3d04d16de59701107fe9dbf3

Name MaCoCu-Genre.el.jsonl.gz
Size 17.71 GB
Format application/gzip
Description Greek corpus
MD5 d9f72dd228930670de623d2476d8fb79

Name MaCoCu-Genre.hr.jsonl.gz
Size 6.15 GB
Format application/gzip
Description Croatian corpus
MD5 ca947f2fb0e85051727f9e91bccfb0a6

Name MaCoCu-Genre.is.jsonl.gz
Size 2.29 GB
Format application/gzip
Description Icelandic corpus
MD5 7cb6ab9e0c7e44c92b84b8e1947f81c4

Name MaCoCu-Genre.mk.jsonl.gz
Size 1.94 GB
Format application/gzip
Description Macedonian corpus
MD5 1405c45d86ce1b44afc00c0cfe04b931

Name MaCoCu-Genre.sl.jsonl.gz
Size 4.78 GB
Format application/gzip
Description Slovenian corpus
MD5 d916830d1f3bbdbe42f0cf041cf8b220

Name MaCoCu-Genre.sq.jsonl.gz
Size 1.49 GB
Format application/gzip
Description Albanian corpus
MD5 47b1f9265d8696475c133a7bf1769f55

Name MaCoCu-Genre.sr.jsonl.gz
Size 6.48 GB
Format application/gzip
Description Serbian corpus
MD5 1244960d52b23e1c741fc76a18f01847

Name MaCoCu-Genre.tr.jsonl.gz
Size 13.46 GB
Format application/gzip
Description Turkish corpus
MD5 abe376c21256798ded30e54770666aa0

Name MaCoCu-Genre.uk.jsonl.gz
Size 27.31 GB
Format application/gzip
Description Ukrainian corpus
MD5 5c7a1eca339b18be270993ec294b0844
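
Each file in the listing comes with an MD5 checksum, which can be used to confirm that a multi-gigabyte download arrived intact. The following is a minimal sketch of such a check using Python's standard hashlib module; the function names md5_of_file and verify_md5 are hypothetical, and the sketch assumes the file has already been saved locally under the name shown in the listing.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for files of tens of gigabytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with the value published in the file listing."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, verify_md5("MaCoCu-Genre.cnr.jsonl.gz", "1b3219ba3d04d16de59701107fe9dbf3") would return True only if the local copy of the Montenegrin corpus matches the published checksum. MD5 is suitable here for detecting transfer corruption, not as a guarantee against deliberate tampering.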
