
 
dc.contributor.author Wasserscheidt, Philipp
dc.date.accessioned 2023-03-17T12:58:16Z
dc.date.available 2023-03-17T12:58:16Z
dc.date.issued 2023-01-22
dc.identifier.uri http://hdl.handle.net/11356/1752
dc.description PDRS 1.0 is a web corpus built by crawling the .rs domain. Crawling was done in September and October 2022 with BootCaT. Approx. 2,800 word forms with a frequency between 5,000 and 500,000 in srWaC were used as search terms. The texts are deduplicated, and Cyrillic texts have been transliterated into the Latin alphabet. Linguistic processing was done with the CLASSLA package (https://github.com/clarinsi/classla) for tokenization, lemmatization and morpho-syntactic tagging (both MULTEXT-East and Universal Dependencies). In addition, some 80% of the URLs are manually tagged for 10 different types of sources ("area"): media (media outlets with several posts daily), inform (topic-centered sites with infrequent posts, at most 3 per day), company (presentations of companies), state (websites of government bodies at national, regional and local level), forum (forum posts), portal (topic-centered portals without daily coverage), science (scientific publications), shop (with descriptions of products), database (knowledge bases, dictionaries, databases and similar) and community (NGOs, fan clubs, associations and other). The corpus is distributed in the CoNLL-U format in batches of approx. 2 x 50 million tokens.
dc.language.iso srp
dc.publisher Institute for Serbian Language SANU
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://javnidiskurs.rs
dc.subject web corpus
dc.subject news discourse
dc.title Serbian Web Corpus PDRS 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Philipp Wasserscheidt philipp.wasserscheidt@hu-berlin.de Humboldt-Universität zu Berlin
sponsor Science Fund of the Republic of Serbia 7750183 Public Discourse in the Republic of Serbia - PDRS nationalFunds
size.info 454187 texts
size.info 31401284 sentences
size.info 715419977 tokens
files.count 7
files.size 6933981229
featuredService.kontext search|https://www.clarin.si/kontext/query?corpname=pdrs10
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=pdrs10&struct_attr_stats=1
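
Since the record states that the corpus is distributed in the CoNLL-U format, the following is a minimal sketch of how such a file could be read sentence by sentence with the Python standard library. It assumes the standard ten tab-separated CoNLL-U columns; the function name iter_sentences and the three-word sample sentence (including its lemmas and tags) are made up for illustration and are not taken from PDRS 1.0.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def iter_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Skip malformed lines instead of failing on them.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # Handle a file that does not end with a blank line.
        yield sentence


# Invented sample, NOT corpus data: "Ovo je primer." ("This is an example.")
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = Ovo je primer.\n"
    "1\tOvo\tovaj\tDET\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tprimer\tprimer\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(iter_sentences(SAMPLE.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)
```

Because the function only needs an iterable of text lines, it could also be fed from an open file handle, which avoids loading a multi-gigabyte .conllu file into memory at once.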


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name: PDRS10-01.zip
Size: 911.74 MB
Format: application/zip
Description: PDRS 1.0 batch 1
MD5: 7ac6b556f49d2d9765d1e28b133b5ade
Contents:
    • PDRS10_2.conllu (3 GB)
    • PDRS10_1.conllu (3 GB)

Name: PDRS10-02.zip
Size: 921.58 MB
Format: application/zip
Description: PDRS 1.0 batch 2
MD5: 1fa673065294dd63399e01a304f0b382
Contents:
    • PDRS10_4.conllu (3 GB)
    • PDRS10_3.conllu (3 GB)

Name: PDRS10-03.zip
Size: 912.11 MB
Format: application/zip
Description: PDRS 1.0 batch 3
MD5: 0142446132093d241157b1c6770b78fd
Contents:
    • PDRS10_6.conllu (3 GB)
    • PDRS10_5.conllu (3 GB)

Name: PDRS10-04.zip
Size: 927.92 MB
Format: application/zip
Description: PDRS 1.0 batch 4
MD5: eccdaf4f178816e9843fc977de856955
Contents:
    • PDRS10_8.conllu (3 GB)
    • PDRS10_7.conllu (3 GB)

Name: PDRS10-05.zip
Size: 938.35 MB
Format: application/zip
Description: PDRS 1.0 batch 5
MD5: 8c3d170354931b48aa3fac194f868a4b
Contents:
    • PDRS10_10.conllu (3 GB)
    • PDRS10_9.conllu (3 GB)

Name: PDRS10-06.zip
Size: 930.33 MB
Format: application/zip
Description: PDRS 1.0 batch 6
MD5: 5f64109f12d329e6015aa5012ce4646f
Contents:
    • PDRS10_12.conllu (3 GB)
    • PDRS10_11.conllu (3 GB)

Name: PDRS10-07.zip
Size: 1.05 GB
Format: application/zip
Description: PDRS 1.0 batch 7
MD5: fc336177a6d29895514787501a2d650f
Contents:
    • PDRS10_15.conllu (1006 MB)
    • PDRS10_14.conllu (3 GB)
    • PDRS10_13.conllu (3 GB)
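
The MD5 checksums listed above can be used to check that a downloaded archive arrived intact. The following is a minimal sketch using Python's hashlib; the checksum values are copied from this record, while the helper names md5_of_file and verify_download and the assumption that files keep their original names are illustrative only.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in this record, keyed by archive name.
EXPECTED_MD5 = {
    "PDRS10-01.zip": "7ac6b556f49d2d9765d1e28b133b5ade",
    "PDRS10-02.zip": "1fa673065294dd63399e01a304f0b382",
    "PDRS10-03.zip": "0142446132093d241157b1c6770b78fd",
    "PDRS10-04.zip": "eccdaf4f178816e9843fc977de856955",
    "PDRS10-05.zip": "8c3d170354931b48aa3fac194f868a4b",
    "PDRS10-06.zip": "5f64109f12d329e6015aa5012ce4646f",
    "PDRS10-07.zip": "fc336177a6d29895514787501a2d650f",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for ~1 GB archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str) -> bool:
    """Compare a file's MD5 with the listed value (hypothetical helper).

    Assumes the file still has its original archive name; raises KeyError
    for names that are not in EXPECTED_MD5.
    """
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected
```

A call such as verify_download("PDRS10-01.zip") would return True only if the local file's digest matches the listed checksum. MD5 is adequate for detecting transfer corruption, but it is not a safeguard against deliberate tampering.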
