
 
dc.contributor.author Wasserscheidt, Philipp
dc.date.accessioned 2023-03-17T12:58:16Z
dc.date.available 2023-03-17T12:58:16Z
dc.date.issued 2023-01-22
dc.identifier.uri http://hdl.handle.net/11356/1752
dc.description PDRS 1.0 is a web corpus built by crawling the .rs domain. Crawling was done in September and October 2022 with BootCaT. As search terms, approx. 2,800 word forms with a frequency between 5,000 and 500,000 in srWaC were used. The texts are deduplicated, and Cyrillic texts have been transliterated into the Latin alphabet. Linguistic processing was done with the CLASSLA package (https://github.com/clarinsi/classla) for tokenization, lemmatization and morphosyntactic tagging (both MULTEXT-East and Universal Dependencies). In addition, some 80% of the URLs are manually tagged for 10 different types of sources ("area"): media (media outlets with several posts daily), inform (topic-centered sites with infrequent posts, at most 3 per day), company (presentations of companies), state (websites of government bodies at the national, regional and local level), forum (forum posts), portal (topic-centered portals without daily coverage), science (scientific publications), shop (with descriptions of products), database (knowledge bases, dictionaries, databases and similar) and community (NGOs, fan clubs, associations and other). The corpus is distributed in the CoNLL-U format in batches of approx. 2 x 50 million tokens.
dc.language.iso srp
dc.publisher Institute for Serbian Language SANU
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://javnidiskurs.rs
dc.subject web corpus
dc.subject news discourse
dc.title Serbian Web Corpus PDRS 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Philipp Wasserscheidt philipp.wasserscheidt@hu-berlin.de Humboldt-Universität zu Berlin
sponsor Science Fund of the Republic of Serbia 7750183 Public Discourse in the Republic of Serbia - PDRS nationalFunds
size.info 454187 texts
size.info 31401284 sentences
size.info 715419977 tokens
files.count 7
files.size 6933981229
featuredService.kontext search|https://www.clarin.si/kontext/query?corpname=pdrs10
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=pdrs10
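
Because the description states that the corpus is distributed in CoNLL-U, a reader may want a quick way to sanity-check an unpacked file against the sentence and token counts given in size.info. The following is a minimal sketch, assuming the files follow the standard 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, ...). The function names and the two-sentence sample are invented for illustration and are not taken from the corpus. How the "area" tags are stored is not stated in this record, so the sketch does not try to read them.

```python
import io
from collections import Counter
from typing import Iterator, List, TextIO


def iter_sentences(handle: TextIO) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of token rows (lists of columns)."""
    sentence: List[List[str]] = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment/metadata lines are skipped in this sketch.
            continue
        else:
            cols = line.split("\t")
            # Keep plain token rows only; skip multiword ranges ("1-2")
            # and empty nodes ("1.1"), whose IDs are not pure digits.
            if cols[0].isdigit():
                sentence.append(cols)
    if sentence:
        yield sentence


def count_corpus(handle: TextIO):
    """Return (sentence count, token count, UPOS frequency Counter)."""
    n_sent = n_tok = 0
    upos: Counter = Counter()
    for sent in iter_sentences(handle):
        n_sent += 1
        n_tok += len(sent)
        upos.update(row[3] for row in sent)  # column 4 is UPOS
    return n_sent, n_tok, upos


# Hypothetical sample, NOT taken from PDRS 1.0.
SAMPLE = (
    "# sent_id = 1\n"
    "1\tOvo\tovaj\tDET\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t_\t_\t_\t_\n"
    "3\tkorpus\tkorpus\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tHvala\thvala\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
)

if __name__ == "__main__":
    print(count_corpus(io.StringIO(SAMPLE)))
```

For a real file, the same function could be called on `open("PDRS10_1.conllu", encoding="utf-8")`. Reading line by line keeps memory use low, which matters for files of about 3 GB.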


 Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: PDRS10-01.zip
Size: 911.74 MB
Format: application/zip
Description: PDRS 1.0 batch 1
MD5: 7ac6b556f49d2d9765d1e28b133b5ade
Contents:
    • PDRS10_2.conllu (3 GB)
    • PDRS10_1.conllu (3 GB)
Name: PDRS10-02.zip
Size: 921.58 MB
Format: application/zip
Description: PDRS 1.0 batch 2
MD5: 1fa673065294dd63399e01a304f0b382
Contents:
    • PDRS10_4.conllu (3 GB)
    • PDRS10_3.conllu (3 GB)
Name: PDRS10-03.zip
Size: 912.11 MB
Format: application/zip
Description: PDRS 1.0 batch 3
MD5: 0142446132093d241157b1c6770b78fd
Contents:
    • PDRS10_6.conllu (3 GB)
    • PDRS10_5.conllu (3 GB)
Name: PDRS10-04.zip
Size: 927.92 MB
Format: application/zip
Description: PDRS 1.0 batch 4
MD5: eccdaf4f178816e9843fc977de856955
Contents:
    • PDRS10_8.conllu (3 GB)
    • PDRS10_7.conllu (3 GB)
Name: PDRS10-05.zip
Size: 938.35 MB
Format: application/zip
Description: PDRS 1.0 batch 5
MD5: 8c3d170354931b48aa3fac194f868a4b
Contents:
    • PDRS10_10.conllu (3 GB)
    • PDRS10_9.conllu (3 GB)
Name: PDRS10-06.zip
Size: 930.33 MB
Format: application/zip
Description: PDRS 1.0 batch 6
MD5: 5f64109f12d329e6015aa5012ce4646f
Contents:
    • PDRS10_12.conllu (3 GB)
    • PDRS10_11.conllu (3 GB)
Name: PDRS10-07.zip
Size: 1.05 GB
Format: application/zip
Description: PDRS 1.0 batch 7
MD5: fc336177a6d29895514787501a2d650f
Contents:
    • PDRS10_15.conllu (1006 MB)
    • PDRS10_14.conllu (3 GB)
    • PDRS10_13.conllu (3 GB)
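
The MD5 values listed for each archive can be used to check that a download completed without corruption. The following is a rough sketch using only the Python standard library. The checksums are copied from this record, while the helper names (`md5_of`, `verify`) and the assumption that the archives sit in a single local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "PDRS10-01.zip": "7ac6b556f49d2d9765d1e28b133b5ade",
    "PDRS10-02.zip": "1fa673065294dd63399e01a304f0b382",
    "PDRS10-03.zip": "0142446132093d241157b1c6770b78fd",
    "PDRS10-04.zip": "eccdaf4f178816e9843fc977de856955",
    "PDRS10-05.zip": "8c3d170354931b48aa3fac194f868a4b",
    "PDRS10-06.zip": "5f64109f12d329e6015aa5012ce4646f",
    "PDRS10-07.zip": "fc336177a6d29895514787501a2d650f",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each archive found in `directory` to True (match) or False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():  # archives not downloaded yet are skipped
            results[name] = md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify(".").items():
        print(name, "OK" if ok else "MISMATCH")
```

Reading in chunks avoids loading an archive of roughly 1 GB into memory at once. MD5 is used here only because it is the checksum the record provides; it detects accidental corruption, not deliberate tampering.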
