dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Fišer, Darja |
dc.date.accessioned | 2017-08-31T07:13:40Z |
dc.date.available | 2017-08-31T07:13:40Z |
dc.date.issued | 2017-08-17 |
dc.identifier.uri | http://hdl.handle.net/11356/1138 |
dc.description | Janes-Blog is an annotated corpus of Slovene blogs from the websites rtvslo.si and publishwall.si, covering the period from 2006-10 to 2016-01. The corpus is structured into individual texts, each containing a blog post and the comments on that post, together with their metadata. The texts in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities. To protect privacy, usernames are not included in the metadata, and 'person' as well as 'person derivative' named entities have been removed from the texts. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://doi.org/10.4312/slo2.0.2016.2.67-99 |
dc.relation.isreferencedby | https://nl.ijs.si/janes/viri/avtomatsko-oznaceni-korpusi/#Janes-Blog |
dc.relation.isreferencedby | https://doi.org/10.1007/s10579-018-9425-z |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | blogs |
dc.subject | word normalisation |
dc.subject | named entities |
dc.subject | TEI |
dc.title | Blog post and comment corpus Janes-Blog 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
contact.person | Darja Fišer darja.fiser@ff.uni-lj.si Faculty of Arts, University of Ljubljana |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
size.info | 404281 texts |
size.info | 34534431 tokens |
files.count | 2 |
files.size | 431290698 |
featuredService.kontext | Search|https://www.clarin.si/kontext/first_form?corpname=janes_blog |
featuredService.noske | Search|https://www.clarin.si/ske/#dashboard?corpname=janes_blog |
Files in this item
Total size of all files in item: 411.31 MB
This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name | Janes-Blog.TEI.zip |
Size | 212.66 MB |
Format | application/zip |
Description | Corpus in TEI format |
MD5 | 3283531fc545a3c9a33855bf170354e2 |
- Janes-Blog.TEI
  - janes.blog.back.xml (465 kB)
  - schema
    - tei_janes_doc.html (2 MB)
    - tei_janes.rng (399 kB)
    - tei_janes_schema.xml (2 kB)
    - tei_janes.zip (44 kB)
    - tei_janes.rnc (188 kB)
  - janes.blog.xml (12 kB)
  - janes.blog.body.xml (1 GB)
  - 00README.txt (176 B)
Name | Janes-Blog.vert.zip |
Size | 198.65 MB |
Format | application/zip |
Description | Derived corpus in vertical format |
MD5 | 1c77a5dc284d4093446dd64c14855cb2 |
- Janes-Blog.vert
  - janes_blog.vert (1 GB)
  - janes_blog.regi (5 kB)
  - 00README.txt (176 B)
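
The MD5 values listed for the two archives can be used to check that a download arrived intact. The following is a minimal sketch in Python, assuming the archives were saved locally under the file names given in this record. The names `EXPECTED_MD5`, `md5_of_file` and `verify_download` are illustrative and not part of any CLARIN.SI tooling.

```python
import hashlib
from pathlib import Path

# MD5 digests exactly as listed in this record, keyed by archive file name.
EXPECTED_MD5 = {
    "Janes-Blog.TEI.zip": "3283531fc545a3c9a33855bf170354e2",
    "Janes-Blog.vert.zip": "1c77a5dc284d4093446dd64c14855cb2",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the digest listed for its file name.

    Assumption: the local file keeps the archive name used in this record;
    any other name raises KeyError.
    """
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected
```

A call such as `verify_download("Janes-Blog.TEI.zip")` returns `True` only when the local copy matches the listed digest; a `False` result suggests an incomplete or corrupted download.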