dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.date.accessioned | 2017-09-05T14:23:23Z |
dc.date.available | 2017-09-05T14:23:23Z |
dc.date.issued | 2017-09-05 |
dc.identifier.uri | http://hdl.handle.net/11356/1142 |
dc.description | Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approx. 9,000 users that tweet mostly in Slovene. The corpus is structured into individual tweets, together with their metadata. The tweets in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities. Due to Twitter's terms of service, the corpus is distributed in an encoded version. The included tweetpub program (also available and documented at https://github.com/clarinsi/tweetpub) should be used to decode it: the program fetches the original tweets and applies a diff operation to the distributed corpus. Note that the retrieved corpus can have fewer tweets than the distributed version if some have been removed from Twitter by their authors in the meantime. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://revije.ff.uni-lj.si/slovenscina2/article/view/7003 |
dc.relation.isreferencedby | https://nl.ijs.si/janes/viri/avtomatsko-oznaceni-korpusi/#Janes-Tweet |
dc.relation.isreferencedby | https://doi.org/10.1007/s10579-018-9425-z |
dc.rights | Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | |
dc.subject | word normalisation |
dc.subject | named entities |
dc.title | Twitter corpus Janes-Tweet 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
contact.person | Darja Fišer darja.fiser@ff.uni-lj.si Faculty of Arts, University of Ljubljana |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
size.info | 9877648 texts |
size.info | 139299220 tokens |
files.count | 2 |
files.size | 1252513859 |
featuredService.kontext | Search|https://www.clarin.si/kontext/first_form?corpname=janes_tweet |
featuredService.noske | Search|https://www.clarin.si/ske/#dashboard?corpname=janes_tweet |
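The description above says that decoding works by fetching the original tweets and applying a diff operation to the distributed corpus. The sketch below illustrates that general idea only. It is not the tweetpub implementation: the function names `encode` and `decode`, the operation format and the sample strings are all assumptions made for illustration, and Python's standard `difflib` stands in for whatever diff tweetpub actually uses.

```python
import difflib


def encode(original: str, annotated: str) -> list:
    """Describe `annotated` as a list of operations relative to `original`.

    Hypothetical format (not tweetpub's): spans shared with the original are
    stored as positions only, so the shared text itself is not distributed.
    """
    matcher = difflib.SequenceMatcher(a=original, b=annotated, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared span: keep only the offsets into the original text.
            ops.append(("copy", i1, i2))
        else:
            # replace / insert / delete: keep the text that is new in the
            # annotated version (an empty string for a pure deletion).
            ops.append(("insert", annotated[j1:j2]))
    return ops


def decode(original: str, ops: list) -> str:
    """Rebuild the annotated text from a re-fetched original and the ops."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(original[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


# Made-up sample data: a raw text and a vertical-format-like annotation of it.
original = "Danes je lep dan!"
annotated = "Danes\tdanes\nje\tbiti\nlep\tlep\ndan\tdan\n!\t!\n"

encoded = encode(original, annotated)
restored = decode(original, encoded)
```

In this sketch the annotated text can only be rebuilt by someone who can still obtain the original, which matches the note above that deleted tweets cannot be recovered.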
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- Name
- Janes-Tweet.vert.zip
- Size
- 1.17 GB
- Format
- application/zip
- Description
- Encoded corpus in vertical format
- MD5
- e02cd3ae95f1efd4325c587ed3ba2a46
- Janes-Tweet.vert
- janes_tweet.vert.enc (10 GB)
- janes_tweet.regi (4 kB)
- 00README.txt (161 B)
- Name
- tweetpub.zip
- Size
- 89.55 KB
- Format
- application/zip
- Description
- Program for fetching the tweets and decoding the corpus
- MD5
- fa41f1932ac568f47ad5e9cbd9587cdf
- tweetpub
- LICENSE (11 kB)
- janes.tweet.vert.toy (30 kB)
- README.md (6 kB)
- janes.tweet.vert.toy.enc.dec (30 kB)
- .git
- encode_tweetpub.py (1 kB)
- janes.tweet.vert.toy.enc (41 kB)
- decode_tweetpub.py (2 kB)