Show simple item record

 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.date.accessioned 2017-09-05T14:23:23Z
dc.date.available 2017-09-05T14:23:23Z
dc.date.issued 2017-09-05
dc.identifier.uri http://hdl.handle.net/11356/1142
dc.description Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approx. 9,000 users who tweet mostly in Slovene. The corpus is structured into individual tweets, together with their metadata. The tweets in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities. Due to Twitter's terms of service, the corpus is distributed in encoded form. The included tweetpub program (also available and documented at https://github.com/clarinsi/tweetpub) should be used to decode it, which it does by fetching the original tweets and applying a diff operation to the distributed corpus. Note that the decoded corpus can contain fewer tweets than the distributed version if some have since been removed from Twitter by their authors.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://revije.ff.uni-lj.si/slovenscina2/article/view/7003
dc.relation.isreferencedby https://nl.ijs.si/janes/viri/avtomatsko-oznaceni-korpusi/#Janes-Tweet
dc.relation.isreferencedby https://doi.org/10.1007/s10579-018-9425-z
dc.rights Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc/4.0/
dc.rights.label PUB
dc.source.uri https://nl.ijs.si/janes/
dc.subject computer-mediated communication
dc.subject Twitter
dc.subject word normalisation
dc.subject named entities
dc.title Twitter corpus Janes-Tweet 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
contact.person Darja Fišer darja.fiser@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
size.info 9877648 texts
size.info 139299220 tokens
files.count 2
files.size 1252513859
featuredService.kontext Search|https://www.clarin.si/kontext/first_form?corpname=janes_tweet
featuredService.noske Search|https://www.clarin.si/ske/#dashboard?corpname=janes_tweet
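
The description above says the corpus is decoded by fetching the original tweets and applying a diff operation. The following is a minimal sketch of that general idea only, not tweetpub's actual algorithm or file format (see its README for those). The function names `encode` and `decode`, the operation tuples, and the sample text are all hypothetical and chosen for illustration.

```python
from difflib import SequenceMatcher


def encode(original: str, annotated: str) -> list:
    """Describe `annotated` as edit operations relative to `original`.

    Hypothetical scheme: runs shared with the original are stored only as
    (start, end) offsets, so they cannot be read without the original text.
    """
    ops = []
    matcher = SequenceMatcher(None, original, annotated, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Text present in the original: keep offsets, not the text.
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # Text that exists only in the annotated version (e.g. tags).
            ops.append(("insert", annotated[j1:j2]))
    return ops


def decode(original: str, ops: list) -> str:
    """Rebuild the annotated text from the fetched original and the ops."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(original[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


# Made-up example: a short text and a token-per-line annotated version.
original_text = "Danes je lep dan"
annotated_text = "Danes\tdanes\nje\tbiti\nlep\tlep\ndan\tdan\n"
encoded = encode(original_text, annotated_text)
restored = decode(original_text, encoded)
```

In this sketch a tweet that can no longer be fetched cannot be decoded, which matches the note in the description that the retrieved corpus may contain fewer tweets than the distributed one.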


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Name: Janes-Tweet.vert.zip
Size: 1.17 GB
Format: application/zip
Description: Encoded corpus in vertical format
MD5: e02cd3ae95f1efd4325c587ed3ba2a46
Name: tweetpub.zip
Size: 89.55 KB
Format: application/zip
Description: Program for fetching the tweets and decoding the corpus
MD5: fa41f1932ac568f47ad5e9cbd9587cdf
 File Preview  
  • tweetpub
    • LICENSE (11 kB)
    • janes.tweet.vert.toy (30 kB)
    • README.md (6 kB)
    • janes.tweet.vert.toy.enc.dec (30 kB)
    • .git
      • logs
      • info
        • exclude (240 B)
      • config (261 B)
      • ORIG_HEAD (41 B)
      • FETCH_HEAD (98 B)
      • index (624 B)
      • packed-refs (107 B)
      • HEAD (23 B)
      • refs
      • description (73 B)
      • hooks
        • applypatch-msg.sample (478 B)
        • pre-push.sample (1 kB)
        • commit-msg.sample (896 B)
        • pre-rebase.sample (4 kB)
        • post-update.sample (189 B)
        • update.sample (3 kB)
        • pre-applypatch.sample (424 B)
        • pre-commit.sample (1 kB)
        • prepare-commit-msg.sample (1 kB)
      • objects
        • 39
          • 127855c149215cd54a458718e9f7e07572bbbc (2 kB)
          • e896e8f631e53b44ce5f4d63660158a4dbd2f1 (187 B)
        • 0d
          • 808335aaa4c08c87e9d2d686749ea72d65000d (2 kB)
        • b6
          • af56db1f9d776d389c79a122706bbb10b72424 (231 B)
          • 5873ffa1fc67cc4325a6aab2f486bf8a8edc82 (142 B)
        • 97
          • 52e10d60ac6192b66218e8abf83b4d16a82838 (5 kB)
          • b590d498b92a42c855b56c214fbce7622eff13 (207 B)
        • 96
          • 12ede23ba37564366675af36c4e24e4547c853 (181 B)
        • 33
          • 2c9c3f603ac60306418ea6a2523c6975583747 (205 B)
        • 63
          • 32e9917d48aa7946113f28075bf7b6d21e6841 (85 B)
        • eb
          • 84d9bc2e03be665e4e47eab103dc80828f6ae5 (7 kB)
        • e2
          • b2365cdb702ef890f43c19b8add74c4305d076 (1 kB)
        • 90
          • 24811d36fa2d00c76aae061236e11326595c45 (801 B)
        • 49
          • 71ea2526f293cb7f52157e7f3f08b7db94923b (6 kB)
        • cf
          • 72939ed3ae38b73ee52c118ee44f55f2e67a02 (196 B)
        • 14
          • ecfbd6257047a7185cbf2ee500dfab7fa0504e (74 B)
        • 76
          • 1145c596e820074bcaf9cd51bf1a91f3fe40fc (207 B)
        • 71
          • c60b4262a5110bccb5199df2c992b43fbfc32d (177 B)
        • pack
          • a7
            • b661fab211e7899dc5f5799e730b3da995c928 (180 B)
          • info
            • ae
              • 63fc7f71ec5c46a16ead01b723878882ecd898 (2 kB)
            • d6
              • ed8125ccecab3f7464e11cb881c90fc683dc02 (196 B)
            • d5
              • fda9a7fef04f6e99ca99cf1768094134a176e0 (2 kB)
            • 8d
              • ada3edaf50dbc082c9a125058f25def75e625a (4 kB)
          • branches
    • encode_tweetpub.py (1 kB)
    • janes.tweet.vert.toy.enc (41 kB)
    • decode_tweetpub.py (2 kB)
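
The record lists an MD5 checksum for each of the two files, so a download can be checked before unpacking. The sketch below shows one way to do that with Python's standard `hashlib`. The file names and digests are taken from this record, while the helper names `md5_of` and `verify` and the assumption that both archives sit in one local directory are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 digests as listed in this item record.
EXPECTED_MD5 = {
    "Janes-Tweet.vert.zip": "e02cd3ae95f1efd4325c587ed3ba2a46",
    "tweetpub.zip": "fa41f1932ac568f47ad5e9cbd9587cdf",
}


def md5_of(path, chunk_size=1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that the 1.17 GB corpus archive is never held in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".") -> dict:
    """Map each expected file name to True if it exists in `directory`
    and its digest matches the record, otherwise False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify("downloads")`, where `downloads` is a hypothetical directory holding both archives, returns a dictionary whose values are `True` only for files that are present and match the listed checksum. MD5 is adequate here for detecting a corrupted or truncated download; it is not a defence against deliberate tampering.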