Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

CLARIN workshop
Multilingual corpus annotation tools:
development and integration

Ljubljana, November 10 – 11, 2016

Introduction

Basic annotation of language corpora is a prerequisite for corpus linguistics and for any advanced exploration of the information content of language. Yet for many CLARIN languages, online annotation tools are not available. This two-day workshop aimed to close this gap by bringing together CLARIN members that have locally developed annotation tools or resources, in order to integrate them in terms of specifications and offer them as web services within the WebLicht architecture. The planned multilingual web services will enhance the utility of workflow construction and execution environments and feed back into their development and documentation.

The workshop catalogued the participants' available tools, resources and encoding standards, and proposed a workplan for integrating them with WebLicht, while also considering other such environments, for example TextFlows, developed at JSI. The concrete result of the workshop is an implementation plan with a timeline.

Agenda

First Day

Thursday, November 10th, Physics seminar room:

9:00 – 9:30 Introduction T. Erjavec, D. Fišer
9:30 – 10:30 WebLicht M. Hinrichs, W. Qiu
10:30 – 10:45 Coffee break
10:45 – 11:15 TextFlows S. Pollak, M. Martinc, M. Perovšek
11:15 – 12:15 ReLDI data & tools N. Ljubešić
12:15 – 12:45 Estonian data & tools K. Liin
12:45 – 13:45 Lunch
13:45 – 14:15 Latvian data & tools I. Skadiņa, R. Darģis, L. Pretkalniņa
14:15 – 14:45 Discussion all
14:45 – 15:00 Coffee break
15:00 – 16:30 Discussion all
19:00 – Dinner at “Špajza”

Second Day

Friday, November 11th, Biochemistry seminar room:

9:00 – 9:30 Italian data & tools R. Del Gratta
9:30 – 10:00 Czech data & tools P. Stranak
10:00 – 11:00 WebLicht Hackathon all
11:00 – 11:15 Coffee break
11:15 – 12:45 WebLicht Hackathon + Drafting the workplan all
12:45 – 13:45 Lunch
13:45 – 14:45 Drafting the workplan all
14:45 – 15:00 Coffee break
15:00 – 16:30 Workplan discussion all

Envisaged implementation project

Note that the plan is still under development!

  1. Basic annotation services in WebLicht:
    • Tools for tokenisation, sentence segmentation, morphosyntactic tagging and lemmatisation exposed as Web services and integrated with WebLicht (internet protocol, TCF I/O)
    • Languages covered: sl, hr, sr, lv, et, cs, it
    • Basic WebLicht documentation and a short video tutorial will be prepared in national languages
  2. Normalisation of words will be added to WebLicht, in the first instance covering Slovene computer-mediated communication (sl CMC)
  3. Evaluation
    • The functioning of the tools will be tested with Bombard and Awesome profilers
    • A user-centred evaluation will be prepared and carried out
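
To make the first item of the plan more concrete, the following is a minimal Python sketch of the first two steps of a basic annotation chain: tokenisation followed by sentence segmentation, with each step adding its own layer to a shared document. It is an illustration only. The function names, the regular expression and the dictionary layout are assumptions made for this example; the layout is loosely modelled on the idea of layered annotation and is not the actual TCF schema, and the rules are far simpler than those of any tool presented at the workshop.

```python
import re
from typing import Dict, List

# Toy rule (assumption): a token is a run of word characters
# or a single non-space symbol such as a punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Toy rule (assumption): these tokens close a sentence.
SENTENCE_END = {".", "!", "?"}


def tokenise(text: str) -> List[Dict[str, str]]:
    """Split text into tokens and give each one an identifier (t1, t2, ...)."""
    return [
        {"id": f"t{i}", "form": match.group(0)}
        for i, match in enumerate(TOKEN_RE.finditer(text), start=1)
    ]


def segment(tokens: List[Dict[str, str]]) -> List[List[str]]:
    """Group token identifiers into sentences, closing one at each end mark."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        current.append(token["id"])
        if token["form"] in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a closing end mark
        sentences.append(current)
    return sentences


def annotate(text: str, lang: str) -> Dict[str, object]:
    """Run both steps and return the text together with its annotation layers."""
    tokens = tokenise(text)
    return {
        "lang": lang,
        "text": text,
        "tokens": tokens,
        "sentences": segment(tokens),
    }


if __name__ == "__main__":
    doc = annotate("Dober dan. Kako si?", "sl")
    for sentence in doc["sentences"]:
        print(sentence)
```

Because the sentence layer refers to tokens only by identifier, later steps such as morphosyntactic tagging or lemmatisation could add further layers without changing the earlier ones, which is the property that lets independently developed tools be chained in one workflow.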