CLARIN.SI Guidelines for data submission

The CLARIN.SI repository does not, as a rule, accept entries without data (i.e. without bitstreams attached to the entry). This document gives guidelines on how deposited language resources should be structured, which formats the repository accepts, and which standards should be used as annotation formats in textual language resource files. Please note that the complete list of accepted data formats for CLARIN.SI is given in the CLARIN Standards Information System entry for CLARIN.SI.

Contents

Basic aspects
- Names of files and directories
- File extensions
- File compression
- Entries with more than one bitstream
Controlled values
- Language codes
- Dates and times
Accepted binary formats
Encoding of textual files
References

Basic aspects

Names of files and directories

Filenames, as well as directory (folder) names, should contain only ASCII letters, digits, the hyphen ("-") and the period ("."). They should not contain spaces, underscores, brackets, quotes, dollar signs, slashes, colons, or other punctuation characters (except hyphen and period), nor accented letters or other non-ASCII characters. A mixture of upper- and lower-case letters is allowed. Examples of good filenames are "semcro.v1.zip", "ParlaMint-HR-S08.xml", "SuperGLUE-statistics.tsv".
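The naming rule above can be checked mechanically before submission. The following is a minimal sketch in Python, not an official CLARIN.SI tool; the function name is_valid_name and the two negative examples are made up for illustration.

```python
import re

# Allowed characters: ASCII letters, digits, hyphen and period.
_ALLOWED = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_name(name: str) -> bool:
    """Return True if a file or directory name uses only the allowed characters."""
    return _ALLOWED.fullmatch(name) is not None


if __name__ == "__main__":
    # The first two names come from the guidelines; the last two are
    # hypothetical counter-examples (space, underscore, accented letter).
    for name in ["semcro.v1.zip", "ParlaMint-HR-S08.xml",
                 "my corpus_v1.zip", "slovar-č.xml"]:
        print(name, is_valid_name(name))
```

The check applies to a single name, so for a nested resource it would have to be run on every directory and file name separately.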

File extensions

Standard or commonly recognised file extensions should be used, such as ".txt", ".xml", ".jpg". If the authors would like to indicate that the file is in a standard encoding, or that an archive file contains files of a certain type, then double extensions can be used, e.g. "semcro.tei.xml" or "semcro.TEI.zip".

In the rest of this document, the preferred extensions are given next to the file types.

File compression

Language resources deposited in the repository are often large and/or consist of a number of files. In such cases, the resource should be compressed and submitted as a single file. A complete directory should be compressed, and the compressed file should have the same name as the directory it unpacks into. For example, the file "semcro.zip" should unpack into the directory "semcro/", which then contains the files and possibly subdirectories. It is recommended that the directory also contains a README text file, which gives the title of the resource and its handle, so that redistributed copies (e.g. of Creative Commons resources) do not lose track of their origin. It is also recommended that hidden files (such as those created by Macs) are not included in the compressed file.

CLARIN.SI prefers ZIP (.zip) files, but also accepts other container compression formats, such as compressed TAR (.tgz) or, for single files, GNU ZIP (.gz).
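One possible way to produce such an archive is sketched below with Python's standard zipfile module. This is only an illustration under stated assumptions: the function name zip_directory is hypothetical, and "hidden" is taken to mean names starting with a period plus the "__MACOSX" folder that macOS archivers sometimes add.

```python
import zipfile
from pathlib import Path


def zip_directory(directory: str) -> Path:
    """Compress `directory` into `<directory>.zip` that unpacks into `<directory>/`."""
    root = Path(directory).resolve()
    # Append ".zip" to the full directory name, so "semcro.v1" gives "semcro.v1.zip".
    archive = root.parent / (root.name + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            rel = path.relative_to(root)
            # Skip hidden files and directories (assumption: dot-names and __MACOSX).
            if any(part.startswith(".") or part == "__MACOSX" for part in rel.parts):
                continue
            if path.is_file():
                # Store every file under the top-level directory name.
                zf.write(path, arcname=(Path(root.name) / rel).as_posix())
    return archive


# Example call, assuming a directory "semcro" exists in the current directory:
# zip_directory("semcro")
```

The archive is written next to the directory rather than inside it, so it cannot end up containing itself.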

Entries with more than one bitstream

The data that is part of a submission can contain, in addition to the source data, the data in derived formats, accompanying documentation, or the data split into several pieces. This last case makes sense for large submissions, as an individual file should not exceed (approximately) 5 GB.

In such cases the data can be submitted as several (possibly compressed) files. Existing entries in the CLARIN.SI repository illustrate different reasons for having several bitstreams and can be used as examples.

Controlled values

Language codes

When the data (or a filename) needs to refer to a certain language, language codes should be used rather than names of languages. When it exists, the two-letter ISO 639-1 language code should be used. For languages that do not have a two-letter code, the three-letter ISO 639-2 so-called “T-code” should be used or, equivalently, the three-letter code from ISO 639-3, which covers many more languages than ISO 639-2. All three standards concentrate on modern standard languages, although ISO 639-2 and ISO 639-3 also cover some historical variants of a few languages.

If the language resource is of a dialect, or of a historical variant not covered by the ISO standards, then BCP 47 (Tags for Identifying Languages) should be used. In short, this Best Current Practice states that, if possible, the appropriate code from the IANA Language Subtag Registry should be used. If the registry does not contain the required language variant, an unregistered code is constructed by starting with the code for the language (e.g. “sl” for Slovenian), adding the string “-x-” (which introduces a private-use, i.e. unregistered, part) and choosing an arbitrary ASCII suffix (e.g. “prekmurje” for the Slovenian Prekmurje dialect), giving “sl-x-prekmurje”.

You can read more about how to choose the correct language (variety) tag in the W3C document “Choosing a Language Tag”.

Dates and times

All dates and times that appear in a machine-processable context should follow ISO 8601, i.e. “2020-12-28” for a date, “23:21:12” for a time, and “2020-12-28T23:21:12” for a combination of the two. If the time zone is important, it can be specified either with the suffix “Z”, which denotes Coordinated Universal Time (UTC, mostly the same as Greenwich Mean Time), or with an offset from UTC, e.g. “23:21:12+01:00” for Central European (winter) time, used in Slovenia and most other EU countries.
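As a minimal sketch, the ISO 8601 forms used for dates and times in this document can be produced with Python's standard datetime module. The helper name to_utc_z is hypothetical, and the fixed +01:00 offset is an assumption for the example rather than a general rule for Slovenian data.

```python
from datetime import datetime, timedelta, timezone

# A timestamp with an explicit +01:00 offset (assumed here for illustration).
cet = timezone(timedelta(hours=1))
local = datetime(2020, 12, 28, 23, 21, 12, tzinfo=cet)


def to_utc_z(dt: datetime) -> str:
    """Convert an offset-aware datetime to UTC and format it with the "Z" suffix."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    print(local.date().isoformat())  # 2020-12-28
    print(local.time().isoformat())  # 23:21:12
    print(local.isoformat())         # 2020-12-28T23:21:12+01:00
    print(to_utc_z(local))           # 2020-12-28T22:21:12Z
```

Note that a timestamp carries either the "Z" suffix or a numeric offset, not both.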

Accepted binary formats

CLARIN.SI accepts data in standard and/or well-known open formats. For the full listing of accepted formats see the CLARIN ERIC Standards Information System entry for CLARIN.SI.

As most of the submissions involve marked-up or otherwise structured (mostly) textual data, and the rules here are somewhat more complicated, this is covered in the next section. Here we give the other (i.e. binary) formats accepted by CLARIN.SI:

Compression and packaging: GNU ZIP (.gz), ZIP (.zip), TAR (.tar), compressed TAR files (.tgz).
Document files: for typeset document files we accept PDF (.pdf), but only for accompanying documentation (published papers or reports on the dataset, annotation guidelines etc.), and not for the primary dataset itself. Facsimiles are an exception: in addition to image formats, they can also be submitted as PDFs.
Language models: if annotation or other open source language analysis tools produce or use only binary language models, these are accepted into the repository, although we prefer text versions of the models.
Audio files: Wave (.wav), FLAC (.flac), AIFF (.aiff), MPEG 4 audio (.m4a), MP3 (.mp3), RAW (.raw). Note that we prefer audio files with lossless or no compression.
Image files: TIFF (.tiff), GIF (.gif), JPEG (.jpg), PNG (.png), SVG (.svg). Note that we prefer images with lossless or no compression.
Video files: MPEG video (.mpg, .mpeg, .mp4), AVI (.avi), Motion JPEG 2000 (.mj2).

CLARIN.SI, as a rule, does not accept word processor or spreadsheet formats, such as Microsoft Word or Excel files, or binary (compiled) programs. We allow such data in special cases, which should be discussed in advance via the repository Help Desk. If Word or Excel files are then submitted, they should always be in their XML encoding (i.e. with .docx and .xlsx extensions); the equivalent XML-based ODF formats used by OpenOffice are also acceptable.

Encoding of textual files

As most repository submissions involve files that are essentially text files (including numeric data, source program files, XML files, etc.), we explain here in more detail how such files should be encoded.

Character encoding

CLARIN.SI accepts only Unicode files. We do not accept files with 8-bit encodings, such as ISO 8859 or Windows code pages. Unicode files should be encoded in UTF-8, the exception being text files in non-Latin-based scripts, such as Japanese, which can use UTF-16.
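Whether a file is valid UTF-8 can be tested by simply trying to decode it. The sketch below uses only the Python standard library. The function names are hypothetical, and the source encoding passed to convert_to_utf8 must be known in advance, since the code does not try to guess it.

```python
from pathlib import Path


def is_utf8(path: str) -> bool:
    """Return True if the whole file decodes as UTF-8 without errors."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(src: str, dst: str, source_encoding: str) -> None:
    """Re-encode `src` from a known 8-bit encoding (e.g. "iso-8859-2") into UTF-8."""
    text = Path(src).read_bytes().decode(source_encoding)
    # Working on bytes keeps the original line endings unchanged.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Both functions read the complete file into memory, which is fine for typical text files but may need a chunked approach for very large ones. A successful decode does not prove that the file was meant to be UTF-8, because some byte sequences from 8-bit encodings happen to be valid UTF-8, so a quick visual check of the converted text remains advisable.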

Programs

We accept source code, i.e. programs in any of the better known programming languages, such as Python (.py), Perl (.pl), R (.r), C (.c), XSLT (.xsl) etc. including data packaged for such programs.

Standard text-based formats

We also accept text-based formats which are supported by various standards bodies so that documentation is openly available on the Web. Prominent examples are JSON (.json) and RDF/Turtle (.ttl) files, or XML, which is further discussed below.

Plain text files

For unstructured text we accept plain text files (.txt). Trivial formatting, such as the fact that a line break indicates a new paragraph or that text in square brackets indicates a transcriber comment can also be included, as long as the conventions used are explained in a README file.

Tabular data

For spreadsheet or database-like data, we accept commonly used formats such as tab (.txt/.tsv/.tab) and comma (.txt/.csv) separated values. The tabular files should contain a header row, and the data should be accompanied by a README file explaining the meaning of the columns.
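A simple structural check of a tab-separated file might look like the sketch below. It is illustrative only: the function name check_tsv and the column names in the usage example are invented. The code can confirm that every row has as many cells as the first row, but it cannot tell whether the first row really is a header.

```python
import csv
import io


def check_tsv(text: str) -> list[str]:
    """Return a list of problems found in tab-separated text (empty if none)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    if any(not cell.strip() for cell in header):
        problems.append("header row has an empty column name")
    # Data rows start on line 2; each must have as many cells as the header.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} columns, found {len(row)}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical two-column table with a header row.
    print(check_tsv("lemma\tfrequency\nhiša\t42\n"))
```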

Annotated corpora can be submitted in the CoNLL-U format (.conll or .conllu) used by the Universal Dependencies project. We also accept the so-called vertical files (.vert/.vrt), which are a mixture of tabular data and XML-like tags, and are used by CQP-based concordancers, i.e. the Corpus Query Workbench, (no)Sketch Engine and KonText. If vertical files are deposited, they should also be accompanied by their registry file. Many corpora available in the CLARIN.SI repository are, along with their TEI encoding, also available as vertical files, and are linked to the CLARIN.SI infrastructure (no)Sketch Engine and KonText concordancers.

HTML documents

We do not accept HTML (.html/.htm) documents as primary data; however, they can be used for documenting the entry, e.g. to explain the structure of the data or its linguistic annotation. Such HTML documents should be valid according to some version of HTML (preferably XHTML) and self-sufficient, i.e. if CSS is used, it should preferably be embedded in the HTML file(s) or stored together with them.

XML documents

By far the most common format of submissions is XML (.xml), which allows for richly and hierarchically structured text data. CLARIN.SI accepts any valid XML documents, where:

the schema that is used to validate a document is well known and publicly available from a stable location, together with its documentation, e.g. RDF/XML (.rdf) or ELAN (.eaf);
or the schema, including its documentation, is a part of the repository entry.

We accept schemas in any XML schema definition language, i.e. DTD (.dtd), RelaxNG (.rng/.rnc) and W3C XML Schema (.xsd), as well as Schematron (.xml).
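Before validating against a schema, it can be useful to confirm that a document is at least well-formed XML. The sketch below does only that, using Python's standard xml.etree.ElementTree parser. It does not validate against a DTD, RelaxNG or W3C XML Schema, which requires a schema-aware validator not shown here. The function name and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML (no schema validation)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<text><p>Dober dan.</p></text>"))  # properly nested
    print(is_well_formed("<text><p>Dober dan.</text>"))      # <p> is never closed
```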

TEI documents

The preferred XML encoding of CLARIN.SI repository entries is TEI (.tei/.xml), i.e. using the Text Encoding Initiative Guidelines for encoding structured language resources, such as language corpora, machine readable dictionaries, text-critical editions, etc.

When the type of the deposited language resource is covered by any of the standard or best-practice customisations of the TEI, such as ISO 24624:2016 for transcriptions of spoken language, TEI Lex0 for dictionaries, or Parla-CLARIN for encoding corpora of parliamentary debates, these schemas should be used in preference to bespoke or generic TEI encodings.

If the deposited TEI documents use only standard modules of the TEI, in particular if they can be validated against the CLARIN.SI TEI schema, then it is sufficient to deposit only the TEI XML files. For document encodings that incorporate any extensions to the TEI, however, the TEI ODD, the generated XML schemas (in particular RelaxNG .rng and .rnc), and documentation in at least HTML should always accompany the TEI XML data.

Linguistic annotation vocabularies

Most language corpora are annotated on various levels with linguistic categories. These categories must be documented, either at stable external URLs or together with the repository entry, i.e. in included files or, especially with TEI-encoded corpora, as part of the corpus document itself.

The more commonly used annotation specifications in CLARIN.SI are:

Universal Dependencies: morphological features and syntactic relations
MULTEXT-East (@GitHub): morphosyntactic specifications, also available in various machine usable formats. For inclusion in XML you can use http://nl.ijs.si/ME/V6/msd/tables/msd-fslib2-*.xml
Janes Named Entity Guidelines