Frequently Asked Questions



What is the CLARIN.SI repository?

The CLARIN.SI repository is like a library for linguistic data and tools. It is provided by CLARIN.SI, the Slovenian Common Language Resources and Technology Infrastructure.

You can use the repository to:

  • search for linguistic data and tools and easily download them;
  • deposit your data or tools, knowing that they are safely stored and that everyone can find, use, and correctly cite them (giving you credit).

What kinds of submissions does CLARIN.SI accept?

We accept any linguistic and/or NLP data and tools: corpora, treebanks, lexica, machine-readable dictionaries, etc. We also accept trained language models, parsers, taggers, MT systems, etc. The CLARIN.SI repository also supports e-signature of licences for immediate access to restricted resources.

When uploading language resources, please use one of the recommended formats mentioned in Language Resource Standards.

Do I need to create an account to download and/or make a submission?

  • Without logging in, you can download data and tools whose license allows free sharing. Please read the license before downloading the resource. This applies to all data with one of the Creative Commons licenses and to tools with open source licenses.
  • To download data and tools that require you to sign a license, you need to log in. To make a submission, you also need to log in. However, if you are affiliated with an academic institution you probably don't need a new account.
  • Just click "Login" in the upper right corner and search for your academic institution. To log in, you can use any account with an Identity Provider that is a member of the eduGAIN federation.
  • If you don't have an academic account, you can get one through CLARIN ERIC.

I see an error logging in.

If you have any trouble logging in, please let us know through our Help Desk.

Occasionally, usually when you are the first person to log in through your home institution, you might see an error stating "The authentication was successful; however, your identity provider did provide neither your email, eppn nor targeted id." This means your home institution did not send us enough data about you to operate our service; the institution does this to protect your personal data. We only require an email address, and we follow the Data Protection Code of Conduct, which helps us convince the institution that we won't abuse data about you.

If you have accounts with multiple providers and log in with a different one each time, you might see an error stating "Your email is already associated with a different user". Please try to use the same provider each time. If that is not possible, let us know and we will change your default provider.

Why should I submit my data into CLARIN.SI repository?

Here are some of the reasons why submitting to the CLARIN.SI repository is a good choice.

  • It is free and safe.
  • We respect your license. We encourage Free Data and believe it benefits not only the users, but also the data providers. However, we also accept more restricted data, and we can require users to sign a license before downloading your data, if that is what you need.
  • The data is visible, giving you maximal credit for your work, also via other services, such as Google, VLO, DataCite, OLAC, Data Citation Index, and arXiv.
  • The data is easy to cite. We provide ready-to-use one-click citations in BibTeX, RIS, and other popular reference formats. All the citations include permanent links created from persistent identifiers (we use handles for PIDs). These PIDs are future-proof.
  • For some data types (in particular, text corpora), we can provide additional services, like a concordance search.

Why should I submit my tools into CLARIN.SI repository?

Is there any common search tool across different CLARIN repositories?

Yes, in particular the CLARIN Virtual Language Observatory, or CLARIN VLO for short. This browser helps you find linguistic resources, services and tools provided by CLARIN as well as by some other repositories. Keep in mind, however, that CLARIN VLO is an aggregator: it provides information about a specific resource, but the source repository, such as CLARIN.SI, typically gives more, so it is worth also checking the source repository landing page that CLARIN VLO links to. The original resources, services and tools are hosted by CLARIN centres and other data providers. This means that you cannot use the services and tools, or search and analyse the resources, directly through CLARIN VLO.

There are also other aggregators, such as OpenAIRE, which is a pan-European information platform and a network of Open Access repositories, or re3data, which is a global registry of research data repositories. The main difference between CLARIN VLO and OpenAIRE or re3data is that CLARIN VLO focuses on language resources, services and tools, whereas the other two cover all academic disciplines.

If I have my data or programs on GitHub, can I also deposit them on CLARIN.SI?

Yes, there are no constraints with regard to depositing GitHub data on the repository. It is good practice to explicitly mention the commit hash that the repository data corresponds to. For example, in this entry the GitHub commit URL is stored as the "Project URL", or you can mention the commit checksum in the description of the resource.
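As a rough illustration of recording the commit, the Python sketch below builds a GitHub commit URL of the kind that could be entered as the "Project URL". The function name, owner, repository name and hash are hypothetical placeholders invented for this example; the only check it performs is that the hash looks like a full 40-character SHA-1.

```python
import re


# Hypothetical helper, not part of any CLARIN.SI tooling.
def github_commit_url(owner: str, repo: str, commit_hash: str) -> str:
    """Return the GitHub URL that points at one specific commit."""
    # Require a full 40-character hexadecimal hash so the reference is unambiguous.
    if not re.fullmatch(r"[0-9a-f]{40}", commit_hash):
        raise ValueError("expected a full 40-character lowercase hex commit hash")
    return f"https://github.com/{owner}/{repo}/commit/{commit_hash}"


# Placeholder values only; substitute the commit you actually deposited.
print(github_commit_url("example-owner", "example-repo",
                        "0123456789abcdef0123456789abcdef01234567"))
```

Using the full hash rather than a branch name matters here, because a branch keeps moving while the deposited data does not.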

Why should I deposit my data with CLARIN.SI rather than with ELRA?

CLARIN and ELRA both provide repositories for language resources. However, CLARIN always distributes the deposited resources for free, typically under CC licences, while ELRA mostly uses a commercial model of resource distribution and offers appropriate licences for such use. Also, the main target communities of CLARIN are those of the Humanities and Social Sciences, while ELRA is exclusively targeted at the Language Engineering community, which is not to say that CLARIN does not provide many resources for Language Engineering as well.

What is the PID (handle) good for?

It is a special permanent URL. It provides a permanent link that will resolve correctly even if, in some distant future, the data is moved. It should therefore be used as the URL in citations.
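To make the idea concrete, here is a minimal Python sketch that turns a bare handle into a citable URL. It assumes the handle is resolved through the global Handle proxy at hdl.handle.net; the function name and the handle "12345/example-1" are placeholders invented for this example, not real CLARIN.SI identifiers.

```python
# Assumption: handles are resolved through the global Handle proxy.
HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Turn a handle of the form 'prefix/suffix' into a resolvable URL."""
    handle = handle.strip()
    # Drop an optional "hdl:" scheme in front of the handle.
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    # A handle consists of a prefix and a suffix separated by the first slash.
    prefix, sep, suffix = handle.partition("/")
    if not (prefix and sep and suffix):
        raise ValueError("a handle has the form prefix/suffix")
    return HANDLE_PROXY + prefix + "/" + suffix


# Placeholder handle, for illustration only.
print(handle_to_url("hdl:12345/example-1"))
```

The citation keeps pointing at the proxy URL, and only the record behind the handle needs to change if the data moves.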

What is the actual depositing/archiving procedure?

During the submission of digital language resources to the repository, the data undergoes a curation process in order to ensure quality and consistency. We assist you in meeting necessary requirements for sustainable resource archiving. Data has to be provided with metadata in standard formats accepted/adopted in the respective communities, persistent identifiers (PIDs) have to be assigned, IPR issues have to be resolved and clear statements with regard to licensing and possible use of the resources are to be made. The depositor is also required to electronically sign a deposition agreement acknowledging that (s)he is the holder of rights to the data and that (s)he has the right to grant the rights contained in this licence. Once the data is deposited in the repository, it is assigned a PID for stable reference.

What if I want/need to update the archived data?

Every change to the resources and metadata should be stored as a new version with a new PID. However, if the changes are minimal (e.g., typos or clear mistakes), contact our Help Desk with the submission PID and the changes that should be made. It is up to the reviewer to decide whether these changes should result in a new version or not.

What if I want to withdraw the resources in the future? Can I delete the data?

Yes, in this case contact our Help Desk with the submission PID and the reason. However, we need to keep a reference showing that the data was in our repository (because a persistent identifier was issued), so the administrative metadata will be retained, indicating that the data itself has been removed.

I don't want to/cannot make the data publicly available or want/can make them available only after a specific date. Could I still archive them with CLARIN.SI?

In accordance with the advocacy of the research infrastructures and the general development with respect to Open Access, we strongly encourage data producers to be as open as possible. However, in certain circumstances we will archive your data even if it will not be publicly available. Please contact our Help Desk before completing the submission.

How should I cite a submission?

See our policies.

How safe is my data, if I store it with CLARIN.SI?

Quite safe, probably much safer than on your own computer, because individual users, unlike the CLARIN.SI infrastructure, usually do not have sophisticated security and preservation plans or the means to implement them. Uploading data to the CLARIN.SI repository therefore has two main benefits: you can worry much less about unauthorised access to your data, and you do not have to think about backups. The final version of your data (the one you submitted to the repository) is always there for you.

Protection and preservation of data and software related to the repository is one of the priorities of the CLARIN.SI infrastructure. For this reason, CLARIN.SI is hosted on dedicated infrastructure at the Jožef Stefan Institute providing highly available storage, backup and disaster recovery for archival data and software as well as high network security.

Security and preservation requirements apply to the submitted datasets, their metadata, the exported backups and the complete software supporting the repository. Each of these components has its own data security and backup policy and implementation. Virtual machine image backups, dataset backups and database export backups are cloned to the backup system of the Department of Knowledge Technologies at a different location to allow for full recovery in case of data centre failure.

What license should I pick for my data/tool?

We encourage using a free license. A representative selection of free licenses as well as CC licenses (more appropriate for data) is available directly during submission. There is a great OPEN License Selector which can guide you through the selection of an appropriate license.
If for some reason you need a different license, please contact us.

Where can I find more information about supported licenses?

The list of licenses currently supported is here. However, do not hesitate to contact us in case you need a specific license. The licenses can be accompanied by various requirements; e.g., access limited to logged-in users, filling in additional details (purpose), etc.

Why does CLARIN.SI strongly prefer real authors to institutions?

It is not about contact, it is about citations, credit and trust. That is why we have separate metadata fields for authors and for the contact person. Giving a help desk as the contact is perfectly fine; not acknowledging the authors of a scholarly work is not. We support the direct citation of data. This is why we give PIDs, create formatted citations, etc. This is also the reason why we really want real authors, so that they get citations and other scientists know whose work they rely on.

How do I get the most out of my searches?

In contrast to other search engines, the CLARIN.SI repository search uses OR as the default operator; see the examples below, which clarify this. If you are not satisfied with the results of your searches, you might wish to go beyond plain text searches. You may search only in certain fields, use negation, or add a score (emphasis) to some parts of the query. The search engine is Solr, so use its syntax if you know it or check the documentation.

Examples:

Slovenian lexicon vs Slovenian AND lexicon
The default operator is OR; i.e., the first example searches for Slovenian OR lexicon in all metadata text fields.
dc.title:C?C && -dc.title:training
Returns all items that have C?C in the title (? stands for any single character, so this matches, e.g., CMC) and do not have training in the title.
dc.title:"CMC training corpus"
Use double quotes (") for exact matches and multiword expressions.
author:(Erjavec AND -Ljubešić) AND language:(slovenian AND english)
Searches for items by one author but not the other, restricted to items that are in both Slovenian and English.
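
The query patterns above can also be assembled programmatically. The following Python sketch only builds query strings in the syntax shown in the examples; the helper names are invented for this illustration, and it does not contact the repository or assume any particular search URL.

```python
# Hypothetical helpers for composing query strings; they are not a CLARIN.SI API.

def phrase(field: str, text: str) -> str:
    """Exact-match query on one field, e.g. dc.title:"CMC training corpus"."""
    # Escape backslashes and double quotes so the phrase stays well-formed.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def exclude(term: str) -> str:
    """Negate a term or field query by prefixing it with a minus sign."""
    return f"-{term}"


def all_of(*terms: str) -> str:
    """Join terms with AND, overriding the default OR operator."""
    return " AND ".join(terms)


# Plain terms: with AND, both words must match.
print(all_of("Slovenian", "lexicon"))

# Exact title phrase combined with a negated field query.
print(all_of(phrase("dc.title", "CMC training corpus"),
             exclude("dc.title:training")))
```

Joining terms explicitly with AND is the main point: leaving the operator out falls back to OR and usually returns far more results than intended.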