About the CLARIN.SI Repository and its Policies
- About the Repository and its Policies
- Mission Statement
- Repository Data Ethics
- Citing Policy
- License Agreement and Contracts
- Legal Issues Concerning Source Data
- Metadata Policy
- Preservation Policy
- Terms of Service
About the Repository and its Policies
The CLARIN.SI research infrastructure is the Slovenian national node of the European Research Infrastructure for Language Resources and Technology CLARIN ERIC and provides, as one of its services, the CLARIN.SI repository of language resources and tools.
The CLARIN.SI digital repository platform is hosted at the Jožef Stefan Institute (JSI), the largest research institute in Slovenia. JSI cooperates in the development of the platform and that of connected services with other institutions that are members of CLARIN ERIC and of the CLARIN.SI consortium.
In the repository we follow the standard principles of a high-quality digital repository, such as the use of persistent identifiers, a federated authentication and authorisation mechanism, and structured, standardized and accessible metadata formats.
The policies presented here describe the methods and practices used in the preservation of datasets, metadata and data integrity. They integrate with existing IT and network security policies at JSI, the Slovenian National Research and Academic Network ARNES, and the Slovenian National Supercomputing Network consortium SLING, and implement best practice guidelines and standards set forth by CLARIN and CoreTrustSeal.
CLARIN.SI repository policies have the following goals:
- Ensuring the usability and accessibility of the deposited data, including FAIR principles.
- Providing a well-maintained path for metadata and data acquisition, publication, and archiving.
- Ensuring that all datasets have complete and accurate metadata and are discoverable using the metadata.
- Enforcing the licensing of the datasets and restricting their use according to the licences provided.
- Ensuring the quality and integrity of datasets in the repository both with physical and digital security.
Mission Statement
The objective of CLARIN ERIC is to advance research in humanities and social sciences by giving researchers unified single sign-on access to a platform which integrates language-based resources and advanced tools at a European level. This is implemented by the construction and operation of a shared distributed infrastructure that aims at making language resources, technology and expertise available to the humanities and social sciences research communities.
The designated community of CLARIN.SI is the national and international research community, in particular researchers involved in digital humanities, corpus linguistics, computational linguistics, and other fields that produce language data or utilise such data or natural language processing tools. Resources with appropriate licences (such as CC-BY) are also of interest to companies developing applications in the area of language technologies.
While the data for many more languages is available, the principal languages that CLARIN.SI covers are Slovenian, Croatian and Serbian, so researchers and companies dealing with these languages are our primary consumers.
Repository Data Ethics
The CLARIN.SI repository ensures, to the extent possible, that data are created, curated, accessed, and used in compliance with disciplinary and ethical norms. To achieve our mission statement, we set out some ground rules regarding access and use in the Terms of Service.
Data in the CLARIN.SI repository are made available under the licence attached to the resources. In any publication users must acknowledge the Deposited Work using its persistent identifier (see Citing Data), its original author(s)/creator(s), and any publisher, where applicable. Full items must not be harvested by robots except transiently for full-text indexing or citation analysis. Unless explicitly granted by the attached licence, full items must not be sold commercially without prior formal permission of the copyright holders.
The submitters acknowledge during the submission that they have the right to distribute the data and that they also have the right to grant the repository permission to distribute the data on their behalf. Acknowledging that the submitter has the right to distribute the data in the first place includes resolving possible legal issues, because if these are not resolved the submitter does not have the right to distribute the data.
Citing Policy
Read our Citing Data Policy in order to learn how the data from the CLARIN.SI repository must be cited.
The entries in the repository often refer to publications available on the Web that describe the data of the entry in more detail. For the URIs of these publications we strongly encourage the use of Permanent Identifiers (e.g. DOI), whenever they are available.
License Agreement and Contracts
CLARIN.SI distinguishes three types of contracts.
- For every deposit, we enter into a standard contract with the submitter, the "Distribution License Agreement", in which we describe our rights and duties and the submitter acknowledges that they have the right to submit the data and gives us (the repository centre) the right to distribute the data on their behalf.
- Everyone who downloads data is bound by the licence assigned to the item. To download protected data, one has to be authenticated and needs to electronically sign the licence. A list of the licences available in our repository is provided on the repository website.
- Submitters can assign custom licences to items during the submission workflow.
Legal Issues Concerning Source Data
As mentioned in the section License Agreement and Contracts, we require the depositor of data or tools to sign a Distribution License Agreement, which specifies that they have the right to submit the data and gives us (the repository centre) the right to distribute the data on their behalf. This means that depositors are solely responsible for taking care of IPR issues before publishing data or tools by submitting them to us.
Submissions are reviewed by the repository staff (editors). For language data, in particular language corpora, three legal issues are considered. The first is the copyright on the original texts, which in Slovenia is regulated by the Copyright and Related Rights Act (ZASP). The second is the protection of personal data, regulated in Slovenia by the Personal Data Protection Act (ZVOP-1) as well as the GDPR. The third, relevant for example to corpora whose texts have been harvested from social media platforms, is the terms of use set by the owner of the platform. The first two issues are also balanced against the public good that follows from releasing the dataset. If the editors are in doubt about the compliance of the dataset with applicable laws or regulations, they request more information from the submitter or refuse to publish the submission. If special conditions apply, they can be addressed in a distribution license tailored specifically for the particular item. It is also possible, in cooperation with the submitters, to use various tools to modify the data so that they do not infringe laws. For example, only samples of integral texts can be released, or their sentences can be shuffled, to avoid copyright infringement while retaining linguistic content. Texts can also be anonymised by using named entity recognition tools to locate and replace personal data, either to avoid its disclosure or to implement the right to be forgotten, which is especially relevant for corpora containing older newspaper texts.
The repository so far has no submissions containing confidential data or data with disclosure risk which cannot be anonymised, and we do not expect this to change in the future. Most of our data is Open Access or distributed under similar public licenses, in particular variants of the Creative Commons licences. Substantially less data is available under custom licenses, which are, however, still rather permissive (e.g. academic restriction and no redistribution). Given that the mission of the repository is to make data widely available, we currently do not accept items that would contain confidential data or data with disclosure risk. It should be noted that the CLARIN Legal and Ethical Issues Committee (of which CLARIN.SI is a member) organises training sessions in the legal and ethical management and distribution of text data.
Metadata Policy
Deposited content must be accompanied by sufficient metadata describing its content, provenance and formats in order to support its preservation and dissemination. The repository metadata are freely accessible and are distributed in the public domain (under CC0). However, we reserve the right to be informed about commercial usage of metadata from the CLARIN.SI repository including a description of such use cases via the CLARIN.SI repository Help Desk.
Preservation Policy
CLARIN.SI is committed to the long-term care of items deposited in its repository and strives to adopt the current best practices in digital preservation.
Repository data integrity and authenticity
Users can deposit new items by using a web-based submission workflow or by working directly with an editor, who then performs the deposit. Only registered users can deposit items, and registration is possible only for users who have an academic account at one of the member institutions of our identity federation. Thus the academic institutions are responsible for verifying the user identity. Provenance information is kept for each repository item from the moment the item is created. Once the item is approved, only the administrators are able to change its (meta)data. The data producers can refer to the Deposited Item Lifecycle page to get acquainted with the details or ask our Help Desk directly.
The editors review the deposited items with the help of tools that are made available to further validate the metadata (e.g., checking URLs in the metadata), and set the appropriate level of support for the submitted file formats. In this way, the deposited items are reviewed by qualified personnel to ensure the presence of necessary metadata, the compliance of data formats with accessibility, best practices and long-term preservation requirements, as well as data integrity and quality and any potential legal issues.
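The kind of automated metadata validation that supports the editors can be illustrated with a minimal sketch. It is not the repository's actual tooling: the field names in `REQUIRED_FIELDS`, the `urls` key and the function `find_metadata_problems` are hypothetical, and the URL check is purely syntactic, whereas a real check would also request each URL.

```python
from urllib.parse import urlparse

# Hypothetical set of required fields; the real repository schema may differ.
REQUIRED_FIELDS = ("title", "author", "description", "licence")


def find_metadata_problems(metadata: dict) -> list[str]:
    """Return human-readable problems found in one metadata record."""
    problems = []
    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing required field: {field}")
    # Syntactic URL check only: an http(s) scheme and a host must be present.
    for url in metadata.get("urls", []):
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"malformed URL: {url}")
    return problems


record = {
    "title": "Example corpus",
    "author": "A. Researcher",
    "description": "",
    "licence": "CC BY 4.0",
    "urls": ["https://example.org/paper", "example.org/no-scheme"],
}
problems = find_metadata_problems(record)
```

An empty result would mean the record passes these automated checks; the editor's review of formats, quality and legal issues remains a manual step.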
To verify that a digital object has not been altered or corrupted, the repository periodically checks the integrity of the data at all stages of its life. The checks include verification of the MD5 checksums of the objects, checking that all required metadata are present, and testing that URLs are working. The results of these weekly checks are automatically sent to the repository staff.
We do not support changing submitted data. To implement any change in the data, a new version of the dataset must be created as a new repository item. In this way, we can enforce reproducibility of results using the dataset and a clear meaning of what a PID (persistent identifier) refers to. The new and the old version(s) have a relation added to their metadata, and the relation is visually represented on the web page. Metadata can still be changed for small fixes, and new relationships are added when a new version of a dataset is introduced. An audit trail is automatically maintained by the repository for all operations on a dataset, including changes to the metadata, which are recorded in the provenance metadata.
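The checksum part of such an integrity check can be sketched as follows. This is an illustration using Python's standard `hashlib`, not the repository's own checker: the manifest layout (relative path mapped to a recorded MD5 digest) and the names `md5_of_file` and `find_corrupted` are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(expected: dict[str, str], root: Path) -> list[str]:
    """Return relative paths that are missing or whose MD5 differs."""
    corrupted = []
    for relative_path, recorded in sorted(expected.items()):
        target = root / relative_path
        if not target.is_file() or md5_of_file(target) != recorded:
            corrupted.append(relative_path)
    return corrupted


# Demonstration on a temporary directory standing in for the file store.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "corpus.txt").write_bytes(b"original content")
    manifest = {"corpus.txt": md5_of_file(root / "corpus.txt")}
    clean_report = find_corrupted(manifest, root)
    (root / "corpus.txt").write_bytes(b"altered content")
    altered_report = find_corrupted(manifest, root)
```

The list returned by `find_corrupted` is the kind of result that could be included in a periodic report to the repository staff.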
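The versioning rule can be modelled with a small data structure. The sketch below is an assumption-laden illustration rather than the CLARIN-DSpace data model: the `Item` class, its field names and the placeholder PIDs are hypothetical. It shows the old item's data being left untouched while both items gain a relation and a provenance entry.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical repository item; the data itself is never modified."""
    pid: str
    title: str
    replaces: list[str] = field(default_factory=list)
    is_replaced_by: list[str] = field(default_factory=list)
    provenance: list[str] = field(default_factory=list)


def publish_new_version(old: Item, new_pid: str, title: str) -> Item:
    """Create a new item for changed data and link both versions."""
    new = Item(pid=new_pid, title=title, replaces=[old.pid])
    # Only metadata of the old item changes: a relation and an audit entry.
    old.is_replaced_by.append(new_pid)
    old.provenance.append(f"relation added: replaced by {new_pid}")
    new.provenance.append(f"created as new version of {old.pid}")
    return new


# Placeholder PIDs, not real handles.
v1 = Item(pid="hdl:example/1", title="Example corpus 1.0")
v2 = publish_new_version(v1, "hdl:example/2", "Example corpus 1.1")
```

Because each version keeps its own PID, a citation of `v1` continues to refer to exactly the data that was originally deposited.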
Responsibilities of repository staff
Access to repository administration functions is strictly limited to authorized staff. All staff involved with repository maintenance and daily operations have well-defined roles and are made aware of the present policies and of their responsibilities in assisting with and implementing this preservation policy.
Repository editors are members of staff that are proficient in preservation policy, deposition and metadata requirements, data format compliance with accessibility, long term preservation requirements and other best practices, as well as data integrity, data quality, legal and licensing requirements.
Operational continuity and disaster recovery
The CLARIN.SI repository infrastructure is hosted in two physically separate computer centres at the JSI campus. The JSI provides network security, border monitoring and protection (firewalls, logging, security advisory and assessments). The datasets we consider for security and preservation consist of multiple components: (1) submitted datasets (files or bytestreams), (2) metadata for the repository and datasets, (3) repository software and its configuration, (4) the underlying operating system instances along with their configuration and logs for the repository and related services, and (5) exported backups of configuration and databases for related service instances.
Each component has its own data security and backup policy and implementation. Specifically, system images are checkpointed before any configuration changes or updates and regular replication of system images to a secondary location is performed. Files representing datastreams are backed up as independent files. All databases undergo regular daily database exports which are backed up and replicated by a different mechanism from the operating system image backup. The same approach to database consistency is implemented for databases used in service instances, with the exception of specialized databases automatically created from available datasets where the original data and transformation scripts are the main back-up strategy.
In order to ensure consistent operation aligned with modern standards and best practices, the repository follows a regular upgrade cycle and, where possible, existing and widely accepted best practices. Currently, this is implemented inside the secure high-availability environment of the JSI IT infrastructure and by the use of the widely deployed CLARIN-DSpace platform, which is, in turn, based on the well supported open-source DSpace platform, but adapted for archiving and distributing language resources. The platform is available on GitHub at https://github.com/ufal/clarin-dspace and the fork for CLARIN.SI at https://github.com/clarinsi/clarin-dspace.
DSpace is based on the OAIS reference model and the implementation follows a list of standards that are relevant for the CLARIN community. Should the platform change in the future, the CLARIN.SI repository is bound to fulfil similar requirements.
The CLARIN.SI implementation is an adapted and localized version of the CLARIN/LINDAT repository distribution. It tracks its production-grade upstream version and is deployed in virtualised OS instances on a cluster consisting of multiple application servers that is configured to provide a fault-tolerant environment supporting multiple instances. This includes a beta-server instance that is used to test upgrade cycles, as well as a data storage system with a hardware RAID controller and RAID-6 configured for high-grade loss prevention and versioned snapshots, replication and back-ups. In addition, the environment supports a distributed filesystem running on application server local disks that supports high-availability within the cluster for local volumes, file-level back-ups as well as application-image level back-ups.
Submitted datasets are stored in a reliable way in multiple locations, i.e. stored as DSpace repository bitstreams in the repository store on a network-attached volume managed by one of the application servers and replicated on another server, while the data itself is stored on a distributed high-availability file store.
Repository and dataset metadata is stored both in production-ready form (i.e., in a virtualised PostgreSQL instance inside the production-level virtual machine) and in a backup and import compatible form (i.e., a database text-dump).
Repository software and configuration is tracked with the Git version control system that permits roll-back and multiple versions and is independent of the application environment.
Each active virtual machine instance for the CLARIN.SI repository and related services is cloned or snapshotted and backed up regularly, and in particular before any software configuration changes or updates, to ensure an atomic and clean roll-back to a consistent state without any data loss. In addition, configuration changes and software updates are tested on a hot beta instance with secondary instances available in case of system failure.
Additional application servers are made available to support possible complete relocation as a contingency measure in case of failure of the application server infrastructure or technical issues in the designated server room. Virtual machine image backups, dataset backups and database export backups are cloned to a secondary backup system in a different location to allow for full recovery in case of data centre failure. Backups are performed regularly, both at the level of virtual system images and at the level of exported data (files and text-format database exports); they are kept on-site and replicated to a secondary location. The file-based backups and VM image backups use two different implementations to avoid single points of failure. Database export files and filesystem snapshot-derived files are used whenever possible to ensure atomicity and integrity of data, with full VM image backups available for reconstruction of services. The two backup systems use different servers and different backup locations to avoid a single point of failure in the back-up and restoration procedures. All backups follow standardised backup recommendations, including checksums for ensuring file integrity and automatic monitoring tools to ensure functionality on various levels, including but not limited to a complete reconstruction of the service in a different environment and transfer of all datasets to another instance of the CLARIN-DSpace platform.
Repository data preservation plan
CLARIN.SI has the right to copy, transform, store and provide access to the data. The preservation function encompasses: taking delivery of the dataset ingested, storing it, and ensuring it is archived, accessible and usable to the research community.
DSpace, and thus the CLARIN-DSpace repository software, provides two levels of digital preservation. The first level is bit preservation, which ensures the integrity of both data and metadata over time regardless of possible changes in the physical storage media. The second is functional preservation, under which a file may be changed over time so that it remains usable as digital formats and media evolve. Format migration is a straightforward strategy for functional preservation.
The preservation strategy is implemented in the functional concepts of the Open Archival Information System (OAIS) reference model for digital preservation environments. During the ingest phase, data depositors are presented with a user interface divided into logical steps. The steps include data upload, where data depositors are urged to use formats and standards recommended by CLARIN and CLARIN.SI; information about the legal issues, including signing of the distribution agreement; and assisted selection of an appropriate licensing model.
All the information, including the file format selection, is verified by editors during the review step. CLARIN.SI performs regular checks on the metadata and data (e.g. completeness, checksums) and may request additional information from the depositors. Occasionally, minor metadata modifications (e.g., correcting grammar mistakes, unifying keywords) can also be made after the item has been published. All the changes, including ones made by editors, are recorded in the provenance metadata. The general policy of the repository is to disable the deletion of metadata, which is crucial for long-term preservation. Automated reports also help us identify possible issues with long-term preservation. This includes extensive automated weekly reports for the whole repository that are reviewed by the repository staff. An important policy for our repository is that the metadata of a resource is public.
Language data is complex, as it can be in various modalities (writing, speech, video), and heavily annotated with complex structures, such as an entry in a comprehensive dictionary, or a syntactically and semantically annotated corpus. The CLARIN.SI repository requires certain ways to structure the data and the usage of specific file formats. The guiding principles for format selection are: open standards are preferred over proprietary standards; formats should be well-documented, verifiable and proven; text-based formats are preferred over binary formats; and, when analogue signals are digitised, lossless or no compression is recommended.
All metadata and data have a persistent identifier (PID) and metadata can be converted to self-explanatory and human-readable XML files.
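As an illustration of what a human-readable XML rendering of metadata might look like, the sketch below serialises a metadata record with Python's standard `xml.etree.ElementTree`. The element names `item` and `field`, the function `metadata_to_xml` and the placeholder PID are assumptions for this example; the schemas actually used by the repository are not described in this document.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(pid: str, metadata: dict[str, list[str]]) -> str:
    """Serialise a metadata record to indented XML (hypothetical layout)."""
    root = ET.Element("item", attrib={"pid": pid})
    # Sort field names so the output is stable across runs.
    for name in sorted(metadata):
        for value in metadata[name]:
            element = ET.SubElement(root, "field", attrib={"name": name})
            element.text = value
    ET.indent(root)  # available from Python 3.9
    return ET.tostring(root, encoding="unicode")


xml_text = metadata_to_xml(
    "hdl:example/1",  # placeholder PID, not a real handle
    {"title": ["Example corpus"], "language": ["Slovenian", "Croatian"]},
)
print(xml_text)
```

Keeping one element per value, with the field name as an attribute, means the file can be read without knowledge of the repository software.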
CLARIN.SI repository staff are tasked with ensuring that administration, maintenance and management are carried out according to the established best practices and guidelines at the JSI, the CLARIN.SI node, the EU CLARIN infrastructure, as well as relevant Slovenian groups, such as "RDA node Slovenia". For this reason, technical measures and guidelines have to be kept current, and our staff members are encouraged to participate in various CLARIN ERIC committees, including the Standing Committee for CLARIN Technical Centres, the Legal and Ethical Issues Committee, the Standards Committee and the User Involvement Committee. In addition, the CLARIN.SI repository software is kept up to date; the CLARIN.SI repository team is in regular touch, via email, a dedicated Slack channel and GitHub issues, with CLARIN/LINDAT, the developers of the CLARIN-DSpace platform, to ensure that the local deployment is safe, current and in accordance with best practices. CLARIN.SI not only reports problems and suggests improvements to the platform, but also contributes to its development, as well as to the development of standards used for encoding language resources.
The technical team and the editorial team involved in the CLARIN.SI repository are expected to regularly attend conferences and workshops with a focus on language resources, such as the Language Resources and Evaluation Conference, and generally follow new developments and best practices in the field.
Continuity of access
The Slovenian CLARIN national centre CLARIN.SI is financed by the Slovenian Research Agency under its ESFRI Infrastructures Programme. The current level of funding is sufficient to maintain the repository system and other CLARIN.SI web services, and to continue developing and improving them, as well as data security at least at the current level.
CLARIN.SI has measures in place to preserve data access in case of unexpected emergency budget cuts. The CLARIN repository platform is a very low maintenance system, easy to keep running, while the group that administers the repository is employed under regular contracts at the hosting institute or at one of the CLARIN.SI consortium partners. The same holds for other Web services offered by CLARIN.SI, such as its web concordancers. Thus, if CLARIN.SI funding were interrupted, the hosting institute would be able to keep its services running without dedicated funding for a substantial time, certainly for at least five years, and most likely would continue to accept new submissions to the repository as well.
The CLARIN.SI repository is open source software and the repository platform is already used by eight CLARIN centres, which allows for simple migration of all the data from one CLARIN-DSpace repository to another while keeping the records accessible under the same PIDs and with the exact same feature set. Therefore, if, in the worst case scenario, the funding for the CLARIN.SI infrastructure were terminated completely and no alternative funding to at least maintain the repository in its current form could be found, one of the other CLARIN centres would be able to host our data and to reconfigure its permanent identifiers for the CLARIN.SI collection; in particular, we have a signed agreement with the Czech LINDAT-CLARIAH-CZ centre for such a migration.
Terms of Service
To achieve our mission statement, we set some ground rules in the Terms of Service. By accessing or using any kind of data or services provided by the Repository, you agree to abide by the Terms contained in this document.
Data in the CLARIN.SI repository are made available under the licence attached to the resources. In any publication users must acknowledge the Deposited Work using its persistent identifier (see Citing Data), its original author(s)/creator(s), and any publisher, where applicable. Full items must not be harvested by robots except transiently for full-text indexing or citation analysis. Unless explicitly granted by the attached licence, full items must not be sold commercially without prior formal permission of the copyright holders.