FAIR Guiding Principles
The FAIR Guiding Principles for Data Management and Stewardship, published by Wilkinson et al. (2016), provide a universal framework for data management based on the principles of Findability, Accessibility, Interoperability and Reusability. They have received international support and have been incorporated into relevant funding schemes such as Horizon 2020.
Best Practices for Creating CMC Corpora
First and foremost, it is advisable to think about how to handle the data during and after a research project. In fact, many funding agencies already require researchers to prepare a data management plan (DMP) at the project proposal stage to formalise these thoughts, and even if it is not required, we recommend it as a reasonable first step. For preparing a DMP, good guidelines already exist, for example from the Digital Curation Centre in the UK, and many research institutes and universities have set up dedicated research data management offices to help researchers.
Regarding the FAIR principles, we want to emphasise that both the Findability and the Accessibility principles can be realised simply by depositing the corpus in a research data repository, for example a CLARIN Centre. The repository communicates the existence of the corpus to domain-relevant search engines, assigns a persistent identifier and offers the data for download, with access restrictions where necessary. It is also important to define a license for the corpus, which ideally does not prevent reuse, and to display this license in a prominent position. Most research data repositories prefer well-known licenses, but they also allow user-defined ones and enforce an explicit choice.
Unlike Findability and Accessibility, the principles of Interoperability and Reusability are not satisfied simply by depositing the corpus in a research data repository; they depend on practices specific to the community. First, research data must be stored in open and well-documented formats. Here, the CMC community is responsible for developing and documenting common standard formats for CMC data. One important step has already been taken with the CMC core extension to the TEI P5 Guidelines, which was recently submitted to the TEI consortium as a feature request and will hopefully be adopted by more corpora in the future. Second, research data must have extensive metadata. In the case of CMC data, we consider it particularly important to provide information about the data provenance, that is, when the data were collected, what kind of data were collected and where they came from (e.g. Facebook, Twitter, blogs). In short, these best practices can be summed up by the following questions.
- Are my data findable through a search engine, for example the VLO or OLAC?
- Does the corpus have a Persistent Identifier?
- Is there a clear license attached (that ideally permits reuse)?
- Are the data stored in an open and well-documented format?
- Do the metadata describe the data correctly and comprehensively, including their provenance?
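The five questions above can also be run as a simple self-check against a corpus description before deposit. The following is only a minimal sketch: the `CorpusRecord` class, its field names and the three provenance keys are hypothetical and chosen for illustration, not taken from any repository schema or from the CMC core extension. It also cannot judge whether metadata are correct, only whether the relevant fields are filled in.

```python
from dataclasses import dataclass, field

# Hypothetical provenance keys, mirroring the three aspects named in the text:
# when the data were collected, what kind of data, and where they came from.
REQUIRED_PROVENANCE = ("collected_when", "data_kind", "source_platform")


@dataclass
class CorpusRecord:
    """Illustrative description of a CMC corpus (not a real metadata schema)."""
    indexed_in: list = field(default_factory=list)   # search engines, e.g. ["VLO"]
    persistent_identifier: str = ""                  # e.g. a handle or DOI
    license: str = ""                                # e.g. "CC BY 4.0"
    format_is_open_and_documented: bool = False      # e.g. True for TEI P5
    provenance: dict = field(default_factory=dict)


def fair_checklist(record: CorpusRecord) -> dict:
    """Answer each of the five checklist questions with True or False."""
    return {
        "findable_via_search_engine": bool(record.indexed_in),
        "has_persistent_identifier": bool(record.persistent_identifier.strip()),
        "has_clear_license": bool(record.license.strip()),
        "open_documented_format": record.format_is_open_and_documented,
        "provenance_documented": all(
            str(record.provenance.get(key, "")).strip()
            for key in REQUIRED_PROVENANCE
        ),
    }


def open_issues(record: CorpusRecord) -> list:
    """Return the names of the checklist questions that are not yet satisfied."""
    return [name for name, ok in fair_checklist(record).items() if not ok]


# Example: a record with placeholder values that lacks the source platform.
record = CorpusRecord(
    indexed_in=["VLO"],
    persistent_identifier="hdl:EXAMPLE",  # placeholder, not a real identifier
    license="CC BY 4.0",
    format_is_open_and_documented=True,
    provenance={"collected_when": "2019", "data_kind": "blog comments"},
)
print(open_issues(record))  # ['provenance_documented']
```

In practice, most of these answers come from the repository after deposit (indexing, persistent identifier), whereas the format and provenance entries remain the responsibility of the corpus creators.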
Our community is in the fortunate position that work has already been undertaken and that the community as a whole sees the need for, and the benefits of, common standards for data formats and procedures for data documentation. But we still need further targeted work in which we continue to discuss issues openly, keep formalising existing procedures, and keep developing and exchanging our know-how.
We hope that this CLARIN K(nowledge)-Centre and this documentation will help to bring our CMC corpora closer to the ideal of FAIR research data management.