Working Groups/linguistics/legal issues

= Case studies =

Experiences of specific groups with legal, ethical, economic, or community-specific hurdles to providing data under an open license.

Ubiqtionary (Uby, a lexical-semantic resource)
Ubiqtionary (Uby) is a large-scale resource integration project based on the Lexical Markup Framework (LMF, ISO 24613:2008). Currently, nine lexicons in two languages (English and German) have been integrated: WordNet, GermaNet, FrameNet, VerbNet, Wikipedia (DE/EN), Wiktionary (DE/EN), and OmegaWiki. Our LMF model, Uby-LMF, captures lexical information at a fine-grained level and employs a large number of Data Categories from ISOCat. All resources have been converted to Uby-LMF and imported into an SQL database (the Uby-DB). The Uby-API, a common Java software library, provides access to all data in the Uby-DB. The nine lexicons are densely interlinked using monolingual and cross-lingual sense alignments: WordNet-VerbNet, WordNet-Wikipedia, WordNet-Wiktionary, and WordNet-OmegaWiki (DE) alignments are currently available. Future work includes adding further sense alignments between resource pairs. Uby will be publicly available for download soon.

Licenses of lexical resources and sense alignments

The lexical resources integrated in Uby come with different licenses:

The collaborative resources Wikipedia, Wiktionary, and OmegaWiki all have an open license that allows distribution and modification of the data.

Similarly, WordNet (http://wordnet.princeton.edu/wordnet/license/), VerbNet (http://verbs.colorado.edu/~mpalmer/projects/verbnet/license.3.0.txt), and FrameNet (http://creativecommons.org/licenses/by/3.0/) allow distribution and modification of the data. For WordNet and VerbNet the original license text has to be included in all distributed versions of the data.

The situation is different for GermaNet, which has a restrictive research license that prohibits both modification and distribution. We will address this restriction by providing a conversion tool along with the Uby release data. The tool allows users who have obtained a GermaNet license to convert GermaNet to Uby-LMF themselves. In accordance with the GermaNet license, the Uby release will not contain any GermaNet data.

Regarding the licenses of sense alignments, the majority of the alignment data has an open license: for example, the VerbNet-WordNet alignment is part of VerbNet, and the alignments between WordNet and the collaborative resources have an open license as well. The situation is again different for the GermaNet-WordNet alignment, which is part of EuroWordNet's Interlingual Index; see http://www.illc.uva.nl/EuroWordNet/licensing-ewn.html

PanLex (lexical translations)
The PanLex project, sponsored by The Long Now Foundation, aims to collect and publish all known (or, more accurately, asserted) lexical translations among all human languages, and intends to make these accessible to the public without imposing limitations on what the public may do with them. The data are based on thousands of lexical resources. Legal issues arising from this use of lexical data are discussed in the “Content ownership” section of “Sourcing in PanLex”.

E-MELD case studies (language documentation)

 * [case studies, with a focus on technical rather than legal aspects]

= General information =

Pointers to relevant publications and excerpts from discussions on mailing lists and forums (Corpora List, etc.).

Overview publications

 * Legal aspects connected to several crucial phases of handling text resources were discussed at the Digital Humanities 2007 session "Digital Text Resources for the Humanities – Legal Issues"; see http://www.sfb441.uni-tuebingen.de/c2/overview.html

Copyright
Copyright is especially problematic for corpus linguistics. The copyright status of aggregated and processed corpora is often unclear.


 * Germany:
 * a comprehensive overview (in German) can be found in: Timm Lehmberg, Christian Chiarcos, Georg Rehm, Andreas Witt. (2007) Rechtsfragen bei der Nutzung und Weitergabe linguistischer Daten. In: Datenstrukturen für linguistische Ressourcen und ihre Anwendungen - Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007, Georg Rehm, Andreas Witt, Lothar Lemnitzer (eds.), Narr, Tübingen 2007
 * (Note that German copyright law has been amended since this publication.)

Ethics and privacy
Researchers often want as much metadata as possible about their subjects (age, sex, competence in other languages). Collecting such metadata may raise privacy issues, so it cannot always be recorded.

Linguistic and cultural data from developing countries are often felt to need protection from the Western entertainment industry (Disney etc.). The standard scenario is an animated motion picture based on a sacred or taboo ritual dance that is shown in cinemas worldwide and possibly contains mythical knowledge reserved for shamans or similar figures. This is the reason why many field workers insist on their data being CC-NC. It is important to acknowledge this fear regardless of whether one thinks that it is well-founded.


 * Canada: We have what appears to me to be a rather restrictive ethical review policy based on this document: http://www.pre.ethics.gc.ca/pdf/eng/tcps2/TCPS_2_FINAL_Web.pdf, and I'm sure others face similar constraints. It requires, among other things, that people from whom data is gathered be notified of the particular research purposes for which that data will be used (no "such as" allowed). It also requires that contributors be able to withdraw at any point. Finally, it typically requires that data be destroyed after the study is completed, although it does allow that research ethics boards "should not automatically impose a requirement that researchers destroy the research data. Stored information may be useful for a variety of future purposes." (Brett Reynolds, Corpora List, Apr 5th, 2011)
 * US
 * IRB requirements are indeed restrictive for audio, video, and text. I collected SMS text messages with IRB approval, but I had to delete all identifying information from the data.  I also had to obtain signed consent from the parents of teenagers.  Furthermore, several of those who kindly downloaded their messages and contributed  to the development of my specific type of  SMS corpus requested that I give them copies of any articles published based on the data they had contributed.  I complied with all the requirements, but all these obstacles are discouraging many from undertaking innovative research.  After all, most of us are not involved in invasive medical experiments. (Susanna M. Sotillo, Corpora List, Apr 5th, 2011)
 * Regarding corpora of spoken language, one approach is to adopt a PUBLICATION model. From the beginning you tell people that the transcriptions and audio will be published. (This is what we did with the Santa Barbara Corpus of Spoken American English, published by the Linguistic Data Consortium, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000S85; see also http://www.linguistics.ucsb.edu/research/sbcorpus.html.) Once it is published, it is simply out there on CDs etc. in libraries etc., and can be used for eternity like any other published document you encounter in the library. Obviously this requires appropriate consent from the beginning, but it is worth going to the trouble to get it. (John Du Bois, Corpora List, Apr 10th, 2011)
 * EU
 * UK: Following up my last, responding specifically to John Du Bois's suggestion that publishing one's speech corpus, and saying in advance that one is going to do so, solves the problem: I can't see that it does.  Suppose (to take an actual example from BNC) that you have a couple of teenage girls chatting informally, and saying that one of their schoolmates (not participating in the conversation, but identifiable from various clues) behaves like a whore.  The speakers might be happy to sign your release form for publication; does that make the implications for the other girl all right?  Maybe, under US law, you would be legally safe; you certainly would not be under European law, and although it goes against the grain to say anything good about the European Union, in this case I think it has a point. (Geoffrey Sampson, Corpora List, Apr 11th, 2011)
 * Norway: All corpora containing audio and video are considered to contain sensitive data, and we are not supposed to use the corpus data for purposes other than those that have been planned and, importantly, conveyed to the informants beforehand. The person protection law is very strict even for corpora that contain no secrets and no personal information (except the fact that the informants can be recognised). However, we have managed to find a solution that seems to work, in cooperation with the state institution that deals with the person act and research data. We have agreed with them that if we tell the informants whose speech will be part of a corpus that the goal for this particular research project is to develop a speech corpus for linguistic research, then it is the corpus that is the research project rather than particular research questions. We also do not have to dispose of the data when the project is finished because we have given the project a very long life time: eternal. (Janne Bondi Johannessen, Corpora List, Apr 5th, 2011)

= Towards recommendations for best practice =

(This section should summarize conclusions from the preceding information. So far, it is still at a very early stage.)

Why and when to publish data?
In general, scientific data should be released under a license that allows other researchers to access them. This is fundamental to science in data-oriented disciplines like linguistics and NLP: otherwise, claims and conclusions derived from the analysis of the data cannot be verified, and results published on this basis fail to meet elementary scientific standards.

But of course, data are valuable, and their acquisition may involve immense effort. Researchers should thus have a right to exploit their data for their own purposes before they give access to their colleagues. Usually, however, the original research questions that a particular body of data was collected for can be answered within a few years. One may thus consider creating resources under the premise that the data are released after a specified period of time (e.g., after 5 to 10 years). Such publication plans should be announced on project websites and in publications. For every data set described, details about how to access the data (license, prospective data host) should also be provided, both for the time of writing (e.g., "please contact XY" or simply "not released yet") and for the time after the final release.

How to publish data? Licenses
Always include a license/copyright statement with your data. State your intentions clearly.

If you have a choice, choose a license that complies with the Open Definition (http://opendefinition.org), because this guarantees that others can access the data with minimal effort. The [Panton Principles] formulate principles for open data in science, and http://opendefinition.org/licenses lists conformant licenses. In particular, this means that CC-noncommercial and CC-noderivatives are ruled out.

If, for some reason, you cannot follow the preceding advice, it is better to choose CC-NC/ND than to give no license information at all, but only a little better. CC-NC is about as bad as a traditional assertion of copyright, as it severely hampers the ability of your data to travel.
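The rule of thumb above (avoid the NC and ND clauses) can be checked mechanically when a project manages many data sets. The following is only a minimal sketch: the function names and the hyphenated code format (e.g. "CC-BY-SA") are assumptions made for illustration, and the test covers nothing beyond the two clauses named in the text. The authoritative list of conformant licenses remains http://opendefinition.org/licenses.

```python
# Hypothetical helper, not part of any existing tool: it flags Creative
# Commons license codes that carry the NC or ND clause, which the text
# above says are ruled out by the Open Definition.
RESTRICTIVE_CLAUSES = {"NC": "noncommercial", "ND": "noderivatives"}


def restrictive_clauses(license_code):
    """Return the restrictive clauses (NC, ND) found in a CC license code."""
    # Assumed input format: clauses separated by hyphens, e.g. "CC-BY-NC-SA".
    parts = license_code.upper().split("-")
    if not parts[0].startswith("CC"):
        raise ValueError(f"not a Creative Commons code: {license_code!r}")
    return [clause for clause in parts[1:] if clause in RESTRICTIVE_CLAUSES]


def is_open_cc(license_code):
    """True if the CC license code contains neither NC nor ND."""
    return not restrictive_clauses(license_code)


if __name__ == "__main__":
    for code in ["CC-BY", "CC-BY-SA", "CC-BY-NC", "CC-BY-NC-ND"]:
        status = "open" if is_open_cc(code) else "restricted"
        print(code, status, restrictive_clauses(code))
```

A check like this is only a first filter. It says nothing about licenses outside the Creative Commons family, and it does not replace reading the license text.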

More information about licenses can be found at http://www.meta-net.eu/whitepapers/meta-share/licenses

Legal help
META-NET appears to have a helpdesk for legal questions (address below), although no one has tried it yet. Before mailing them, please read the information about this service at http://www.meta-net.eu/whitepapers/meta-share/licenses to avoid misunderstandings. If you do try it, please add a summary of your experiences here.
 * helpdesk-legal@meta-share.eu

Other best practice recommendation sites

 * [EMELD: best practices for language documentation]