Working Groups/linguistics/legal issues

From Open Knowledge Foundation

Case studies

Experiences of specific groups with legal, ethical, economic, or community-specific hurdles to providing data under an open license.

Ubiqtionary (Uby, a lexical-semantic resource)

Ubiqtionary (Uby) is a large-scale resource integration project based on the Lexical Markup Framework (LMF, ISO 24613:2008). Currently, nine lexicons in two languages (English and German) have been integrated: WordNet, GermaNet, FrameNet, VerbNet, Wikipedia (DE/EN), Wiktionary (DE/EN), and OmegaWiki. Our LMF model - Uby-LMF - captures lexical information at a fine-grained level and employs a large number of Data Categories from ISOCat. All resources have been converted to Uby-LMF and imported into an SQL-DB. The Uby-API, a common Java software library, provides access to all data in the Uby-DB. The nine lexicons are densely interlinked using monolingual and cross-lingual sense alignments: WordNet-VerbNet, WordNet-Wikipedia, WordNet-Wiktionary, WordNet-OmegaWiki(DE) alignments are currently available. Future work includes adding further sense alignments between resource pairs. Uby will be publicly available for download soon.

Licenses of lexical resources and sense alignments

The lexical resources integrated in Uby come with different licenses:

The collaborative resources Wikipedia, Wiktionary, and OmegaWiki all have an open license that allows distribution and modification of the data.

Similarly, WordNet (http://wordnet.princeton.edu/wordnet/license/), VerbNet (http://verbs.colorado.edu/~mpalmer/projects/verbnet/license.3.0.txt), and FrameNet (http://creativecommons.org/licenses/by/3.0/) allow distribution and modification of the data. For WordNet and VerbNet the original license text has to be included in all distributed versions of the data.

The situation is different for GermaNet. GermaNet has a restrictive research license that prohibits both modification and distribution. We will address this restriction by providing a conversion tool along with the Uby release data, which allows users to convert GermaNet to Uby-LMF, provided they have obtained a GermaNet license. In accordance with the GermaNet license, the Uby release will not contain any GermaNet data.

Regarding the licenses of sense alignments, the majority of the alignment data has an open license: for example, VerbNet-WordNet is part of VerbNet, and the alignments between WordNet and the collaborative resources have an open license as well. The situation is again different for the GermaNet-WordNet alignment, which is part of EuroWordNet's Interlingual Index; see http://www.illc.uva.nl/EuroWordNet/licensing-ewn.html

PanLex (lexical translations)

The PanLex project aims to collect and publish all known (or, more accurately, asserted) lexical translations among all human languages, and intends to make these accessible to the public without imposing limitations on what the public may do with them. Please see Jonathan Pool's overview of intellectual property rights as encountered in this project.

E-MELD case studies (language documentation)

General information

Pointers to relevant publications and excerpts from discussions on mailing lists and forums (corpora list, etc.)

Overview publications

Copyright

Copyright is especially problematic for corpus linguistics. The copyright status of aggregated and processed corpora is often unclear.

Ethics and privacy

Researchers often want as much metadata as possible about their subjects (age, sex, competence in other languages). Such metadata may raise privacy issues and cannot always be recorded.

Linguistic and cultural data from developing countries are often felt to need protection from the Western entertainment industry (Disney etc.). The standard scenario is an animated motion picture of a sacred or taboo ritual dance shown in cinemas worldwide, possibly containing mythical knowledge reserved for shamans or similar. This is the reason why many field workers insist on their data being CC-NC. It is important to acknowledge this fear regardless of whether one thinks that it is well-founded.

Towards recommendations for best practice

(This section should summarize conclusions from the preceding information; so far, it is still at a very early stage.)

Why/when to publish data?

In general, scientific data should be released under a license that allows other researchers to access this data. In fact, this is fundamental to science in data-oriented disciplines like linguistics and NLP. Otherwise, claims and conclusions coming out of the analysis of data cannot be verified, i.e., published results on this basis simply fail to obey elementary scientific standards.

But of course, data is valuable, and its acquisition may involve immense effort. Researchers should thus have the right to exploit their data for their own purposes before giving their colleagues access. Usually, however, the original research questions for which a particular body of data was collected can be answered within a few years. One may thus consider creating resources on the premise that the data will be released after a specified period of time (e.g., after 5 to 10 years). Such publication plans should be announced on project websites and in publications. For every data set described, details on how to access the data (license, prospective data host) should be provided, both at the time of writing (e.g., "please contact XY" or simply "not released yet") and after its final release.

How to publish data? Licenses

Always include a license/copyright statement with your data. State your intentions clearly.

If you have a choice, go for a license that complies with the Open Definition (http://opendefinition.org), because this guarantees that others can access the data with minimal effort. The Panton Principles formulate principles for open data in science, and http://opendefinition.org/licenses lists conformant licenses. In particular, this means that CC-noncommercial (CC-NC) and CC-noderivatives (CC-ND) are ruled out.

If, for some reason, you cannot follow the preceding advice, it is better to go for CC-NC/ND than to give no license information at all. But only a little better. CC-NC is about as bad as a traditional assertion of copyright, as it severely hampers the ability of your data to travel.
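The rule of thumb for Creative Commons licenses can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name cc_license_is_open is hypothetical, the check looks solely at the NC and ND elements of a CC license code, and the authoritative list of conformant licenses remains http://opendefinition.org/licenses.

```python
def cc_license_is_open(license_code: str) -> bool:
    """Return True if a Creative Commons license code carries neither
    the NC (noncommercial) nor the ND (noderivatives) element.

    Hypothetical helper for illustration only; it is not an official
    Open Definition conformance check.
    """
    # Split a code such as "CC-BY-NC-SA" into its elements.
    elements = license_code.upper().replace("_", "-").split("-")
    return "NC" not in elements and "ND" not in elements


# Print a verdict for a few common license codes.
for code in ["CC-BY", "CC-BY-SA", "CC-BY-NC", "CC-BY-ND", "CC-BY-NC-SA"]:
    verdict = "open" if cc_license_is_open(code) else "not open (NC/ND restriction)"
    print(f"{code}: {verdict}")
```

A check like this could be run over the license field of a data catalogue to flag resources whose license prevents them from being redistributed or modified freely.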

More information about licenses can be found at http://www.meta-net.eu/whitepapers/meta-share/licenses

Legal help

It appears that META-NET has a helpdesk for legal questions, although no one appears to have tried it so far. If you try it, please give a summary of your experiences below:

Please read more information about this service at http://www.meta-net.eu/whitepapers/meta-share/licenses before mailing them in order to avoid misunderstandings.

Other best practice recommendation sites
