The case for downloadable datasets

[Draft for discussion.]

= Freely Licenced Downloadable Datasets =

Data should be made available as freely licenced downloadable datasets.

By "freely licenced" we mean that people who receive the data are permitted to use it without restrictive conditions.

By "downloadable datasets" we mean structured or unstructured data that can be transferred in its entirety to a user's own computing resources.

This is in contrast to providing restricted access to datasets on physical media or through Internet APIs.

= For Users =

Advantages

 * Small datasets can be loaded into powerful desktop software such as R and Processing in their entirety and explored interactively without network programming overheads.
 * Large datasets can be hosted on distributed filesystems and analysed using large-scale data analysis tools such as Hadoop and Pig.
 * Investment in the data is not lost if the dataset becomes unavailable from its source, as the user has a local copy or access to other people's copies.

Disadvantages

 * If the institution has not kept the data current then it may be less relevant or accurate, although it is in the institution's interests to keep data current for precisely this reason.
 * Storing large datasets can be inconvenient, although storage costs are decreasing all the time and services for hosting large datasets are available.
 * Downloading a dataset and preparing it for use requires more work than accessing an API, although accessing an API can impose its own complexities and limitations.
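
To illustrate the trade-off above, here is a minimal sketch of working with a downloaded dataset using only the Python standard library. It assumes the dataset has already been saved locally as a CSV file; the file name, the column names (`gallery`, `visitors`) and the sample rows are hypothetical stand-ins, not taken from any real dataset.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def load_rows(path):
    """Read an entire local CSV file into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def total_by(rows, group_col, value_col):
    """Sum a numeric column for each distinct value of another column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += float(row[value_col])
    return dict(totals)


# Hypothetical sample standing in for a downloaded dataset.
SAMPLE = "gallery,visitors\nEast,120\nWest,80\nEast,30\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "visits.csv"
    path.write_text(SAMPLE, encoding="utf-8")
    # The whole dataset is now in memory; no further file or network access is needed.
    rows = load_rows(path)

print(total_by(rows, "gallery", "visitors"))
```

Once the rows are loaded, any number of further questions can be asked of them without network calls, rate limits or API keys. The preparation step (finding the file, checking its encoding and column names) is the extra work the last bullet refers to.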

= For Institutions =

Advantages

 * The dataset will spread further, receiving greater exposure and driving more attention back to the institution.
 * The dataset will receive more use, which can be leveraged using attribution to drive attention back to the institution.
 * Use of the dataset may reveal unexpected or hidden features that are of interest to the institution or its potential audience, uncovering new value in the institution's material resources or services and driving attention to them.
 * If the dataset is under a copyleft or share-alike licence, or if it is used with other freely licenced datasets, data may be identified or contributed back that is of value to the institution.
 * Spread, use, discovery and contributions create network effects, raising the profile of the institution, driving attention to it, and thereby increasing demand for its material resources and services.
 * This increased demand allows the institution to sell updated data and reports (still as freely licenced downloadable datasets); access to the institution's collections, resources, sites and spaces; events at the institution identified by or centred around use of the data; services both using the data and in other areas; and merchandise both on and off-site.
 * If the institution is a museum, gallery or other space, increasing attendance, depth of engagement, and use of services are key. This is a way of achieving that.

Disadvantages

 * Loss of control of distribution, although broader distribution increases publicity, use and returns.
 * Potential loss of control of direct revenue, although the network effects of wider distribution increase indirect revenue.
 * Possible use of the data by groups or in ways that the institution might disapprove of, although licences handle non-endorsement and new social norms are emerging that recognise that misuse of free resources reflects badly on the user rather than the provider.

= Licences For Datasets =

Free Licencing

 * Use a free licence as defined by the Open Knowledge Definition (OKD).
 * This ensures that users know they can use the data and return the value of that use to the community and the institution.
 * Free licencing includes commercial use. Promotion, re-distribution and augmentation of data by well-resourced commercial organizations is a net gain to the institution and its community, and even commercial "free riders" will drive attention to and create demand for the dataset and its providing institution.

Use An Existing Licence

 * It is always tempting to create a custom licence to address the unique situation of an institution or dataset.
 * Custom licences may be incompatible with the licences of other datasets, reducing the usability of the data and therefore its value to both the user and the institution.
 * Licences from organizations such as Open Data Commons or Creative Commons have been drafted by lawyers, publicly reviewed, used by many projects, and in some cases upheld by the courts.

Attribution

 * Attribution for data use is important for creating the network effects that make other people's use of your data valuable to you.
 * ODC's ODC-By, Creative Commons's Attribution licence (which isn't recommended for data but is used for it), and the UK Government's OGL all require attribution.

Copyleft/Share-Alike

 * Copyleft or share-alike on data is controversial, and it is harder to enforce than on cultural works, but where it does work it is a way of ensuring that data added to yours is made available back to you.
 * ODC's ODbL uses copyright, the EU database right, and contract law to create a share-alike licence with attribution for databases.

Waiving All Rights

 * Waiving all rights ensures the maximum distribution and use of data, although at the possible cost of direct attribution and of data being contributed back.
 * ODC's PDDL and Creative Commons's CC0 allow you to waive all rights on data internationally.

= Summary =

Providing freely licenced downloadable datasets can reduce direct monetization possibilities and means giving up some control of the data. But such datasets are of great utility to users, and that increased utility strengthens the reputational network effects that drive demand for the monetizable resources and services of the institution that provided the dataset.