Open Data Catalogues/2

From Open Knowledge Foundation

Overview

The Data Catalogue Interoperability Workshop was a two-day meeting at the Informatics Forum in Edinburgh on 3-4 May 2011, for those who administer open data catalogues (particularly those located within Europe). The meeting included discussion, design and development work aimed at:

The blog post announcing the meeting is here and the original registration page is here.

Keeping in touch

If you’re interested in joining further discussions about these topics, we'd encourage you to join our ckan-discuss mailing list.

You're also welcome to drop into the #okfn IRC channel on irc.oftc.net, where many of the CKAN developers and other interested parties can often be found.

Pre-meeting notes

See this thread on ckan-discuss for context.

Brainstorming for sessions

Ideas for sessions:

Agenda

Tuesday 3rd May 2011

Wednesday 4th May 2011

List of participants

Show and Tell

5 min show and tell of products/ideas

Gold Standard Best Practices Proposal

Catalog Model

This isn't the model your data catalog must follow. But if you make a view of your data catalog using this model, it will make it easy for other catalogs to import your records, and therefore make your datasets more discoverable. This is intended as an interoperable core. Catalog publishers can include arbitrary additional data in addition to the concepts mentioned here. Catalog consumers will likely ignore such extra information.

We can revisit these proposals once we have experience from real implementations.

Data catalogs:

So far, interoperability is only defined at the dataset metadata level. Collections and licenses are not yet covered; this will come later. The contents of a dataset are out of scope: we only care about dataset metadata.

Datasets:

Distributions:

Documentation:

Changes to DCAT to support the catalog model

dcat:permanentIdentifier

Add a new property on datasets specifically for an identifier that helps with deduplication when aggregating DCAT records from multiple catalogs:

> dcat:permanentIdentifier a rdf:Property;
>     rdfs:subPropertyOf dc:identifier;
>     rdfs:label "Permanent Identifier";
>     rdfs:comment "A permanent, universally unique identifier for the dcat:Dataset.".

This could be a UUID or an HTTP URI or whatever else, as long as it is unique. The ID must be kept intact even if the underlying dataset or any part of the metadata is changed. Aggregators must keep the identifier unmodified, even if they modify or enrich the dataset metadata.

This is distinct from dc:identifier because dc:identifier doesn't have to be globally unique; it's just an identifier with *some* scope.

(note: Atom's atom:id is interesting here -- must be an IRI, "permanent and universally unique", can be resolvable but Atom processors MUST NOT assume that it is: http://tools.ietf.org/html/rfc4287#section-4.2.6 )

There's consensus that this is a good idea.
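
As an illustration of the deduplication role described above, here is a minimal Python sketch of an aggregator merging records harvested from several catalogs. The key name `permanentIdentifier` and the "latest record wins" policy are assumptions made for this example; the notes do not define either.

```python
from typing import Dict, Iterable, List

# Hypothetical key holding the dcat:permanentIdentifier value in a
# harvested record; the workshop notes do not define a JSON key name.
PERMANENT_ID = "permanentIdentifier"


def merge_by_permanent_id(records: Iterable[dict]) -> List[dict]:
    """Keep one record per permanent identifier, in first-seen order.

    A later record with the same identifier replaces the earlier one
    (an assumed policy), but the identifier itself is never rewritten.
    """
    merged: Dict[str, dict] = {}
    for record in records:
        merged[record[PERMANENT_ID]] = record
    return list(merged.values())
```

Two records that share an identifier but come from different catalogs collapse into a single entry, while the identifier is passed through unmodified.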

Subtypes of dcat:Distribution

We want to have two dimensions:

1. direct access vs. indirect access
2. download vs. API vs. other

DCAT currently offers four options, implemented as subclasses of dcat:Distribution: dcat:Download, dcat:Service, dcat:Feed, and unspecified (plain dcat:Distribution). The effect of using these subclasses is that they clarify what dcat:accessURL refers to. For a dcat:Download, dcat:accessURL would be the direct download URL. For a dcat:Service, it would be an endpoint URL. For a generic dcat:Distribution, one would expect that it points to a web page that describes how to get the data (e.g., click this link here.)

dcat:Download maps to direct access download. dcat:Service maps to direct access API. dcat:Feed would probably count as direct access API as well.

DCAT doesn't allow distinction between download/API/other for indirect access. All of those would use plain dcat:Distribution.

Perhaps: have properties dcat:accessURL for machine access and foaf:page for human-readable access information. A problem with that is that some catalogs don't provide information about what's behind the link. The link could point directly to a CSV file, or to a page that contains download instructions for a CSV file.

Perhaps: accessURL as the general property, and directURL with the same value if and only if we know that it's a direct download/endpoint URL.

This requires further discussion. Perhaps work through some examples and how they'd be best expressed.
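
One way to work through the accessURL/directURL idea is a small Python sketch. It is only an illustration of that one proposal, not an agreed design; the function name, the dictionary representation and the `is_direct` flag are assumptions made for this example.

```python
from typing import Optional


def describe_distribution(url: str, is_direct: Optional[bool]) -> dict:
    """Build a distribution description under the accessURL/directURL idea.

    is_direct is True when the URL is known to be a direct download or
    endpoint URL, False when it is known to be a landing page, and None
    when the catalog does not know what is behind the link.
    """
    distribution = {"accessURL": url}
    if is_direct is True:
        # directURL repeats accessURL only when direct access is certain.
        distribution["directURL"] = url
    return distribution
```

A catalog that does not know what is behind a link would pass `None` and publish only accessURL, so consumers can still tell "known direct" apart from "unknown".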

Collections and licenses still need to be discussed at a later date, and possibly themes and tags as well.

Proposed Harvest Mechanism

Catalog Harvest Endpoint is a URL. eg.

The response must contain the description of the data catalog in that format and the capabilities of the endpoint: catalog metadata, filtering capabilities if any, and endpoints for other formats. It may return all matched results in this page, or some, or none, as it chooses. If it returns some or none, it must include a URL for the next set of results if there are any more. Subsequent results pages (if any) may, but need not, repeat the catalog description. Every endpoint should list the endpoints for all formats, including itself. Clients are assumed to obey the crawl-delay specified in the robots.txt file.

It may accept the additional parameters callback=xxx and from=XXX, where the value of from is a date in the ISO format YYYY-MM-DDTHH:MM:SSZ. If so, this MAY limit the results to those from that time onwards. Other filters may be supported, and the repository must return these, although it may ignore them and return all items whether they match or not.
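
As a rough sketch of how a client might build such a request in Python: the function name and the example endpoint are hypothetical, and the sketch assumes the endpoint URL has no query string of its own.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode


def harvest_url(endpoint: str, since: Optional[datetime] = None,
                callback: Optional[str] = None) -> str:
    """Append the optional from= and callback= parameters to an endpoint.

    since should be a timezone-aware datetime; it is converted to UTC
    and formatted as YYYY-MM-DDTHH:MM:SSZ.
    """
    params = {}
    if since is not None:
        utc = since.astimezone(timezone.utc)
        params["from"] = utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    if callback is not None:
        params["callback"] = callback
    if not params:
        return endpoint
    return endpoint + "?" + urlencode(params)
```

Because the endpoint MAY ignore the from filter, a client should still be prepared to receive records older than the requested time.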

For each catalog record, the only important information to be returned is:

A record may be any of a collection, a license or a basic dataset.

A consumer of this service can then use the "URL to obtain the record in the same encoding format" to issue a GET request for the serialised DCAT of the combined "CatalogRecord and Dataset" metadata.
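
The consumer side of this mechanism could look roughly like the Python sketch below, assuming a JSON encoding. The key names `records`, `url` and `next` are hypothetical, since the notes do not fix a serialisation, and pausing between requests to honour the crawl-delay is left out for brevity.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Issue a GET request and decode the response body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def harvest(start_url: str,
            fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Walk the results pages and yield the full metadata of each record.

    Assumed page layout: {"records": [{"url": ...}, ...], "next": ...},
    where "next" is absent on the last page.
    """
    url = start_url
    seen = set()
    while url and url not in seen:
        seen.add(url)  # guard against a page that links back to itself
        page = fetch(url)
        for entry in page.get("records", []):
            # Each entry carries the URL of the record in the same format.
            yield fetch(entry["url"])
        url = page.get("next")
```

Passing the fetch function in as a parameter keeps the paging logic separate from the HTTP layer, so the loop can be exercised against canned pages without a live endpoint.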

The broad goals of this are:

In either case, we will document a recommended way of representing the underlying data in both Turtle and JSON.

DataCatalogs.org

There are currently several lists of open data catalogues that are maintained by several different organisations. These include:

Why does this need to change? Currently:

Hence we would like to propose datacatalogs.org, which:

Phase 1 would enable manual entries, moderated by a group of administrators. Phase 2 would be a fully distributed harvesting model.

Key features for phase 1:

Action items

Phase 1

Action items to be completed by the next meeting, in Berlin in August.

Phase 2

Photos

Photographs of the event are available at:
