Open Data Catalogues/2

Overview
The Data Catalogue Interoperability Workshop was a two-day meeting at the Informatics Forum in Edinburgh on 3-4 May 2011 for those who administer open data catalogues (particularly those located within Europe). The meeting included discussion, design and development work aimed at:


 * improving interoperability between different types of data catalogues (including but not limited to those powered by CKAN);
 * allowing them to be federated more easily into projects like opendatasearch.org or publicdata.eu.

The blog post announcing the meeting is here and the original registration page is here.

Keeping in touch
If you’re interested in joining further discussions about these topics, we'd encourage you to join our ckan-discuss mailing list.

You're also welcome to drop into the #okfn IRC channel on irc.oftc.net, where many of the CKAN developers and other interested parties can often be found.

Pre-meeting notes
See this thread on ckan-discuss for context.


 * Overview of data catalog aggregator projects
   * OKFN's publicdata.eu: http://publicdata.eu/
   * OKFN's Open Data Search: http://opendatasearch.org/
   * OKFN's semantic.ckan.net: http://semantic.ckan.net/
   * Guardian's World Government Data: http://www.guardian.co.uk/world-government-data
   * Sunlight Foundation's National Data Catalog: http://nationaldatacatalog.com/
   * RPI's International Data Catalog Search: http://logd.tw.rpi.edu/node/9903
   * MetadataFlask's Open Data Directory: http://open.mflask.com/
   * Meta catalogs? http://data-ac-uk.ecs.soton.ac.uk
 * Overview of proposed interoperability formats
   * CKAN API
   * dcat
   * ISO 19139 / CSW
   * VoID
   * “Just give us a CSV of your catalog”
     * What's the smallest set of fields you'd need?
   * OAI-ORE
   * TWC RPI converter: csv2rdf4lod (Tim Lebo)
     * Start here: https://github.com/timrdf/csv2rdf4lod-automation/wiki
 * Outreach
   * Towards a document with recommendations for data catalog operators
   * Identify high-value data catalogs that are currently not available in a structured form
     * Screenscrape them?
   * What's in it for me? If we want people to do things a certain way, we had better explain to them why it's good for them (rather than just appeal to their better nature)
   * W3C's upcoming Linked Government Data working group (has a work item for dcat)
 * Specialisations of Catalogs
   * Such as:
     * Research Data Catalog
     * Organisation Data Catalog
     * Government Data Catalog
     * Regional Data Catalog
   * Are they special?
 * Best Practice
   * Some basic dos and don'ts (with explanation of why)
   * Rewards: if you follow this standard, then the following tools will do funky things for you or add cool features to your site
   * Some guidance on URIs for datasets and versions of datasets
 * Specific technical challenges
   * Modeling of aggregated catalogs in dcat
   * How to keep aggregated catalogs up to date?
   * Now we have a catalog of all government data world-wide. So what? What's next? What can we do beyond the usual search and simple faceted browsing of datasets?
   * Accessing data catalogs directly from data workbenches (e.g. Google Refine)
   * User needs? Who are the users anyway?
   * John's Google Doc spreadsheet (RPI vs. dcat vs. CTIC): http://bit.ly/ilCUfK
 * Hacking!
   * Implementing some new CKAN back-ends via our new Harvesting API, to support harvesting from and export to various other catalogues. See the README at the bottom of this page: https://bitbucket.org/okfn/ckanext-harvest/src

Brainstorming for sessions
Ideas for sessions:


 * Introductions
 * Show and tell
 * Presentation of formats
   * DCAT (Richard)
   * CKAN API (James Gardner):
     * http://data.gov.uk/blog/announcing-datagovuk-catalogue-apis
     * https://bitbucket.org/okfn/ckanext-harvest/src
   * ISO 19139 (ISO 19135 no service) / CSW (James Reid)
   * OAI (Chris)
     * (DSpace repository)
 * What is the minimum you have to do in each of these areas so we don't think you suck?
 * Catalog metadata
   * Application profiles - e.g. knowing that you can open a dataset using a certain tool (Chris)
   * Common taxonomy for datasets (GEMET?)
   * Inadequate modelling leading to abuse of fields
   * What is a dataset? How to slice datasets. Naming conventions.
   * Describing the catalogue itself
 * Interoperability
   * Update mechanism (revision feeds, e.g. Atom, CKAN or OAI)
   * Filtering/querying
   * Publishing paths and provenance (scripts etc.); derived datasets / recombining sets of data
   * Scraping / dealing with catalogues not in a nice form
   * Archiving datasets
 * Geospatial theme
 * Services theme
 * Best practice
   * Always publishing the source metadata directly
   * Publish a dcat representation
   * People doing it as their day job and just doing it in RDF

Agenda
Tuesday 3rd May 2011


 * 1000 BST: Introductions
 * 1100 BST: Coffee break and preparation for Show and Tell
 * 1130 BST: Show and Tell
 * 1330 BST: Lunch
 * 1400 BST: Presentation of data formats
 * 1445 BST: Pain points discussion
 * 1500 BST: Move upstairs for discussion of definitions of datasets, data catalogues, etc
 * 1800 BST: End of day one

Wednesday 4th May 2011


 * 1000 BST: Quick summary
 * 1010 BST: Dublin Core proposal
 * 1030 BST: Define standard for (i) data structure / schema (application profile), (ii) how to express that over the web
 * 1200 BST: Update mechanism: push / pull, formats (Atom / CKAN / OAI)
 * 1230 BST: Discussion of RDF over lunch
 * 1300 BST: Making Stuff
   * Build CKAN DCAT harvester
   * DCAT validator that encodes assumptions we've discussed
   * Typing up notes
 * 1700 BST: Conclusion
   * Roundup
   * Next steps
   * Review action items
 * 1800 BST: Meeting ends

List of participants

 * Jonathan Gray @jwyg / Skype: jwygray - Open Knowledge Foundation / PublicData.eu - http://okfn.org
   * Community Coordinator at the Open Knowledge Foundation
   * LOD2.eu - funded by the EC; core technologies for linked data
   * PublicData.eu - federating local, regional and national datasets from across Europe
   * OpenDataSearch.org - federating open datasets from around the world
 * James Gardner - Open Knowledge Foundation / PublicData.eu - http://okfn.org
   * CKAN project lead - http://ckan.net/ and http://ckan.org/
   * INSPIRE - UK Location Programme
   * Very interested in Linked Data - wants to make CKAN more useful for people who use/produce Linked Data
 * John Erickson @olyerickson - Tetherless World Constellation, RPI (USA) - http://tw.rpi.edu
   * Works with Jim Hendler
   * Linking Open Government Data (LOGD) Portal - http://logd.tw.rpi.edu
   * International Open Government Data Catalog. Demo: http://logd.tw.rpi.edu/demo/international_dataset_catalog_search
     * Fairly new initiative
     * Applies a custom metadata model
     * RDF of catalogue metadata
 * Hanif Rahemtulla - University of Nottingham
   * Geospatial Science Fellow
   * Interest in EU INSPIRE / Spatial Data Infrastructures
   * Provenance and Linked Geospatial Data
   * How to integrate provenance in catalogs
   * Empowering communities to use Open Data
   * Link: http://www.nottingham.ac.uk/computerscience/people/Hanif.Rahemtulla
 * Cesare Gerbino - CSI Piedmont
   * Worked on geospatial data
   * Keen to figure out how to connect the two initiatives - geospatial data and open data more generally
 * Saverino Reale - CSI Piedmont
   * Works on open data for Piedmont: http://dati.piemonte.it/
 * Richard Cyganiak @cygri - DERI, NUI Galway, Ireland
   * Background in Linked Data
   * One of the authors of dcat (Data Catalog Vocabulary, an RDFS vocabulary based on DC+FOAF+SKOS)
   * Using CKAN quite a bit for cataloguing RDF and Linked Datasets (used to generate the LOD cloud diagram)
   * One of his students is working on integration between CKAN and Google Refine for open government data
   * Hoping to advance the DCAT vocabulary - and to get it into the new W3C Working Group on Governmental Data
   * Wants to make progress on some of these things in the next couple of days
 * Addy Pope
   * Works at EDINA
 * James Reid
   * Develops a geospatial data catalogue
   * Chair, UK Location Programme - which defines schema and practical implementation
   * CSW harvesting from data.gov.uk
   * Interested in production of Linked Data from the catalogue - from XML (Will Waites is working on this)
   * Works partly with government, partly with the Location Programme
 * William Waites
   * Taking INSPIRE data sources and transforming them into RDF
   * Has worked with OKF
 * Christopher Gutteridge
   * Two hats: data.southampton.ac.uk and EPrints
   * Admin for http://data.southampton.ac.uk/ - University of Southampton Open Data site - currently maintained in a
   * Particularly interested in archiving datasets produced by universities
   * http://blogs.ecs.soton.ac.uk/webteam/ - team blog
   * http://blogs.ecs.soton.ac.uk/data/ - UoS Open Data blog
   * http://data-ac-uk.ecs.soton.ac.uk/ - directory of university open data sites
   * Lead developer for EPrints - can we make it compatible with CKAN?
 * Martin Alvarez-Espinar
   * CTIC (http://datos.fundacionctic.org)
   * Linked Government Data initiatives in Spain
   * Most catalogues publish RDF as static files - some have SPARQL endpoints to query metadata
   * Involved in many initiatives in Spain - including the national initiative
   * Interested in doing a proof of concept to federate data catalogues in Spain
 * Adrià Mercader (@amercader - http://amercader.net)
   * Part of the CKAN team (works on the harvesting API)
   * Interested in geospatial data
   * Here to watch and learn!
 * Peter Winstanley
   * Works at the Scottish Government

Show and Tell
5 min show and tell of products/ideas


 * James Gardner
   * CKAN: http://ckan.net/ and http://ckan.org/
   * Whole data use cycle, not just access - enabling people to clean up datasets and add them back
   * Interoperability
   * New functionality will be released in CKAN 1.4
 * Adrià Mercader
   * New harvesting tool, which will enable external data catalogues to be imported into CKAN
   * Add, view and edit harvesting sources - catalogue URLs, refresh, start harvesting jobs, etc.
   * Harvest interface
   * At the moment you can harvest from other CKAN instances and other data catalogues
   * In future we want to do things like synchronise between CKAN/data catalogue instances, have a network of instances, and so on
   * Geospatial capabilities in CKAN: WMS preview and spatial query
 * Jonathan Gray
   * Publicdata.eu and LOD2
   * Faceted browsing, e.g. http://publicdata.eu/?spatial=United+Kingdom&format=Excel
   * Communication, community building and demonstrators (e.g. http://energy.publicdata.eu/ee/)
   * Survey for users of open data catalogues: http://survey.lod2.eu
 * John Erickson
   * Faceted browser over a catalogue of datasets (20s): http://logd.tw.rpi.edu
   * Just getting started
   * Keen to learn more about what people are doing and think about ways to integrate
 * Richard Cyganiak
   * DCAT vocabulary
     * http://www.w3.org/egov/wiki/Data_Catalog_Vocabulary/Vocabulary_Reference
     * Relevant paper: http://blog.semantic-web.at/wp-content/uploads/2010/09/a37_cyganiak.pdf
   * Work has been done to make existing data catalogues DCAT-compatible - e.g. Friedrich's work with LOD2
   * Work-in-progress draft specification - no final specification yet
   * Hopes to advance this work in the W3C Governmental Data Working Group
   * Need a validator: some things are done differently in different places and there are divergences
   * Need to define an update mechanism
   * Need an up-to-date list of DCAT producers and tools
   * Question of federation
   * Next steps: reconciliation across datasets, shared mappings, Linked Data hosting
 * Will Waites
   * Presentation of http://semantic.ckan.net/
   * Data storage in RDF (4store) using DCAT
   * Harvests CKAN/JSON and ISO 19139 over CSW
   * Harvest plumbing and simple dereferencing HTTP server written in Go
   * UI written in JavaScript
   * SPARQL endpoint
   * All URIs of catalogue records are dereferenceable, with content negotiation to RDF or HTML (or Graphviz!)
   * Special cases for CKAN - e.g. LOD cloud / VoID
   * Examples:
     * http://semantic.ckan.net/record/jrc-img2k_pr1_at5_multi <-- note: not a UUID, argh
     * http://semantic.ckan.net/record/dcc6715c-bf94-4a89-bbf3-35933da795a5
   * Keeps source data around (e.g. ckan:json "json-blob")
   * "Cheats" to make a clone of the CKAN API
   * Software here: http://bitbucket.org/ww/gold
   * Some docs here: http://semantic.ckan.net/documentation/manual.html
   * Operational problems:
     * people abusing fields - see the UUID argh above
     * unconfigured or semi-configured catalogues - e.g. localhost:8080 in GetCapabilities
 * Christopher Gutteridge
   * University data catalogues
   * http://mind42.com/pub/mindmap?mid=605c3bad-3980-4d4b-9155-75b33af8860d
   * http://data-ac-uk.ecs.soton.ac.uk/
   * http://data.southampton.ac.uk/datasets.html
   * http://purl.org/openorg/ - namespace for open organisation relations not in existing schemas
   * Publishes metadata in JSON that can be pulled into CKAN
   * Need to 'bribe' people by giving them useful tools as a reward for publishing their data in a nice way - e.g. http://opendatamap.ecs.soton.ac.uk/
 * Martin Alvarez
   * Open data catalogues in Spain
   * Maintains the faceted browser of OGD initiatives: http://datos.fundacionctic.org/sandbox/catalog/faceted/
   * Using dcat in projects
   * Trying to use CKAN in initiatives, but most governments are "scared" of Python; they prefer to develop their own solutions using other technologies
 * Cesare Gerbino / Saverino Reale
   * dati.piemonte.it: state of the art and future trends
   * Next steps: integrate the geoportal into the main open data site using standards like CSW

Catalog Model
This isn't the model your data catalog must follow. But if you make a view of your data catalog using this model, it will make it easy for other catalogs to import your records, and therefore make your datasets more discoverable. This is intended as an interoperable core. Catalog publishers can include arbitrary additional data in addition to the concepts mentioned here. Catalog consumers will likely ignore such extra information.

We can revisit these proposals once we have experience from real implementations.

Data catalogs:


 * A catalog contains dataset metadata as first-class objects
 * It may also contain arbitrary collections of dataset metadata as first-class objects
 * And it may contain licenses as first-class objects

So far, interoperability is only defined at the dataset metadata level, not for collections and licenses; this will come later. The contents of a dataset are out of scope - we only care about dataset metadata.

Datasets:


 * Datasets may have a license.
 * Datasets may have distributions. Each distribution must represent all of the dataset's data, though in the case of services it may contain additional data. We won't get too cross about minor variations at the moment.

Distributions:


 * Distributions can either be directly accessible or indirectly accessible.
 * They can be of type 'files', 'services', or 'other'.
 * If they are indirect, the access URL tells a human how to access them.

Documentation:


 * Documentation regarding distributions may be linked to from the dataset.
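
The interoperable core above can be sketched as plain data structures. This is an illustrative sketch only; the class and field names here are assumptions for clarity, not part of DCAT, CKAN or anything agreed at the workshop.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of the interoperable core described above.
# All class and field names are assumptions, not a specification.

@dataclass
class Distribution:
    kind: str                       # 'file', 'service' or 'other'
    access_url: str                 # direct URL, or a page telling a human how to get the data
    direct: bool = True             # directly vs. indirectly accessible

@dataclass
class Dataset:
    title: str
    license: Optional[str] = None   # datasets may have a license
    distributions: List[Distribution] = field(default_factory=list)
    documentation: Optional[str] = None  # link to docs about the distributions

@dataclass
class Catalog:
    title: str
    # Dataset metadata as first-class objects; collections and
    # licenses as first-class objects are left out of this sketch.
    datasets: List[Dataset] = field(default_factory=list)
```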

dcat:permanentIdentifier
Add a new property on datasets specifically for an identifier that helps with deduplication when aggregating DCAT records from multiple catalogs:

> dcat:permanentIdentifier a rdf:Property;
>   rdfs:subPropertyOf dc:identifier;
>   rdfs:label "Permanent Identifier";
>   rdfs:comment "A permanent, universally unique identifier for the dcat:Dataset.".

This could be a UUID or an HTTP URI or whatever else, as long as it is unique. The ID must be kept intact even if the underlying dataset or any part of the metadata is changed. Aggregators must keep the identifier unmodified, even if they modify or enrich the dataset metadata.

This is distinct from dc:identifier because dc:identifier doesn't have to be globally unique; it's just an identifier with *some* scope.

(note: Atom's atom:id is interesting here -- must be an IRI, "permanent and universally unique", can be resolvable but Atom processors MUST NOT assume that it is: http://tools.ietf.org/html/rfc4287#section-4.2.6 )

There's consensus that this is a good idea.
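
As a sketch of how an aggregator might use such an identifier for deduplication - the record structure and the "permanentIdentifier" key name are assumed here purely for illustration:

```python
# Sketch: deduplicating records from several catalogs by permanent
# identifier. Records without an identifier cannot be safely merged,
# so they are passed through unchanged.

def merge_catalogs(*catalogs):
    """Merge lists of record dicts, keeping one record per permanent ID."""
    merged = {}
    unidentified = []
    for catalog in catalogs:
        for record in catalog:
            pid = record.get("permanentIdentifier")
            if pid is None:
                unidentified.append(record)   # no safe dedup key
            else:
                # Later catalogs may enrich the metadata, but must
                # keep the identifier itself unmodified.
                merged.setdefault(pid, {}).update(record)
    return list(merged.values()) + unidentified
```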

Subtypes of dcat:Distribution
We want to have two dimensions:

 1. direct access vs. indirect access
 2. download vs. API vs. other

DCAT currently offers four options, implemented as subclasses of dcat:Distribution: dcat:Download, dcat:Service, dcat:Feed, and unspecified (plain dcat:Distribution). The effect of using these subclasses is that they clarify what dcat:accessURL refers to. For a dcat:Download, dcat:accessURL would be the direct download URL. For a dcat:Service, it would be an endpoint URL. For a generic dcat:Distribution, one would expect that it points to a web page that describes how to get the data (e.g., click this link here.)

dcat:Download maps to direct access download. dcat:Service maps to direct access API. dcat:Feed would probably count as direct access API as well.

DCAT doesn't allow distinction between download/API/other for indirect access. All of those would use plain dcat:Distribution.

Perhaps: have properties dcat:accessURL for machine access and foaf:page for human-readable access information. A problem with that is that some catalogs don't provide information about what's behind the link. The link could point directly to a CSV file, or to a page that contains download instructions for a CSV file.

Perhaps: accessURL as the general property, and directURL with the same value if and only if we know that it's a direct download/endpoint URL.

This requires further discussion. Perhaps work through some examples and how they'd be best expressed.
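
One way to picture the "accessURL plus optional directURL" idea is as a small decision rule. The property names follow the discussion above; they are proposals under debate, not part of any released DCAT draft:

```python
# Sketch of the accessURL/directURL proposal: accessURL is always
# emitted; directURL is emitted with the same value if and only if we
# know the URL is a direct download or endpoint URL.

def distribution_properties(url, direct_access):
    """Return the properties a catalog would emit for one distribution."""
    props = {"accessURL": url}
    if direct_access:
        props["directURL"] = url   # known direct download/endpoint URL
    return props
```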

We still need to discuss collections and licenses at a later date, and possibly themes and tags.

Proposed Harvest Mechanism
The Catalog Harvest Endpoint is a URL, e.g.:


 * http://catalog.example.com/api/catalog
 * http://catalog.example.com/api/catalog.json
 * http://catalog.example.com/api/catalog.ttl

The endpoint must return the description of the data catalog in that format, plus the capabilities of the endpoint: catalog metadata, filtering capabilities (if any), and the endpoints for other formats. It may return all matched results in that page, or some, or none, as it chooses. If it returns some or none, it must include a URL for fetching the next set of results, if there are any more. Subsequent results pages (if any) may repeat the catalog description, but need not. Every endpoint should list every endpoint of other formats, including itself. Clients are assumed to obey the crawl-delay specified in the robots.txt file.

It may accept the additional parameters callback=xxx and from=XXX, where XXX is an ISO date of the form YYYY-MM-DDTHH:MM:SSZ; if given, from MAY limit the results to records changed since that time. Other filters may be supported; the repository must accept them, although it may ignore them and return all items, matching or not.
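
A harvesting client for this scheme might look like the following sketch. The "records" and "next" keys are assumptions for illustration only, since the proposal does not fix a serialisation; `fetch` stands in for an HTTP GET (one that honours the robots.txt crawl-delay).

```python
# Sketch of a harvesting client that follows the paging scheme above.
# A small publisher's single static file simply has no "next" link.

def harvest_records(fetch, endpoint_url):
    """Yield all catalog records, following next-page links until done.

    `fetch` is any callable mapping a URL to a parsed page (a dict),
    e.g. a wrapper around an HTTP GET.
    """
    url = endpoint_url
    while url is not None:
        page = fetch(url)
        for record in page.get("records", []):
            yield record
        url = page.get("next")   # absent on the last page
```

With an in-memory stand-in for `fetch`, `harvest_records(pages.get, start_url)` walks every page and yields each record in order.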

For each catalog record, the only important information to be returned is:


 * The GloballyUniqueID value
 * The URL to obtain the record in the same encoding format
 * An indication if the record has been removed from the catalogue (removal timestamp)
 * Catalog record last modified date

A record may be any of a collection, a license or a basic dataset.

A consumer of this service can then use the "URL to obtain the record in the same encoding format" to make a GET request for the serialised DCAT of the combined "CatalogRecord and Dataset" metadata.

The broad goals of this are:
 * To make it possible for a small catalog publisher to create just one static file for their catalog (including all identifiers) and one static file per dataset's metadata, while still allowing large catalog publishers to support paging. Some of the effort is pushed to the client here.
 * To allow web-based mashups as well as rich RDF exchange (the implication is that common JSON serialisations must be possible and that the RDF cannot get too complicated)

In either case we will document a recommended way of representing the underlying data in both Turtle and JSON.
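
For the small-publisher case, then, the whole endpoint can be a single static file. A minimal JSON sketch follows; all field names are illustrative assumptions, since the proposal fixes only the four pieces of information each record must carry:

```python
import json

# Sketch of a minimal single-page static catalog for a small publisher.
# Field names are made up for illustration; only the four required
# pieces of per-record information from the proposal are shown.

catalog_page = {
    "title": "Example Town Open Data",
    "records": [
        {
            "id": "urn:uuid:dcc6715c-bf94-4a89-bbf3-35933da795a5",  # globally unique ID
            "url": "http://catalog.example.com/record/spending.json",  # record in same format
            "modified": "2011-05-04T12:00:00Z",  # last modified date
            "removed": None,  # or a removal timestamp
        }
    ],
    # No "next" link: everything fits in one page.
}

print(json.dumps(catalog_page, indent=2))
```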

DataCatalogs.org
There are currently several lists of open data catalogues that are maintained by several different organisations. These include:


 * http://datos.fundacionctic.org/sandbox/catalog/faceted/
 * http://logd.tw.rpi.edu/demo/international_dataset_catalog_search
 * http://opengovernmentdata.org/catalogues
 * http://ows-search-engine.appspot.com/index?st=CSW
 * http://semantic.ckan.net/snorql/?query=SELECT+DISTINCT+%3Fsource+WHERE+{+_%3Arecord+dc%3Asource+%3Fsource+}+ORDER+BY+%3Fsource
 * http://rdf.opendatasearch.org/

Why does this need to change? Currently:


 * there is duplication of effort
 * people may not know where to submit new data catalogues to
 * it is not clear which list is the most up-to-date or comprehensive
 * users may be confused as to who to contact or which list to use
 * it may not be clear how to contribute

Hence we would like to propose datacatalogs.org, which:


 * would publish a single, shared, comprehensive, canonical list of data catalogues around the world
 * would be a joint project of key organisations and individuals interested in open data and data catalogues - including researchers, NGOs, developers, public bodies and others
 * would be an authoritative source for anyone interested in data catalogue federation, aggregation, interoperability, etc
 * would accept contributions from public, which would be curated by a group of key organisations and individuals
 * would be machine readable, openly licensed, and easy to duplicate and integrate with other services
 * would enable people to easily make sub-lists - e.g. official data catalogs, EU data catalogs, city data catalogs, etc.

Phase 1 would enable manual entries, moderated by a group of administrators. Phase 2 would be a fully distributed harvesting model.

Key features for phase 1:


 * Internationalisation of interface
 * Map showing locations of data catalogues (like CTIC)
 * Faceted browsing (like RPI)
 * Basic set of metadata - taking into account CTIC/RPI metadata models: http://bit.ly/ilCUfK

Phase 1
Action items to be completed by next meeting - in Berlin in August.


 * Chris: Create list of data.southampton.ac.uk IDs to make it fully CKAN compatible
 * LOD2: Pull Euskadi data into PublicData.eu
 * LOD2: Piedmont data into PublicData.eu
 * ?: Steal/borrow useful things from OAI ;-)
 * James: write to ckan-discuss list about data model agreement and harvesting proposal -> once formalized move to ckan wiki at wiki.ckan.net
 * Richard: DCAT blog post
 * James: Have datacatalogs.org up and running (requires flexible extras, new i18n to treat packages as catalogs)
 * James: Have CKAN implement the DCAT export/import proposed
 * James: Have CKAN support moderation (ie user contributions that can then be accepted)
 * James: Have auth support to allow complex permissions in the moderation process
 * Jonathan: organise date/location for next meeting
 * Lucy: Write a short page making case for 'why CKAN' and 'why Python'
 * James: Write a proposal for how to change the CKAN data model to better fit the data model agreed here as well as DGU use cases
 * DONE: Register datacatalogs.org
 * James/Jonathan: Set up new CKAN instance with 'package' renamed to 'catalog' and a custom form - with moderation enabled (by mid June).
 * Jonathan: Get people/organisations on board as (i) participating organisations (ii) curators
 * Jonathan: Make sure we have representative from lots of different official data catalogues
 * Jonathan: Get everyone to link to datacatalogs.org! :-)
 * Look at ROAR for inspiration re: interface and for ideas on what we may/may not want
 * Richard: come up with a basic set of metadata based on John and Martin's findings - and chase others! :-)
 * John (Martin): create initial list to load into CKAN instance
 * Martin (Will): enable exploration of datacatalogs.org using
 * John: check with the RPI team about applying faceted browsing to the CKAN-hosted model
 * Richard: Explore informally, by OKCon, the timeframes for submitting into the W3C standards process.
 * James/Jonathan: tell Friedrich what's going on!

Phase 2

 * A scraper for each data catalogue
 * Create a DCAT validator
 * Migrate everything to datacatalogs.org

Photos
Photographs of the event are available at:


 * http://www.flickr.com/photos/okfn/sets/72157626638023662/