
= Open Government Data Catalogues Meeting =


 * What: Virtual workshop about open data catalogs and requirements
 * When: 11.2.2010 16-19 CET
 * Where: Skype + Etherpad + wiki

Participants

 * Antti Poikola (Finland), http://twitter.com/apoikola
 * Finnish open data advocate, guidebook, Apps for Democracy, Datacatalogs
 * Jonathan Gray http://twitter.com/jwyg
 * community coordinator: open knowledge foundation
 * ckan.net + data.gov.uk
 * jwyg on irc + jwygray on skype
 * founder: wheredoesmymoneygo.org
 * Luigi Montanez
 * LuigiMontanez on Twitter, Skype,
 * Software Developer, Sunlight Labs (US)
 * http://nationaldatacatalog.com (in beta)
 * Tom M. van Engers Skypename: LeibnizRules
 * Vassilios Skype name: vassilios.peristeras (Norway)
 * Researcher at DERI, NUI Galway, eGov Cluster Leader
 * Fadi Maali Skype name: maali.fadi (Ireland)
 * (email: fadi.maali at deri.org)
 * Richard Cyganiak, Researcher at DERI, NUI Galway, Ireland
 * (skype: richard.cyganiak; twitter: @cygri; richard@cyganiak.de)
 * Interested in interoperability of data catalogues and exchange of statistical information
 * Interested in development of applications that can work across catalogs, e.g., for visualising statistical information
 * Daniel Dietrich (Germany)
 * Skype ddie22
 * Twitter & IRC: ddie
 * Chairman Opendata Network Germany http://opendata-network.org
 * German Chapter of the OKFN http://okfn.de
 * Starting German instance of CKAN http://de.ckan.net
 * Running Opendata Hackday and Apps4Democracy in Germany http://opendata.hackday.net http://apps4democracy.de
 * James Burke (Netherlands)
 * HackdeOverheid.nl, lifesized.net
 * Rufus Pollock - rufuspollock
 * Open Knowledge Foundation (co-founder)
 * ckan.net, wheredoesmymoneygo.org ...
 * David Eaves (Canada) - Status: on IRC, not on call
 * Advises the government
 * Open Data activist and adviser to the City of Vancouver (Canada)
 * http://eaves.ca email: david at eaves.ca
 * Jussi Arpalahti skype: jussi.arpalahti not on any calls, only skype chat and IRC
 * working on opengov.se codebase for opengov.fi as privately contracted
 * also works in Finland's central statistics office's dissemination unit
 * so interested both as a developer and a data distributor
 * Petri Kola - status: on IRC, skype-chat, not on call
 * skype: petrikola, irc: pe3
 * http://petrikola.org
 * Researcher from Aalto University, Helsinki, Finland
 * writing a strategic guide about opening up public data resources in Finland together with Antti Poikola and Kari Hintikka
 * involved in Apps4Democracy Finland and other open data related networks & projects
 * Martin Alvarez Skype user: martin.alvarez
 * Daniel Bennett (skype/twitter/etc citizencontact)
 * contact daniel at citizencontact dot com
 * CTO, eCitizen Foundation
 * W3C eGov: http://www.w3.org/2007/eGov/IG/wiki/ProjectData
 * Ed Summers (skype:inkdroid irc:edsu homepage:http://inkdroid.org)
 * (web) software developer at the Library of Congress
 * digital preservation: e.g. digital newspaper preservation http://chroniclingamerica.loc.gov
 * w3c egov working group member
 * Greg Lawrance, Spatial Metadata Services, GeoBC
 * BC Government Canada (geobc.gov.bc.ca)
 * David Read (irc: dread)
 * CKAN / data.gov.uk

Group 1: technical/interoperability
Who:
 * Rufus (HOST) rufuspollock
 * Petri (skype: petrikola)
 * Jussi (skype: jussi.arpalahti)
 * Richard (skype: richard.cyganiak)
 * Fadi (skype: maali.fadi)
 * Ed (skype: inkdroid)
 * Luigi (skype: LuigiMontanez)
 * [add your name here]

Questions:
 * can we define a metadata standard for exchanging catalog information? (Richard, Fadi, Ed, Jonathan)
 * what would the requirements for such a standard be?
 * what is there already in ckan?
 * how can we (do we need to) share updates for new datasets (Ed)
 * how can cross catalog queries work?
 * What features should we include for users -- what users do we anticipate having?
 * are there other oss data catalog projects with code, like opengov.se, ckan.net and sunlight labs' http://nationaldatacatalog.com/
 * should there be some naming conventions (in urls) - e.g. the word: "data"
 * is there some machine readable format for describing the quality of the data - e.g. method how the data has been collected
 * [please add your own]

What's happening?

 * Finland: local catalogues and national ones; there may be official and privately run ones
 * private, national-scope: http://opengov.fi/
 * national - temporary solution: http://suomi.fi/datakatalogi
 * UK situation: data.gov.uk but also a london data catalogue, and Irish data. (are these all on CKAN? are these interoperable pushing and pulling data from each other?)
 * Ireland: catalogs with different thematic scope: one for statistics, one for geographical data
 * Most important purpose of catalogs is discovery of data -- find the things that are scattered across government
 * Basic dublin core metadata
 * One goal: getting the catalog metadata into the search engines, make it findable
 * Ed: make sure that dataset pages are crawlable
 * Ed: above that: there are RDFa crawlers, Yahoo SearchMonkey
 * Rufus: in CKAN, we do redirects to an RDF store that sits beside the catalog
 * Rufus: http://semantic.ckan.net
 * Rufus: We put the metadata into the RDF store weekly
 * Rufus: That's great for developers. but RDFa might be better for search engines
 * Rufus: Initially we didn't want to build a catalog with CKAN, but the aim was to have a focal point for starting to find/clean up/organise data, a bit like Linux package managing -- make those datasets usable by tools
 * Ed: Great goal to have, seen from a CPAN background. Can someone do "ckan install package123"?
 * Rufus: datapkg: http://knowledgeforge.net/ckan/doc/datapkg
 * Rufus: problem in the catalogs: it's really random what you point to; might be a CSV, might be a SPARQL endpoint, or any number of other formats
 * Jussi: http://opengov.se http://test.opengov.fi
 * Petri: We want people to put requests for data into opengov.fi, so it's not just about listing what's open but also about making requests
 * Rufus: http://isitopendata.org/ - inspired by whatdotheyknow.org
 * Rufus: In the UK you can request data to be opened (freedom of information requests) - data request is different
 * Rufus: Idea with IsItOpen was to do similar for data
 * Rufus: You could run this as an application separately from CKAN
 * Petri: This is about EU PSI directive; in UK you have an organisation that you can ask to open up data; we don't have this in Finland
 * Richard: not in Ireland either
 * In Finland and in Ireland FOI is not nearly as strong as in UK or USA
 * Rufus: Starting point for us was about science, getting the data out of journal articles etc
 * Richard/Fadi: metadata list and fields on data.gov directly as a package (which is good)
 * ckan.net had ckan as a package from the start ;) http://www.ckan.net/package/ckan
 * and on data.gov.uk via rdfa or via json/csv dump: http://www.ckan.net/dump/
 * drafting an ontology for catalogues -- richard: you must speak to wwaites about http://semantic.ckan.net/ (he's been using scovo and thinking about void)
 * http://knowledgeforge.net/ckan/semantic/ is the mercurial repo
 * Ability to get the dataset: download url or ...
 * http://www.ckan.net/package/ukgov-finance-cra - has an xls page. equivalent description on data.gov.uk is http://data.gov.uk/dataset/public_expenditure_statistical_analyses
 * http://www.ckan.net/package/tiger-geodata - just a landing page
 * http://knowledgeforge.net/ckan/trac/wiki/PackageResources
 * Rufus: 40% of CKAN packages don't have a download URL,
 * Rufus: or often it just goes to some index page of downloadable files
 * Rufus: In CKAN you can say in the search whether you want "downloadable" only
 * Ed: A vocabulary should distinguish between the kind of download/access link (go here for the CSV, here for the XLS, here for the SPARQL endpoint)
 * Rufus: People want to associate different URLs with one package, e.g. (a) download comes in different parts; (b) download of same data in different formats (RDF vs JSON or TAR vs ZIP etc)
 * Which are the relevant formats? CSV? RDF? Are any domain specific XML-formats interesting?
 * apis: these can also count as download urls
 * Rufus: No one puts their package descriptions in a coherent way; we hope that these catalogs make this better
 * Rufus: Someone could make a package format for data (like Debian .deb files)
 * Ed: wondering if `datapkg install http://data.gov.uk/dataset/public_expenditure_statistical_analyses` is a potential use case. treating the dataset web page as the metadata description document (is that crazy?)
 * datapkg install ckan://ukgov-finances-pesa mydirectory
 * is this already working somehow?
 * only difference there is the ckan vs http URI? *how* the metadata description is retrieved
 * Rufus: An API should not really count as open access to data
 * Richard: Yes, API (e.g. SPARQL endpoint) is great for poking around or simple apps, but for serious use you want your local copy in your local store, so you need a download URL
 * Petri: How does CKAN do updates? Update feed?
 * Rufus: http://www.ckan.net/revision/ that's human-readable + Atom feed (http://www.ckan.net/revision/?format=atom)
 * Petri: Statistics people know a long time in advance when they will have updates to their statistics available. Can you have RSS feeds of upcoming updates? Would be interesting to know about the schedule
 * CKAN: feed just shows changes to metadata
 * Rufus: CKAN has version field on each package and you could use that to indicate changes to underlying dataset
 * Ed: with the command line installer you'd want to periodically update the packages
 * Ed: what about datapkg install http://.... -- use rdfa in the page to get metadata and commence installation
 * datapkg http://data.gov.uk/dataset/public_expenditure_statistical_analyses or datapkg http://data.australia.gov.au/437 or datapkg http://opengov.fi/data/13/
 * http://knowledgeforge.net/ckan/doc/datapkg
 * Richard: http://semanticweb.org/wiki/VoiD could be used in descriptions to describe pure-rdf datasets.
 * We had already been looking heavily at VoID for use on semantic.ckan.net :)
 * Rufus: one of the things we've thought about is that datapkg could support plugins. For example, a plugin could work out where to get a data dump or a sparql endpoint from, based on a VoiD description.
 * datapkg spec://{some-id}
 * 'spec' a parameter like 'ckan'
 * datapkg void://{id} -> load void plugin -> read void metadata -> locate sparql endpoint or data dump
 * datapkg http://....
 * Richard asks: what metadata do i need
 * Rufus: Download url
 * Richard: all very well but need more specification of the actual nature of the data
 * To have really automated use you need *Knowledge APIs* (automated install is the easy part)
 * Ed: Let's say a data package installs some SQL. here's some science fiction: The data package has a dependency on a database, hence it will install MySQL
 * Or if it's a SPARQL endpoint, it might install a client library
 * "Dependencies" are next on our list of features for ckan ...
 * Rufus: Would be great to have a group/community of people where we put other people's data into a standard format and package it up in a std way
 * Ed: very similar mission to Linking Open Data (good thing)
 * Rufus: we're doing quite a bit of linked data (both ckan and with other projects) (is that rdf ckan dump available at a url? :-)
 * Richard: what's the use of all this data? (and particularly: what's the use of packaging it up cleanly?)
 * Rufus: we have a frontend project such as http://www.wheredoesmymoneygo.org
 * Ed: Richard's question is extremely important, to not just look at the technical side, but the *why*
 * Ed: my personal perspective (from loc.gov) is "lots of copies keeps stuff safe" - preservation of data http://en.wikipedia.org/wiki/LOCKSS and getting people to install (copy) datasets helps to keep it around :-)
 * Petri: can ckan help in having an efficient diff for datasets?
 * Rufus: A catalog might make it easier to fork data, improve it, and roll it back in; that would be an incentive for the data owner to publish the data on the catalog
 * Sidebar: CKAN's domain model:
 * Key object is: Package - has arbitrary metadata - key/value
 * Associated to it:
 * tags
 * groups
 * Package resources
 * CKAN's domain model: http://lists.okfn.org/pipermail/okfn-help/attachments/20100115/8ba30e81/attachment.py
 * Why?
 * Getting overall vision of what is out there
 * Big datasets are often relatively easy to find, but small datasets are not
 * Good markup (e.g. rdfa) into pages so they get found by search engines
 * Catalogue as a focal point for requests
 * Catalogue as enabling reuse
 * Catalogue assists preservation
 * Catalogue helps on distributed data development
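
A rough sketch of two ideas from the notes above: the package model from the sidebar (key/value metadata, tags, groups, package resources) and the `datapkg {spec}://{id}` plugin idea. This is not datapkg or CKAN code. All names (`Resource`, `Package`, `PLUGINS`, `resolve`, the `kind` values) and the example.org URLs are made up for illustration, and the plugins return canned metadata instead of talking to a catalogue.

```python
from dataclasses import dataclass, field

# Assumption: which resource kinds count as a real download
# (as opposed to a landing page or an API endpoint).
DOWNLOAD_KINDS = {"csv", "xls", "rdf", "json", "zip", "tar"}


@dataclass
class Resource:
    """One access link of a package (hypothetical shape)."""
    url: str
    kind: str  # e.g. "csv", "xls", "sparql", "landing-page"


@dataclass
class Package:
    """Key object: arbitrary key/value metadata plus tags, groups, resources."""
    name: str
    extras: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)
    groups: list = field(default_factory=list)
    resources: list = field(default_factory=list)

    def downloads(self):
        """Return only the resources that point at a downloadable file."""
        return [r for r in self.resources if r.kind in DOWNLOAD_KINDS]


# Maps a spec name such as "ckan" or "void" to a loader function.
PLUGINS = {}


def plugin(spec):
    """Decorator that registers a metadata loader for one spec name."""
    def register(func):
        PLUGINS[spec] = func
        return func
    return register


@plugin("ckan")
def load_ckan(identifier):
    # Stand-in for a catalogue lookup: canned metadata, no network access.
    return Package(
        name=identifier,
        tags=["finance"],
        resources=[
            Resource("http://example.org/pesa.xls", "xls"),
            Resource("http://example.org/pesa/", "landing-page"),
        ],
    )


@plugin("void")
def load_void(identifier):
    # Stand-in for reading a VoID description that only lists an endpoint.
    return Package(
        name=identifier,
        resources=[Resource("http://example.org/sparql", "sparql")],
    )


def resolve(spec_uri):
    """Split 'spec://id' and hand the id to the plugin registered for spec."""
    spec, sep, identifier = spec_uri.partition("://")
    if not sep or spec not in PLUGINS:
        raise ValueError("no plugin for %r" % spec_uri)
    return PLUGINS[spec](identifier)


if __name__ == "__main__":
    pkg = resolve("ckan://ukgov-finances-pesa")
    for res in pkg.downloads():
        print(pkg.name, res.kind, res.url)
```

Keeping a `kind` on every resource is one way to get the distinction Ed asked for (CSV vs XLS vs SPARQL endpoint), and it makes a "downloadable only" filter a one-liner. A package that only has a landing page or an endpoint then simply has no downloads.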

Group 2: user interface/experience + community
Who:
 * Jogi (skype:antti.poikola)
 * Jonathan: (skype: jwygray / irc: jwyg) (HOST?)
 * Daniel Dietrich (skype: ddie22 irc: ddie)
 * James (skype:jimmzy irc: james_burke)
 * David Eaves (skype: david_a_eaves / irc: deaves)
 * Philip Ashlock (skype: philipashlock / irc: philipashlock / twitter: philipashlock)

Questions

 * What kind of features should a catalog site have in order to help in building a community? Not just technical superiority, but nice UI, social media, discussions?
 * See James' contribution at the end of this pad
 * how to start looking at all these questions? technically vs. result oriented? (ddie)
 * what expectations of user interface and functionality do users / editors have
 * where/how should governmental organizations publish their data in a country with no data catalog
 * are there other oss data catalog apps like opengov.se and ckan (how about data.gov.uk and data.australia.gov.au)
 * [please add your own]

Notes for guide to data catalogues
1. What is a data catalogue for?


 * supporting data reuse
 * discovery
 * data geeks understand, but how to make it more popular?
 * plugin for search engines
 * place to find the small things (national statistics can be found anyways)

2. Benchmark of existing catalogs
 * Who is behind the catalogue (official, community driven, city, national)
 * What are the features frontend?
 * What are the features backend?
 * What technology is used?
 * Is the catalogue organizing "only" metadata or also storing data?
 * What data is stored in the catalogue (only open, also closed datasets, only official or also twitter API)
 * Is there good documentation & technical support?

3. Platform features
 * data versioning
 * interoperability
 * hosting? (storage vs registry)

4. What features are desirable?
 * code examples to use the API
 * possibility to have a conversation around the data sets

Also:


 * Tutorial for different systems?
 * Questions

Action items

 * Jonathan: look into drupal code from data.gov.uk
 * Jonathan: to look into possibility of 'approval queue' for admins?
 * Jonathan: more info about CKAN + what it's used for + sustainability + who is behind it
 * Think about CKAN as middleware: making it *easy* to retheme etc.
 * Show number of downloads of data + more stats
 * Document for people interested in building a data catalogue - towards a common infrastructure for data catalogues - introduction to data catalogues
 * feature ideas to ckan-discuss + adding things to trac
 * regular meetings on ckan

Minutes from IRC
(starting only halfway into the call, sorry)

main topics for today: (i) interop between catalogs (ii) who is interested in starting new catalogs (iii) expectations/requirements for catalogs

We are adding issues, questions and potential discussion topics on PiratePad

Notes from James Burke on CKAN user experience

 * User experience ideas/comments, please parse this into main doc. (james burke)
 * Possible users
 * for developers outside government
 * for the IT guy inside government
 * for the government worker
 * for journalist non-technical person
 * These are the main actors for a data catalogue system.
 * Phases of use:
 * Awareness > Trial/1st time usage > Common Use > Abandon Use
 * User experience
 * Logging In/Sign up:
 * 1) Not directly clear how to use your other IDs to log in. Would be easier to have a log-in that offers more accounts on first page
 * Adding Data:
 * 2) Word 'package' is a little unclear. Can't you just have a button "Add data".
 * 2a) There is no general help on http://www.ckan.net/package/new on what to fill in on this page, and there is an awful lot to fill in. perhaps opt for something simpler on the front page like delicious.com, which lets you get started even if you don't have an account.
 * 2b) There is no clear purpose placed anywhere visible for contributors of data to CKAN
 * Search/Finding data
 * Pretty well handled for beginning, but will need work as content blossoms.
 * User pages
 * Shows commits to packages. Very geeky, not so human/social?
 * Data quality/ Ease of reuse
 * There is no clear rating on how open or closed the data is, or how fast you can gain access to it in a reusable format. For instance: http://www.ckan.net/package/ukgov-finance-main-supply-estimates You get a link, inside which you are presented with a website full of further links. Data could be in .pdf or closed format. (how about adding a game logic to encourage useful entries)
 * Navigation
 * Download button has a styling error in Chrome, green arrow is appearing on line above text so no clear connection between button and function.
 * Easy to download and install?
 * Can't find where the download is. Maybe it's deliberately doing this. It's just not going to get a lot of development if the source code is so hard to find....
 * More visually easy way to see data sets and data behind them (see http://data.vancouver.ca/datacatalogue/index.htm )
 * Visual Design
 * Lacks any kind of design touch