Projects/Open Data Handbook

''We are producing a handbook that can be used by anyone but is especially designed for those seeking to open up data. It discusses the why, what and how of open data – why to go open, what open is, and how to ‘open’ data. The handbook is online at http://opendatahandbook.org/.''

= Resources =


 * Official wiki (for development): Here
 * Version control for source:
 * Mailing list: http://lists.okfn.org/mailman/listinfo/open-data-handbook
 * Project coordinator: Laura Newman - laura [dot] newman [at] okfn [dot] org
 * Deprecated: original Google Docs version of the handbook

= Contributing =

The first step to contributing is becoming a part of the team and joining our mailing list.

On the content side, the team's authors write new content. Typically, this involves creating new sections of the handbook and emailing them to the team. This content is then edited by our editor, who merges it with the existing material. Once integrated, the new sections are translated by our translators.

Content is supplemented by support functions. Our designers help by creating diagrams, page layouts and themes. Finally, we need technical support from our build manager, who converts the source documentation, theme and translations into the wonderful handbook we see at opendatahandbook.org. All of this is overseen by a project coordinator.

Here is an overview of each of the roles:

Author
Responsibilities
 * to produce new content
 * to provide improvements to any of the current content

Typical activities
 * drafting content in an area of the author's expertise
 * creating a paragraph or two on a data format that is not currently covered
 * finding new sources for the glossary

Process
 * pick a topic area from the list below
 * notify the mailing list that you'll be doing this
 * Draft your text in whatever format is most comfortable for you.
 * Email your new material to the Project Coordinator (see above), or put it on Google Docs and send a link to the mailing list

Translator
We use Transifex to manage translations. The system allows you either to work offline and upload your work, or to translate in your browser. Every project within the Transifex system collaborates with the others.

https://www.transifex.net/projects/p/opendatahandbook/resource/allpot/chart/image_png?.png

Responsibilities
 * to translate current content into local languages
 * to add local examples

Process:
 * Create an account at transifex.net (the Free account is fine).
 * Go to the Open Data Handbook project on Transifex.
 * Pick your language, and request to join the team. If your language is not listed, contact laura.newman[@]okfn.org
 * You will need to wait for your request to be accepted (contact Laura if this doesn't happen promptly).
 * Click on Resources > all.pot > your language.
 * In the pop-up box, click 'translate now' and get translating!

How to do references and terms:
 * It's not necessarily evident how to format multilingual references in Transifex. The following guide should help :)


 * The basic format is:
 * :role:`title <target>`
 * So for glossary:
 * :term:`{my-translated-term} <{original-reference}>`
 * And for pages:
 * :doc:`{my-translated-page-name} <{original-page-name}>`

(NB: These (`) are **backticks**, not single quotation marks (')).
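
As a minimal sketch of the format above, the following Python helper assembles a reference string from its three parts. The function name `make_reference` and the German example titles are hypothetical, chosen only for illustration. It assumes Sphinx's usual cross-reference form, in which the untranslated target goes in angle brackets after the translated title.

```python
def make_reference(role, translated_title, original_target):
    """Build a Sphinx-style cross-reference with a translated title.

    Hypothetical helper for illustration; not part of Sphinx or Transifex.
    """
    # The original target stays untranslated so that the link still resolves.
    return ":{}:`{} <{}>`".format(role, translated_title, original_target)


# Example titles are invented for illustration.
print(make_reference("term", "Offene Daten", "Open Data"))
print(make_reference("doc", "Einleitung", "introduction"))
```

Only the title is translated; the role name and the target are copied unchanged from the source string.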

Editor
Responsibilities:
 * Become familiar with Sphinx, git, GitHub and reStructuredText
 * Review and proofread submissions
 * Convert incoming text to reStructuredText
 * Merge designers' contributions into the text
 * Manage assignment of topics between authors
 * Facilitate mailing list discussions

Typical activities:
 * Becoming familiar with Sphinx, git, GitHub and reStructuredText
 * Communication with authors
 * Converting plain text sources to reStructuredText
 * Ensuring that any images and other creations from designers are integrated into the content

Designer
Typical activities:
 * Creating CSS-based themes for the HTML handbook
 * Creating images and visualisations for inclusion as part of the handbook
 * Conducting cross-browser testing of layouts
 * Developing typesetting guidelines for a print version of the handbook
 * Participating in mailing list discussions

Responsibilities:
 * Developing a new theme for the handbook
 * Maximising the handbook's readability
 * Ensuring cross-browser compatibility

Build Manager
Responsibilities:
 * gaining a good understanding of git, GitHub, reStructuredText and readthedocs.org
 * building the handbook from source

= Micro-tasks =

Only have half an hour to help? Here are some things that would be great if you could look into:

Code related

 * tiny geocoding guide, some ideas
 * how to get started with GeoNames
 * take a spreadsheet from an open data catalogue (thedatahub.org) and plot that data onto a map
 * strategies for bucketing time series data, e.g. an introduction into plotting and communicating many-dimensional data

Statistics related

 * a guide to types of graphs: pick your favourite (box & whisker, pie chart, ...)
 * write a few paragraphs on strengths and limitations of using means, standard deviation and other aggregate measures
 * terminology for humans: explaining what point estimates, confidence intervals, and other terms in statistics mean

Communications related
There is some useful semi-open (CC-BY-NC-SA) content at Wired's Howto wiki (http://howto.wired.com/wiki/Open_Up_Government_Data).

Visualisation related
It would be great if we could include more pictures in the handbook. If you know of any freely licensed visualisations, please send a note to the mailing list and we'll get them included in the handbook.

= Sections that need authors =

'''NB: This list is currently out of date (as of 6/3/12). We will be updating it shortly based on the feedback we have received from the v1.0 release'''

This is a list of ideas for sections of the handbook that do not currently exist. Please add your name to claim a topic. Topics are presented in rough order of priority.

Generating information from data
Interpreting raw data, then generating visualisations and prose on the basis of it

The trick here is to avoid being too much like a statistics textbook. There are plenty of those already, and it would be best to point people towards them.

Concepts to cover:


 * aggregate statistics
 * categorical
 * qualitative data
 * statistical tests??

Visualising data
Perhaps an overview. Again, we don't want this to be too much like a statistics textbook.

Gathering data
Techy: How to set an Arduino up as a temperature sensor and host a feed on pachube.

Standard: What makes up a good survey

Discipline-specific material
At the moment, the handbook is largely focused on government. Can we branch the contents towards:

 * academia: how to store your data for publications and confirmation of results
 * science: what to do when there are huge volumes of data
 * commerce: what can businesses do to engage in the open data movement?

Linked Data
Concepts to explain: URI, Semantic Web, Giant Global Graph, OWL, SPARQL, ...

Some possible things to cover:


 * Reconciliation & disambiguation.
 * A guide to the community - how to get involved if you're not an academic
 * An overview of other resources, what they contain, what their audiences are
 * Crawling the graph.
 * (possibly) Introducing concepts related to predicate logic/artificial intelligence.

Note: Segaran's book on the Semantic Web is quite good and aimed at the right level.

Data transformation
A section covering how to merge datasets together.


 * Joining tables / relational theory
 * Named entity/place disambiguation
 * Converting rows into graphs
 * Aggregating data
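
To give a flavour of what the table-joining part of such a section might show, here is a minimal sketch of an inner join over two lists of rows in plain Python. The function name `inner_join` and the example rows are hypothetical and invented for illustration; a real section might well use a database or a spreadsheet tool instead.

```python
def inner_join(left_rows, right_rows, key):
    """Join two lists of dicts on a shared key column (inner join)."""
    # Index the right-hand table by key so each lookup is a dict access.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left_rows:
        # Rows without a match on the right are dropped (inner join).
        for match in index.get(row[key], []):
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined


# Hypothetical example data, invented for illustration.
schools = [{"town": "A", "schools": 3}, {"town": "B", "schools": 5}]
libraries = [{"town": "A", "libraries": 1}]
print(inner_join(schools, libraries, "town"))
```

Town "B" has no row in the second table, so it does not appear in the result; keeping such rows would make this a left join instead.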

Dealing with messy data

 * parsing rubbish HTML
 * parsing strings "23 July 2003"
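
As a minimal sketch of the date-parsing idea, Python's standard library can turn a string in this form into a date. The function name `parse_day_month_year` is hypothetical, and the sketch assumes English month names and exactly the day, month name, year layout shown above; real messy data would need several such patterns.

```python
from datetime import datetime


def parse_day_month_year(text):
    """Parse a string such as "23 July 2003" into a date object."""
    # %d = day of month, %B = full month name, %Y = four-digit year.
    # %B depends on the active locale; English month names are assumed here.
    return datetime.strptime(text.strip(), "%d %B %Y").date()


print(parse_day_month_year("23 July 2003").isoformat())
```

Strings that do not fit the pattern raise a `ValueError`, which is a useful signal that another format needs handling.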

Data recovery
Quite technical: may be useful for some use cases


 * How to recover data from optical media and old hard drives.
 * Decompiling old binary file formats.

Web scraping

 * Scaling. How to move to tens of thousands of requests.
 * Being nice to servers: respecting robots.txt, caching requests
 * Proxies
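
The "being nice to servers" point could be illustrated along these lines: a minimal sketch that checks a site's robots.txt rules before fetching and caches each response so that no URL is requested twice. The names `make_polite_fetcher` and `fake_fetch`, the user agent string and the example rules are all hypothetical. The real download function is passed in as a parameter, so the sketch itself makes no network requests.

```python
from urllib.robotparser import RobotFileParser


def make_polite_fetcher(robots_txt, user_agent, fetch):
    """Wrap a fetch function so it respects robots.txt and caches results."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    cache = {}

    def polite_fetch(url):
        # Skip anything the site has asked crawlers not to request.
        if not parser.can_fetch(user_agent, url):
            return None
        # Only contact the server the first time a URL is requested.
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return polite_fetch


# Hypothetical stand-in for a real download function; it records its calls.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "contents of " + url


example_rules = "User-agent: *\nDisallow: /private/\n"
polite_fetch = make_polite_fetcher(example_rules, "handbook-example-bot", fake_fetch)
print(polite_fetch("http://example.org/data.csv"))
print(polite_fetch("http://example.org/data.csv"))  # served from the cache
print(polite_fetch("http://example.org/private/report.html"))  # disallowed
print(len(calls))
```

In a real scraper the rules would be downloaded from the site's own /robots.txt, and the cache would usually live on disk rather than in memory.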

Cleaning data
Something like how to use Google Refine. It would be nice to be tool-agnostic, though. A few paragraphs on when to use a tool like Refine versus scripting versus a spreadsheet application would be useful.

Privacy
This could include a discussion about a number of strategies that people can take:


 * deleting unique identifiers
 * converting names/emails to md5 digests
 * dithering data points, e.g. adding random noise
 * aggregation
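
Three of these strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `drop_identifiers`, `md5_digest` and `dither` are hypothetical, the noise is Gaussian with a scale chosen by the caller, and the sketch says nothing about how much protection each strategy actually provides.

```python
import hashlib
import random


def drop_identifiers(row, identifier_fields):
    """Return a copy of a row without the listed identifying fields."""
    return {k: v for k, v in row.items() if k not in identifier_fields}


def md5_digest(value):
    """Replace a name or email address with its MD5 hex digest."""
    # Normalise first so that differences in case or spacing do not
    # produce different digests for the same address.
    normalised = value.strip().lower().encode("utf-8")
    return hashlib.md5(normalised).hexdigest()


def dither(values, scale, rng):
    """Add random Gaussian noise with the given scale to each data point."""
    return [v + rng.gauss(0, scale) for v in values]


# Hypothetical example row, invented for illustration.
row = {"id": 17, "email": "Someone@example.org", "age": 34}
cleaned = drop_identifiers(row, {"id"})
cleaned["email"] = md5_digest(cleaned["email"])
print(cleaned)
print(dither([34, 41, 29], scale=1.5, rng=random.Random(0)))
```

Passing the random number generator in explicitly makes the dithering repeatable while testing; production use would normally omit the fixed seed.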

API guide
An overview of what a web API is and how to become familiar with API documentation. Might include what HTTP is all about.

Buzzwords: SOAP, REST, JSON, XML, OAuth

Emerging trends: msgpack

API catalogue
Provide a paragraph or two about each open API that we can find.

Some things to provide for each API: thematic tags, description, likely users

Machine Learning/Natural Language Processing
A guide on how to get structured information from unstructured data sources.

OCR
How to set up an OCR pipeline. How to use all of the common free software tools. Possibly an evaluation of each of them.

Discussion of what a 'model' is in machine learning/OCR.

How to train new models for new languages, handwriting and unusual paper.

Idea: (Fabric) script that downloads and installs OCR tools for whoever wants it

= Planning =


 * Create a 2 month plan for Open Data Manual -> http://opengovernmentdata.okfnpad.org/open-data-manual-2011-planning

= History =

The Open Data Handbook (originally known as the Open Data Manual) started as part of the Open Knowledge Foundation's EU Open Data working group. Its original content was created in a weekend sprint in late 2010. The initial documentation was drafted at okfnpad.org, which remains a useful reference.

The project now lives as a self-contained project within the foundation. The community is mainly centred around the open-data-handbook mailing list. It is primarily supported by the open government data and the EU open data working groups.

Sprint Autumn 2010

 * Open Data Manual Google Doc
 * Folder
 * Tasks ticket (more useful links): http://knowledgeforge.net/okfn/tasks/ticket/329
 * Online: #okfn on irc.oftc.net (see http://okfn.org/contact/)
 * In person: Berlin, New Thinking Store, Tucholskystr.48, 10117 Berlin
 * Map

= Sub-pages =


 * Open_Data_Manual/Technology_Options lists the technological options for hosting and delivering the handbook.