Workshop Open Science and Social Science OKCon 2011

Notes and follow up from workshop at OKCon 2011.

For background read this post introducing the workshop.

= Pads =


 * http://etherpad.mozilla.org:9000/openscienceberlin
 * http://okfnpad.org/data-digitizer

= Photos =


 * http://www.flickr.com/photos/teleyinex/sets/72157626987703445/

= Dump from Main Pad =

Welcome to the Etherpad for OKCon 2011 Berlin hackfest on open science and social science!

Wiki page (where we'll migrate notes at end of workshop for permanent home!): http://wiki.okfn.org/Workshop_Open_Science_and_Social_Science_OKCon_2011

Panel:


 * Francois Grey from Citizen Cyberscience Centre, CERN, and the Shuttleworth Foundation
 * Nicolas Maire, Swiss Tropical and Public Health Institute, University of Basel, http://malariacontrol.net
 * Carl Christensen, Stanford University, Quake Catcher Network http://qcn.stanford.edu, Oxford University, http://climateprediction.net
 * Daniel Lombraña González, Citizen Cyberscience Centre, CERN. LHC++@HOME: Test4Theory (http://boinc01.cern.ch).
 * Lucas Mation - Brazil (Institute of Applied Economic Research (IPEA), a think tank advising the government on planning). Started publishing data from its models in 1999. IPEA data: http://ipeadata.gov.br/. Large amounts of scanned material.
 * David Aanensen. EpiCollect: mobile collection of data.
 * Rufus Pollock, Open Knowledge Foundation. http://ckan.net/
 * Jenny Molloy, Department of Zoology, University of Oxford and Open Knowledge Foundation (Coordinator of Open Data in Science Working Group http://science.okfn.org)
 * Einstein @ Home.
 * Caspar Addyman. Birkbeck, University of London, Using smartphones to collect data on alcohol and drug use. http://YourBrainonDrugs.net
 * Mirjam. New media arts curator working with environmental data mapping projects.
 * Axel, Daniel. Programmers at Stockholm University, Sweden. Open tools for social sciences. Large datasets as charts online.
 * Nannette Schliebner. Humboldt University. Economics; interested in making it more transparent.
 * Richard Littauer, Linguist from Edinburgh. Workflows in science. http://notebooks.dataone.org/workflows. Also, open Linguistics/language databases (social science databases.)
 * Guo Xu. Research associate at German Institute for Economic Research (DIW Berlin), PhD Econ at LSE, Coordinator for OKF Open Economics working group. http://openeconomics.net/
 * Raine Rimpilä from Floapps Finland. Programmer working on CKAN - http://ckan.org/
 * Axel Kittenberger, University of Vienna. Social Sciences, Science and Technology Studies, Coding, Visions and Future Construction of Open Gov Data
 * Miguel Said Vieira, PhD student at University of São Paulo, member of Scientiae Studia (academic society focusing on the philosophy, sociology and history of S&T)
 * Tim Hubbard. Human Genome Project. Where Does My Money Go.
 * Peter Murray-Rust. University of Cambridge. Panton Principles, Open Science, Open Bibliography

The hackfest is from 10am-6pm on Wednesday 29 June at the Kalkscheune in central Berlin (http://www.kalkscheune.de/en/). Registration for the hackfest is free of charge. For the full OKCon programme (30 June - 1 July) and how to register see http://okcon.org/2011/programme

The hackfest is open to anyone who cares about open science, and will focus in particular on the use of volunteer computing and volunteer thinking in economics and social sciences.

You may be a professional scientist, a citizen scientist, a programmer, a web designer, or just plain curious. Whatever skills you've got, they're bound to be useful, provided you come with an open mind and a lot of enthusiasm to get something done.

This etherpad is where YOU put suggestions for what you would like to think about, talk about and hack about during the workshop.

So just to get the ball rolling, here are some of the topics I know our invited experts are interested in. Add more, and we'll sort them and assign teams to the most popular and promising ones on the day:

IDEA 1 (Economic) Data Digitising
DEDICATED ETHERPAD: http://okfnpad.org/data-digitizer

Science Lead

Mission:

Digitizing archival government data about economics that is in tabular format. OCR can't do this alone, because of the spatial context. We need crowdsourcing. But how to combine the two?

Opportunities: Creation of a web application that will allow volunteers to mark which areas of the scanned pages have tabular data.

Areas of collaboration: interface design, development, documentation and translation.

Skills required: programming (Javascript, PHP, BOSSA, Ruby, RoR, Sinatra, etc.), documentation, web design, testing, bug triage, translations.

Scanned material at: Memoria Estatística do Brasil project, http://memoria.nemesis.org.br/

Example test case: the user opens a web page showing an image of the scanned documents and is asked to answer several questions:

1.- Do you see a table in this image? Possible answers: Yes or No.
2.- If the answer is Yes:
 2.1.- Please draw rectangles on top of the image, marking the tabular data (http://dl.dropbox.com/u/27667029/ocr-mockup.zip).
3.- Else:
 3.1.- Load the next image.

These simple steps will filter which scanned pages have tables. Additionally, the tables will be divided into cells, allowing them to be analysed or transcribed later. This information could be used to automatically cut the tables into pieces, which can then be sent to users for transcription or to an OCR pipeline. The information is then joined back together in a final step.

To sum up, the application should be a Javascript application that lets volunteers draw rectangles over an image and saves their coordinates to a database through a RESTful API. At this stage the data can be stored in a SQLite database.
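As a minimal sketch of the client-side logic, assuming a hypothetical `/rectangles` endpoint and field names (none of this is agreed API yet): normalise two mouse positions into a rectangle and serialise it for a REST POST.

```javascript
// Sketch only: the /rectangles endpoint and the field names are
// hypothetical placeholders, not part of the project.

// Normalise two corner points (e.g. mousedown and mouseup positions)
// into an {x, y, w, h} rectangle, whichever way the user dragged.
function toRectangle(x1, y1, x2, y2) {
  return {
    x: Math.min(x1, x2),
    y: Math.min(y1, y2),
    w: Math.abs(x2 - x1),
    h: Math.abs(y2 - y1),
  };
}

// Build the JSON body a RESTful backend could store in SQLite.
function rectanglePayload(imageId, rect) {
  return JSON.stringify({ image_id: imageId, ...rect });
}

// In a browser this would be sent with something like:
// fetch('/rectangles', { method: 'POST', body: rectanglePayload(id, rect) })
const rect = toRectangle(120, 80, 40, 200);
const payload = rectanglePayload(305, rect);
```

Keeping the normalisation and payload-building as pure functions makes them easy to test independently of the canvas UI.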

There is a mockup for the project; please modify it and add suggestions.

Mirror 1: http://dl.dropbox.com/u/27667029/ocr-mockup.zip
Mirror 2: http://ubuntuone.com/p/yuU/

Long term ideas: Discuss using the Twitter OAuth implementation http://dev.twitter.com/anywhere for managing users, or using a complete solution like BOSSA (http://boinc.berkeley.edu/trac/wiki/BossaOverview), which already has a user management infrastructure (the BOLT training module (http://boinc.berkeley.edu/trac/wiki/BoltIntro), forums, etc.).

IDEA 2 BOINC on Phones (Android)
Running agent-based simulations of socio-economic impact of, say, diseases like malaria on smart phones and pads. Yes, the computing power of the future is in our hands - so can we help the volunteer computing transition from the PC and laptop to hand-held devices?

Science Lead: Nicolas Maire, malariacontrol.net

Port the malaria code to Google Android, probably via the Android NDK to compile the native C/C++ code, then write an Android Java "stub" to do the computation on Android devices (i.e. mobile phones, tablets, etc.). On a grander scale we will also need BOINC to run on Android to launch the Android malariacontrol.net application.

OpenMalaria code: http://code.google.com/p/openmalaria/
Android NDK: http://developer.android.com/sdk/ndk/index.html
Android-CMake: http://code.google.com/p/android-cmake/

PMR + Jenny Molloy are also doing this - Jenny has a hackfest in London on ? Dec 18th (GULP - just seen Jenny is here already :-) I think this is quite different, we weren't planning to run simulations were we? (maybe it will crop up though!) However, if anyone is interested we are looking at generating open malaria research reports in conjunction with the SWAT4LS conference in London on Dec 6th. Scratch whatever I said... I came in late.

IDEA 3 Geotagging Data
Mapping images of humanitarian crises published on the web using contextual information (detective work) and pattern recognition from satellite images, to do damage assessment and help plan investment for post conflict reconstruction.

Version of this for smoking (tobacco free initiative)

Suggestion: discuss privacy considerations for both cases.

IDEA 3.5
OpenSpending: interesting / wrong or not.

1.9m entries:
 * http://openspending.org/dataset/ukdepartments
 * http://openspending.org/dataset/ukdepartments?q=robot

Distribute items out and ask people to:

 a) Rate their importance / correctness.
 b) Locate further information (such as who the company is).

IDEA 5 Meta-metrik - Kill or Cure
Create a simple "proof-of-concept" format for saving regression results, and a database to store the results. Incorporate this as a "game" to encourage people to convert the tables into the open standard (similar to Lucas' idea of "outsourcing" data input to the greater public as part of participatory "open science").

(See http://openeconomics.net/open-econometrics/ for idea)
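One possible shape for such a record, purely illustrative — the field names and the example coefficient values below are assumptions, not a proposed standard:

```javascript
// Illustrative regression-result record for a MetaMetrik-style database.
// Every field name here is a guess at what such a format might hold,
// and the example numbers are illustrative only.
function regressionResult(paper, model, dependentVariable, coefficients) {
  return {
    paper,                                 // citation for the source paper
    model,                                 // e.g. "OLS", "fixed effects"
    dependent_variable: dependentVariable,
    coefficients,                          // [{ variable, estimate, std_error }]
  };
}

const example = regressionResult(
  'Mankiw, Romer and Weil (1992)',
  'OLS',
  'log GDP per working-age person',
  [{ variable: 'log(I/GDP)', estimate: 1.42, std_error: 0.14 }]
);

// Serialising to JSON makes the transcribed table machine readable
// and comparable across papers.
const serialised = JSON.stringify(example);
```

A flat record like this could later be mapped onto an XML format such as the mankiwromerweil.xml example linked below.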

IDEA 6
Not quite an idea ...

There is all this resulting data from these crowd-sourcing projects. Can we:

 a) Get info on whether the resulting datasets are open.
 b) Register this info in http://ckan.net/.

IDEA 7 (merged into discussion)
Cost of citizen cyberscience vs cost of data centres and supercomputers

Idea 8 Data Processing and Quality Assurance
Integrating data cleaning and general QA/QC steps with open databases and volunteer networks. (c.f. FilteredPush Network (in Kepler))

Initial thoughts (discussion):

 * What sort of computational infrastructure do we need to streamline volunteer responses to cleaning databases (in any science)?
 * How do we advertise this network?
 * What sort of data is easily accessible and changeable by the public? How would we enable this?
 * What sort of databases should be targeted first for human curation?

The results of our discussion are here: http://etherpad.mozilla.org:9000/openscienceqaqc

Idea 9
Work out how to make something compelling.

GEOTAGGING TEAM

Issues:

 * No risk of gaming/misuse of data (no faces in the pictures, no chance that one restaurant will use it to fine another restaurant).
 * Beautiful data / making it compelling.
 * Find the people who will want to do this.
 * Use existing reviews to mine data - use hotel data too.
 * Will this help the restaurants enforce the law or not? Smokers could use the information to find restaurants that allow smoking.
 * Only positive or negative databases? Focus on restaurants that are pro non-smoking! Get data about how they kept and increased their number of customers.
 * Health Ministry to get data from people.
 * Connect to a tourist guide or a real award (one that enforces).
 * Spectacle computing.
 * Use open knowledge.

It is very difficult to extract data about smoking from reviews (it has to be done by hand).

(Somewhat old) list of "smoke-free" restaurants in Berlin: http://www.gablinger.com/nonsmoker.html

List of smoker restaurants and bars in Berlin http://www.raucher-lokale-berlin.de/

Searchable database of smoke-free restaurants in Japan (found it through Google): http://www.kinen-style.com/

Non-smoking restaurants in Beijing http://www.meiguoxing.com/Guides/Non-Smoking_Restaurants_in_Beijing.html (made for the Olympics, not updated since March 2, 2010).

Suggestions:

 * Two websites, just for data collection: one for smokers, one for non-smokers.
 * Integrate the data, and present it in different ways to different groups.
 * Use data from a catchy website to map an overall picture.

LIBYA CRISIS

Questions:
 1) Are we inadvertently creating data that could be used against people (by recording when and where rebels were at a certain time)?
 2) Is the data from the damage assessment available openly? What about the money allocated for repairs?

DIGITIZATION TEAM

= Dump from Data Digitizer Pad =

IDEA 1 (Economic) Data Digitising
Source Code: https://github.com/okfn/datadigitizer

Workflow:

1) Image preparation (DONE)
 1.1) Get PDF of book (DONE)
 1.2) Split PDF of book into individual page images (pbm format) (DONE)
  1.2.1) Split double-page images into two pages.
 1.3) Convert pbm to png (DONE)

2) Job management
 2.1) Integrate the BOSSA account manager. (Credit can be attributed because each user is only sent one page at a time.)
 2.2) Book metadata spreadsheet in Google Docs (DONE)
 2.3) Table metadata spreadsheet in Google Docs (DONE)
 2.4) API automatically adding lines to the table metadata spreadsheet with id_code and the associated images.
 2.5) API automatically creating a spreadsheet associated with that table.

3) User interface
 3.1) Is there a table in this page?
 3.2) Visualize the page image on the left of the page (DONE)
  3.2.1) Include zooming and tools (https://github.com/okfn/datadigitizer/issues/3).
 3.3) Visualize the Google spreadsheet on the right of the page (DONE)
  3.3.1) Strip the spreadsheet headers as much as possible (preferably show only the table).
 3.4) Visualize the extraction form.
 3.5) Create an HTML form for metadata and embed it in the page.
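Step 1.2.1 of the image-preparation workflow (splitting double-page scans) amounts to cutting each pixel row at the midpoint. A minimal sketch on a row-major pixel array — the real pipeline would operate on the pbm/png files and would need to detect the gutter rather than assume an exact midpoint:

```javascript
// Split a double-page scan, represented as an array of pixel rows,
// into [leftPage, rightPage]. Assumes the gutter sits exactly at the
// horizontal midpoint, which real scans will only approximate.
function splitDoublePage(rows) {
  const mid = Math.floor(rows[0].length / 2);
  const left = rows.map(row => row.slice(0, mid));
  const right = rows.map(row => row.slice(mid));
  return [left, right];
}

// Tiny 2x4 "image" standing in for a scanned double page.
const [leftPage, rightPage] = splitDoublePage([
  [1, 2, 3, 4],
  [5, 6, 7, 8],
]);
```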

1. List of metadata and book PDFs -- google doc [JM + ...]:

https://spreadsheets.google.com/spreadsheet/pub?key=0AneuslFwgzLMdGZOWEFhT0NfS1pNSW0xM3h0bWI5YXc&output=html

Annuario estatistico de São Paulo (Brasil) 1904. São Paulo: Typographia do Diario Official, v. 2, t. Code (column B of the book metadata spreadsheet): 305. http://memoria.nemesis.org.br/trf_arq.php?a=00010005

For the MetaMetrik sample:

 * Metadata sample: https://spreadsheets.google.com/spreadsheet/ccc?key=0ArxHUTuXlPT2dERMeVZLQi02YnJoN25IY2x4Yk9tdGc&hl=en_US
 * Example of the final format: http://www.guoxu.org/okfn/mankiwromerweil.xml for http://www.guoxu.org/metametrik/MRW1992_2.png

2. Split the PDFs: use the pdfimages tool (http://linuxcommand.org/man_pages/pdfimages1.html)
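The pdfimages invocation could be scripted per book; a sketch of assembling the command line (the file names and page range below are made up for illustration):

```javascript
// Build a pdfimages command line for one book. pdfimages takes -f
// (first page) and -l (last page), then the PDF and an image root;
// it writes one image file per page image under that root.
// The paths here are example placeholders.
function pdfimagesCommand(pdfPath, imageRoot, firstPage, lastPage) {
  return ['pdfimages', '-f', String(firstPage), '-l', String(lastPage),
          pdfPath, imageRoot];
}

// Could then be run with e.g.
// child_process.execFile(cmd[0], cmd.slice(1), callback)
const cmd = pdfimagesCommand('00010005.pdf', 'pages/page', 1, 50);
```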

Sample images: http://dl.dropbox.com/u/27667029/00010005.tar.bz2 (23 MB)

3. Show a task
4. Handle tasks

Participants in Brazil
Nigini Abilio Oliveira - nigini@gmail.com, PhD candidate at UFCG

Plan for today
Discussion of workflow:
 * Input = scans (DONE) and a small amount of metadata
 * Output = machine-readable data

Problem: OCR is not working very well due to the large amount of tabulated content in the documents. An additional problem is that the documents are not easily machine readable because of the old typeface and the image files.

Options to overcome the problem:
 1) Volunteers transcribe the whole table and input all the data into a spreadsheet.
 2) Volunteers mark up the format of the table and notes, including its presence and columns.

Consensus is that this task would be impossible using OCR, therefore work on option 1!

The Plan

Build a web app displaying a Google spreadsheet next to the image containing data to be transcribed.

Using BOSSA/BOLT? http://boinc.berkeley.edu/trac/wiki/BossaIntro

Discussion: PMR: Are we aiming to develop a technology today which has the Brazilian datasets as a possible use case, or are we developing a specific solution to the problem presented by that dataset? Not entirely decided; it would be good to make it as general as possible.



 * Sample images: http://dl.dropbox.com/u/27667029/sample-images.zip
 * BOSSA Mindmap (online - imported from .mm): http://www.mindmeister.com/maps/show/103622190
 * BOSSA Mindmap: http://dl.dropbox.com/u/27667029/bossa-mindmap.mm (to open it you need FreeMind http://freemind.sourceforge.net/wiki/index.php/Main_Page)
 * BOSSA Mindmap Flash version: http://dl.dropbox.com/u/27667029/bossa-mindmap-flash-version.zip


OCR technology:
 * ABBYY FineReader (here). According to experts, it is the best OCR solution. We also need to check whether it has solutions for pre- and post-processing.

Software tested during the hackfest:


 * Tesseract was the program that worked best. It does not work for the tables; plain text works reasonably well, but the titles came out terrible.
 * http://googlecode.blogspot.com/2006/08/announcing-tesseract-ocr.html from Google, open source (runs on Windows and GNU/Linux). (2006)
 * Middling: OCRopus needs to be tested in more detail, as it uses Tesseract version 2.
 * Other programs tested that did not work: gocr, ocrad.

Experts in Brazil:

LIBER, Laboratório de Tecnologia da Informação (example here) http://ocropodium.cerch.kcl.ac.uk
 * Rafael D. Lins, UFPE (personal page) (paper comparing OCR options), rdl.ufpe@gmail.com
 * Joaquim Nabuco Project (paper telling its history here)
 * References on OCR in Internet forums:
 * http://www.linux.com/archive/feature/138511 analysis of several tools. (2008)
 * http://www.claraocr.org/
 * http://www.claraocr.org/en/ocr/ocr-software/open-source-ocr.html more open-source alternatives. OCRopus, used by Google Books?
 * http://jocr.sourceforge.net/
 * Ocropodium, running at KCL in the UK to test and extend OCRopus - might be a useful collaborator
 * http://code.google.com/p/ocropodium/
 * Videos on digitisation + OCR (conference, France 2010)

Using volunteers:


 * Demo of the platform for marking tables in images (click here > create an account > click "Get job" > "Brazil OCR").
 * Preliminary version, developed on the BOSSA middleware by David Anderson during the hackfest. "We developed a crude prototype of a volunteer thinking app where people identify the tables in document images."
 * Still needs to be finished off by David.
 * The interface the volunteer will use still needs to be programmed in PHP.
 * BOSSA: middleware for projects using volunteers on the Internet.
 * Documentation here
 * Presentation here
 * Other initiatives:
 * Distributed Proofreaders (here) (paper about the project here)
 * Transcribe Bentham (here) (video tutorial here)
 * Manuscript Transcription by Crowdsourcing: Transcribe Bentham

Several other manuscript transcription projects are listed in this post:

Historical Documents in a Digital Library: OCR, Metadata, and Crowdsourcing « Lemonade & Information
 * Detailed post about several crowdsourcing projects for document transcription: 2010: The Year of Crowdsourcing Transcription (other interesting posts on the blog)
 * FromThePage, software for transcribing pages.
 * Crowdsourcing: How and Why Should Libraries Do It? http://www.nla.gov.au/ndp/project_details/
 * Crowdsourcing in the Cultural Heritage Domain: Opportunities and Challenges
 * An Assessment of Intrinsic and Extrinsic Motivation on Task Performance in Crowdsourcing Markets
 * A Game-based Approach to Transcribing Images of Text, Khalil Dahab, Anja Belz
 * Transcription in a Digital World, ActiveHistory.ca
 * http://scripto.org/, open-source software for transcribing pages on the Internet
 * Libraries and crowdsourcing - 6 examples
 * Crowdsourcing OCR correction through game playing
 * OCR Text Correction is a Good Project for Crowdsourcing

Internet Archive:

Details on the scanning process (here):

Digital repositories:

The Dataverse Network http://www.iq.harvard.edu/projects/dataverse_network
 * DSpace: free software for organising digital repositories, created at MIT (see video here)

Future stuff:
 * Work out the logic behind the names of the PDFs on the Memória website.