Wg/linguistics/workflows

=General Overview=

There is currently no central repository for workflows, pipelines, or shell scripts in Linguistics and other, related social sciences, namely psychology, anthropology, history, and classics. This may be due mainly to two facts: Linguistics, especially theoretical Linguistics, is not computationally heavy, and there are no programs (not including NLTK and other computational linguistics add-ons, or others that I may not know of) whose sole purpose is to help with computational linguistic workflows, in the way that Kepler, Taverna, VisTrails, and others do for the other sciences. Many published papers today include appendices with the related data, but they do not include the code used to parse that data.

In the interest of reproducible and open science, we are currently trying to gauge what interest there is in setting up a public repository of linguistics scripts and workflows. A further option, building on this work, is to start an open access journal that publishes papers, data, and scripts together in a single place, so that inquisitive scientists have all of the tools they need at their disposal to reproduce a paper's results. The aim is for the server to be powerful enough to run such code itself, or in the cloud, so that each downloader does not have to install the needed software.

=Central Questions=

 * What will this offer beyond GitHub?

Hopefully, it will have DOIs, be a bit of a social network like myExperiment, and most importantly will not be a large site ill suited to small pieces of code from the social sciences, but rather a place where tags and interconnected users make it easy to see each other's code. It's less about being just a repository and more about being a registry of code.

 * What will this offer beyond myExperiment?

It is not based on dynamic workflow systems, but on pre-existing code snippets. Hopefully, workflows will eventually evolve (and hopefully because of the site). Until then, this is a repo for linguists, not a bioinformatics workflow repository.

 * What will this offer beyond the Open Research Computation journal?

Specificity of fields: this mostly has to do with linguistics.

=More=
In Linguistics (and similar social sciences), there are no standard 'workflow workbenches' that non-programmers can use to develop, use, and share their workflows. However, as the field becomes increasingly data-intensive, computational linguists are using computational pipelines to facilitate their main research. On some occasions, this code can be uploaded as a supplement: the Journal of Experimental Linguistics is a good example of a journal that strives to provide the supplementary material needed for reuse and reproducibility.

However, supplementary code only applies to single journal articles. While open access and open source projects are common (to an extent) in Linguistics and the Social Sciences, there is not yet a single repository for code of any sort: workflows, pipelines, project-based code, code used in a publication, or code useful in other research, whether published or not. As such, the purpose of this post is to call for participation in setting up such a repository; in setting up an open access journal that can cover ground in reproducible, data-intensive research that the JEL does not cover; in developing a workflow workbench architecture for interoperability of Linguistics data and research; and in promoting the use of pipelines in research. This work is currently in its very early stages, and any help would be appreciated. One way to get involved is to join the dedicated listserv.
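
The idea of a registry of code, where entries carry tags and optionally a DOI and point to wherever the code lives, could look roughly like the following. This is only a minimal sketch: the class names, field names, and example URLs are all hypothetical and do not describe an existing system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScriptEntry:
    """One registered script. Every field name here is hypothetical."""
    title: str
    author: str
    url: str                                  # where the code actually lives
    tags: set = field(default_factory=set)    # free-form tags chosen by users
    doi: Optional[str] = None                 # optional persistent identifier


class Registry:
    """An in-memory registry; a real site would use a database instead."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        """Add an entry to the registry."""
        self._entries.append(entry)

    def find_by_tag(self, tag):
        """Return all entries carrying the given tag, ignoring case."""
        wanted = tag.lower()
        return [e for e in self._entries
                if wanted in {t.lower() for t in e.tags}]


if __name__ == "__main__":
    registry = Registry()
    registry.register(ScriptEntry(
        title="Tokenizer",
        author="A. Linguist",
        url="https://example.org/tokenizer",  # placeholder URL
        tags={"tokenization", "python"},
    ))
    for entry in registry.find_by_tag("Tokenization"):
        print(entry.title, entry.url)
```

Keeping only a link and metadata for each script, rather than the code itself, is one way to read "registry rather than repository"; storing the code directly would be an equally valid design.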

=Members=
 * Richard Littauer, Universität des Saarlandes
 * Steven Moran, Ludwig-Maximilians-Universität München
 * Peter Bouda, Ludwig-Maximilians-Universität München

=Current Tasks=
 * Install Django on the repo site, as a basis and as a blog for ideas regarding workflows. Cf. Django Packages.
 * We integrate the workflow architecture of VisTrails with the IPython notebook. The goal here is that users can create and develop the computational parts in IPython and use VisTrails for workflow and data handling. Everything should be attached to a repository (Subversion?). By using IPython, we immediately get client-side development and a server-side runtime.
 * Work on IPython target experiments for the repository itself. Cf. IPython notebook.
 * Advantage: You have a coding environment in the browser (implemented in JavaScript), while the (Python) code is executed on the server(s). Users can also download code and execute it locally. You can add plots directly to the notebook with matplotlib. We may look at how to integrate IPython notebooks with VisTrails.
 * IPython may be a good start because we can learn how to implement the communication channel between a workflow system/experiment repository and the servers that execute code.
 * There are also software systems to enable OA journals and conferences: [1], [2]


 * We try to integrate other computational environments like R and Java. We use VisTrails to call external packages, so we can still use it to handle the workflows, data, and scripts.
 * We develop a web GUI for VisTrails, parallel to IPython. This would allow full web access for users, while all the data, scripts, and runtime are managed by the server.
 * Forum?
 * Journal?
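
The tasks above mention handing parts of a workflow to external environments such as R or Java. As a rough illustration of that idea only, the sketch below chains external commands from Python using the standard library. It does not use VisTrails or IPython, the function names `run_step` and `run_pipeline` are hypothetical, and the two demo steps simply reuse the Python interpreter as a stand-in for an external tool.

```python
import subprocess
import sys


def run_step(command, input_text=""):
    """Run one pipeline step as an external command and return its stdout."""
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the step fails
    )
    return result.stdout


def run_pipeline(steps, input_text=""):
    """Feed the output of each step into the next one."""
    data = input_text
    for command in steps:
        data = run_step(command, data)
    return data


if __name__ == "__main__":
    # Stand-in steps: a real pipeline might instead call an R or Java
    # program here (for example via Rscript; that is an assumption).
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    count = [sys.executable, "-c",
             "import sys; print(len(sys.stdin.read().split()))"]
    print(run_pipeline([upper, count], "colorless green ideas"))
```

Passing plain text between steps keeps each step language-agnostic, at the cost of leaving data formats to convention; a workflow system would normally add typed inputs and outputs and provenance tracking on top of this.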