
= Plan =

= Current Status: 2010-01 =

The grid is operational but still very much alpha, as we have not yet resolved issues around permissions and space allocation. The problem is that Tahoe is built around privacy and is not particularly oriented to our use case; see:


 * Access control and permissions on a tahoe grid (tahoe-dev thread)
 * How does file deletion work?

We have also spoken at some length with an expert in the field of wide-area distributed storage (which is what this is), who completed a PhD at Cambridge in the area a couple of years ago. His basic evaluation was that he did not yet know of any project that had really solved this problem and offered easily installable, ready-to-use software.

= Plan: 2009-08 =

TODO: summarize the Tahoe security model and how it affects us (roughly: everything is based on capabilities and all files are encrypted, so how do we control what is stored there?). See these threads: Access control and permissions on a tahoe grid; How does file deletion work?

We are storing code (and some docs) in a Mercurial repo: https://knowledgeforge.net/okfn/grid/


 * DONE 2009-05 Get basic tahoe grid up and running
 * Layer our own web interface over the web frontend to provide required permissioning
 * DONE 2009-06-26 Replace simple html front page with basic pylons webapp
 * DONE 2009-07-03 Functioning upload form
 * DONE 2009-07-03 Browse virtual filesystem
 * Implement file ownership
   * Requires login/logout support
 * Implement basic permissioning (owner and admin can do anything, everyone else can read)
 * Sort out basic subdirectory structure (suggest: /home/, /var/{project-name}/, ...)
 * Think about how all of this works on local nodes
   * (e.g. does a person who uploads at a local node also have to register ownership on grid.okfn.org?)
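
The basic permissioning rule in the list above (owner and admin can do anything, everyone else can read) could be sketched roughly as follows. This is only an illustration: the `can_access` function, the `ADMINS` set and the action names are hypothetical and are not taken from the grid webapp or from Tahoe. It also assumes that "everyone else" includes visitors who are not logged in.

```python
# Hypothetical sketch of the basic permission rule; all names are assumptions.
ADMINS = {"admin"}  # assumed set of admin usernames
READ, WRITE, DELETE = "read", "write", "delete"


def can_access(user, owner, action):
    """Return True if `user` may perform `action` on a file owned by `owner`.

    `user` is None for a visitor who is not logged in.
    """
    if action == READ:
        return True   # everyone can read
    if user is None:
        return False  # anything beyond reading requires a login
    return user == owner or user in ADMINS
```

Because every action other than reading needs a known user, this is also why file ownership requires login/logout support first.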

= From Julian 2009-06-26 =

== Tahoe grid use cases for OKFN ==

I have several distinct examples of data that would be better stored in the grid than in the kforge SVN system, where they currently live.

(1) Large quantity of webscraped United Nations documents in PDF form (both text and scanned): http://knowledgeforge.net/ukparse/svn/trunk/undata/pdf/

(2) pdf2xml versions that have been corrected for typos and other problems: http://knowledgeforge.net/ukparse/svn/trunk/undata/pdfxml/ (We are not interested in the intermediate versions -- just the original (which can be obtained by calling pdf2xml again) and the final edit that parses.)

(3) Large quantity of webscraped HTML from the UK Parliament, saved under different versions (-a, -b, etc.) that all need to remain accessible: http://knowledgeforge.net/ukparse/svn/trunk/parldata/cmpages/

(4) Scanned hand-written maps, field notes and pages from logbooks: http://knowledgeforge.net/sesame/club/mmmmc/Ireby%20Fell%20Cavern/rawscans/

== Why SVN is inappropriate ==

SVN was made for code bases, but it has been used on kforge to host files of the types listed above because it manages (a) the backups and (b) synchronizing the data into other people's directories.

However, the versioning and diff features are entirely counterproductive for such files, and costly, because SVN stores an unnecessary duplicate of every file that has been checked out.

It's easy to forget what a special application coding is, and why versioning works for it but not for documents of the sort listed above.

With code, you (and other coders) change batches of files simultaneously. It is considered wrong to check in code that does not compile, i.e. all the links between the files must change consistently. This works because all the links are internal to the project.

With data, the links come from the outside, so you cannot change a document whilst keeping its identity (name) the same. For legal documents, changing a document in place would break incoming references, so you need to publish amendments or revisions and keep all versions available.

With scans of paper evidence, you could in principle revise them by rescanning at a better resolution, but in practice you cannot: anything that refers to excerpts of these scans by pixel coordinates would break. So these cannot be versioned like code either.
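
The rule that a stored name is never reused for changed content can be sketched as below. This is a hypothetical illustration only: the `next_revision_name` helper and the exact letter-suffix scheme are assumptions, loosely modelled on the -a, -b versions mentioned for the UK Parliament pages.

```python
import string


def next_revision_name(base, existing):
    """Return the first unused name among base-a, base-b, ..., base-z.

    `existing` is the collection of names already stored. Nothing is
    ever overwritten, so references to earlier revisions stay valid.
    (Hypothetical naming scheme, for illustration only.)
    """
    for letter in string.ascii_lowercase:
        candidate = f"{base}-{letter}"
        if candidate not in existing:
            return candidate
    raise ValueError(f"no free revision suffix left for {base!r}")
```

A revised document is then published under the new name alongside the old one, rather than replacing it.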

== Use cases for Tahoe grid ==

We get the same two features: backing up, and synchronising new files (not file diffs) to a different computer.

=== Case 1 ===

The whole repository of a particular type is copied down onto the server. Tahoe is used to back up and move the server's data. Everything is available locally for processing across all of the data. The server can add new files to the repository.

This case applies to undemocracy and parlparse because all the data is needed in order to present users with statistical figures (e.g. attendance rates in votes).
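
Since files are not changed under the same name, synchronising in this case only needs to compare sets of names rather than compute diffs. A minimal sketch, assuming the file names on each side have already been listed somehow (the `plan_sync` function is hypothetical and does not talk to Tahoe itself):

```python
def plan_sync(grid_names, local_names):
    """Work out which files to copy in each direction.

    Only names are compared: a file present on one side and missing on
    the other is copied whole. No diffs are computed.
    (Hypothetical helper, for illustration only.)
    """
    grid, local = set(grid_names), set(local_names)
    return {
        "download": sorted(grid - local),  # in the grid, not yet on the server
        "upload": sorted(local - grid),    # added on the server, not yet in the grid
    }
```

For example, `plan_sync(["a.pdf", "b.pdf"], ["b.pdf", "c.pdf"])` asks for `a.pdf` to be downloaded and `c.pdf` to be uploaded.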

=== Case 2 ===

Files are served directly out of Tahoe through a server. The full repository is not copied down; the server merely caches some files. This is more useful for delivering PDF documents or images, where statistical analysis is not always wanted.
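
The caching server in this case could be sketched as a small read-through cache. This is an assumption-laden illustration: the `ReadThroughCache` class is hypothetical, the `fetch` callable stands in for whatever actually retrieves a file from the grid, and the least-recently-used eviction policy is our own choice rather than anything decided in the plan.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Keep up to `max_items` recently used files in memory.

    `fetch` is any callable that takes a file name and returns its
    contents; here it stands in for a read from the grid.
    (Hypothetical class, for illustration only.)
    """

    def __init__(self, fetch, max_items=100):
        self._fetch = fetch
        self._max_items = max_items
        self._items = OrderedDict()

    def get(self, name):
        if name in self._items:
            self._items.move_to_end(name)    # mark as most recently used
            return self._items[name]
        data = self._fetch(name)             # cache miss: go to the grid
        self._items[name] = data
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict the least recently used
        return data
```

Only files that are actually requested are ever pulled from the grid, so the server never needs a full copy of the repository.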