Projects/Open Shakespeare/AnnotationSystem

Problems to address/features we want:


 * co-ment is very slow with large documents. It appears to wrap every word and every piece of whitespace in the document in extra markup, client-side, after load.
 * want to scale to arbitrarily large documents.
 * want to be able to display annotations for one version of document (e.g. Folio) on a different version (e.g. 2nd Quarto).
 * want to be able to display differences between versions (plug into, say, docdiff).

A possible architecture.


 * Texts stored in appropriate format (XML?) in version control repository.
 * Annotations stored as {document id, {start address, end address}, text, [other annotation data], ...}
 * Addresses: character offsets in XML? XPath + character offset? CDATA-only character offset (counting text content only and ignoring markup, which allows for clean transitions to/from text/XML)? My preference would probably be for the last of these -- NickStenning.
 * Any (hopefully few) changes to a document can be detected through the VCS and the database updated, with addresses modified so that annotations remain in the right place. This could easily(?) be made into a nice abstraction: given a directional diff between two versions of the text and a list of {start addr, end addr}, return a list of the new {start addr, end addr}, dropping annotations that correspond to removed text, etc.
 * Copy annotations to alternate version? Annotations with multiple locations (i.e. 'Annotation' has-many 'Documents' through 'Locations')?
 * A possible algorithm for doing address translation:
   * Take a smallish chunk of the text centered on the relevant address: chunk
   * Find the address in the new document corresponding to the same proportional position (i.e. chunk appears 37% through the first document, so assume it will appear near there in the new document): pivot
   * Fuzzy full-text search the document for chunk, weighting text nearer pivot higher than that further away.
   * We'll end up with a few possible alignments, hopefully one being far more likely than the others. If so, we're sorted. If not, then at the very least we have an easy machine-assisted means of identifying the correct alignment.
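
The chunk/pivot steps above could be sketched roughly as follows. This is only an illustration under assumptions that are not part of the notes: the function name `translate_address`, the `radius` and `distance_weight` parameters, and the use of Python's standard `difflib` as the fuzzy matcher are all hypothetical choices, and the naive sliding window would be far too slow for a full play without some indexing.

```python
from difflib import SequenceMatcher


def translate_address(old_text, new_text, address, radius=20, distance_weight=0.5):
    """Guess where `address` in old_text lands in new_text.

    Returns (new_address, score); a low score suggests a human should check.
    """
    # Step 1: take a smallish chunk of the old text centred on the address.
    chunk_start = max(0, address - radius)
    chunk = old_text[chunk_start:address + radius]
    offset_in_chunk = address - chunk_start

    # Step 2: the pivot is the same proportional position in the new text.
    pivot = round(address / max(1, len(old_text)) * len(new_text))

    # Step 3: slide a chunk-sized window over the new text, scoring each
    # window by fuzzy similarity, discounted by its distance from the pivot.
    matcher = SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(chunk)
    best_score, best_address = 0.0, None
    last_start = max(0, len(new_text) - len(chunk))
    for start in range(last_start + 1):
        matcher.set_seq1(new_text[start:start + len(chunk)])
        candidate = start + offset_in_chunk
        distance = abs(candidate - pivot) / max(1, len(new_text))
        score = matcher.ratio() / (1 + distance_weight * distance)
        if score > best_score:
            best_score, best_address = score, candidate

    # Step 4: report the most likely alignment and how confident we are.
    return best_address, best_score


folio = "To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer"
quarto = "HAMLET. To be, or not to be, that is the Question: Whether 'tis Nobler in the minde to suffer"
new_address, score = translate_address(folio, quarto, folio.index("question"))
print(new_address, quarto[new_address:new_address + 8])
```

Keeping the runner-up scores as well as the best one would give the "few possible alignments" mentioned above, so that near-ties can be handed to a person to resolve.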

I'm aware that this is all very vague. Any feedback or attempts to make my writing a bit clearer will be most appreciated. -- NickStenning.
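
As one attempt to pin down the diff-based abstraction from the architecture list (given two versions of a text and a list of {start addr, end addr}, return the new addresses), here is a minimal sketch. It assumes plain character offsets into the text content, uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a VCS diff, and takes the conservative line that an annotation survives only if its whole span sits inside an unchanged block. The name `remap_spans` is hypothetical.

```python
from difflib import SequenceMatcher


def remap_spans(old_text, new_text, spans):
    """Map (start, end) spans in old_text to spans in new_text.

    A span that overlaps changed or removed text maps to None (dropped).
    """
    opcodes = SequenceMatcher(None, old_text, new_text, autojunk=False).get_opcodes()
    # Each unchanged block is recorded as (old_start, old_end, shift).
    unchanged = [(i1, i2, j1 - i1) for tag, i1, i2, j1, j2 in opcodes if tag == "equal"]

    remapped = []
    for start, end in spans:
        new_span = None
        for old_start, old_end, shift in unchanged:
            if old_start <= start and end <= old_end:
                new_span = (start + shift, end + shift)
                break
        remapped.append(new_span)
    return remapped


old = "Enter Hamlet. To be, or not to be."
new = "Enter Prince Hamlet. To be or not to be."
spans = [(6, 12), (17, 20), (24, 33)]  # "Hamlet", "be,", "not to be"
print(remap_spans(old, new, spans))
```

In this example the first and third annotations shift along with the surrounding text, while the second is dropped because the comma it covered was removed. Dropped spans would be natural candidates for the fuzzy search described in the address-translation steps.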

Useful links:


 * http://clientside.svn.calyptus.se/JavaScript/Native/Selection.js
 * http://www.quirksmode.org/dom/range_intro.html
 * http://www.webreference.com/js/column12/crossbrowser.html

Reviews of available technology:


 * http://wiki.laptop.org/go/Annotation
   * Not much that meets our needs
 * http://blog.okfn.org/2007/01/24/thinking-about-annotation/ - summary of what we found back then