OED Full Text

= Oxford English Dictionary (1st Ed) Full Text Online =

Purpose
Make the 1st edition of the Oxford English Dictionary (now in the public domain) available online in an open form that facilitates reuse and improvement.

Specifically:


 * OCRing the current scans made available by Kragen Sitaker
 * Putting the resulting text online in a format that will make distributed proofing easy (e.g. a wiki)
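
As a rough illustration of the OCR step, the sketch below runs the tesseract command-line tool over a directory of page scans and produces one text file per page, which could then be posted as individual wiki pages for proofing. The directory layout, the `.tif` extension, and the function names are assumptions made for the example and are not part of the project; the only tesseract behaviour relied on is its basic `tesseract <image> <output base>` invocation, which writes `<output base>.txt`.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image_path, out_dir):
    """Return the tesseract command line for one scanned page.

    tesseract takes an input image and an output base name and
    writes the recognized text to <output base>.txt.
    """
    image_path = Path(image_path)
    out_base = Path(out_dir) / image_path.stem
    return ["tesseract", str(image_path), str(out_base)]


def ocr_pages(scan_dir, out_dir, run=subprocess.run, pattern="*.tif"):
    """OCR every page image in scan_dir, one text file per page.

    `run` is injectable so the loop can be tried without tesseract
    installed. Returns the expected text-file paths in page order.
    (Hypothetical helper: names and layout are assumptions.)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text_files = []
    # Sorting by file name keeps pages in order if names are zero-padded.
    for image_path in sorted(Path(scan_dir).glob(pattern)):
        run(build_ocr_command(image_path, out_dir), check=True)
        text_files.append(out_dir / (image_path.stem + ".txt"))
    return text_files
```

Keeping one text file per scanned page means each wiki page can sit next to the scan it came from, which is what makes distributed proofing practical.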

Created: 2007-11-01

 * Original work on the EB (Encyclopaedia Britannica, 11th edition), which led to the idea in early 2007
 * 2007-11-01: post to list proposing the project: http://lists.okfn.org/pipermail/okfn-discuss/2007-November/000635.html

Project Members

 * RufusPollock

Participate

 * Via email: join the okfn-discuss list or email info [at] okfn [dot] org.

More Details
Original email to the list:

Working with the Shakespeare entry from 11th edition of the Encyclopaedia Britannica over the last year, and particularly the experience of using tesseract, has got me thinking about a couple of potential projects along the same lines:

1. OCRing all of the EB 11th edition and putting it up online. If we put this in something that was editable this might also be a quicker way to do the proofing than the pgdp approach (which is currently working on some of the earlier volumes but is proceeding fairly slowly). It is interesting that some people have done this kind of thing already (see examples at the end of the Wikipedia article on the EB 11th edition [1]) but all of them seem to be closed (i.e. claim copyright on the results).

2. Kragen Sitaker did amazing work back in 2005/2006 'liberating' the OED first edition which is now (mostly) in the public domain [2]. He posted up fairly good scans of volumes 1-6 on archive.org (see [2]). However at the time he was unable to do much on the OCR front (no doubt because of the poor performance of open source OCR, particularly on such a complex text as the OED which has lots of non-standard English and font changes). With the better open source OCR engine it would be possible to convert the OED back into text and perhaps wikify it to allow for gradual proof-editing and correction.

What do people think? Would this be something worth investigating further? For example I don't yet know how well tesseract would work on the OED text and this would obviously affect the value/cost trade-off.

~rufus

[1]:

[2]: 