Software Tools/Databases
From Open Knowledge Foundation
Databases
A review of available database technologies, construed in a broad sense (i.e. not just RDBMS).
Diagram of NoSQL systems: http://blog.nahurst.com/visual-guide-to-nosql-systems
Plain Ol' Filesystem
- Pair Tree: https://confluence.ucop.edu/display/Curation/PairTree
- DFlat: https://wiki.ucop.edu/display/Curation/D-flat
- Dflat defines a simple file system convention for storing a digital object. It imposes minimal architectural and policy constraints while reserving a small set of file system names (for directories and files) that place certain salient object features, if available, in well-known locations within a single directory hierarchy that houses the entire object. With no other knowledge about the nature or purpose of the object, human administrators and automated agents that visit a "digital flat" will be able to identify its "occupant" and understand certain of its properties such as content, metadata, version history, annotations, and administrative records.
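For the Pair Tree convention listed above, the core idea is that an object identifier is mapped to a directory path by splitting it into two-character segments. The sketch below is a simplified illustration, not a full implementation: it applies only the single-character substitutions (`/` to `=`, `:` to `+`, `.` to `,`) and omits the hex-escaping step the specification defines for unusual characters. The function name `pairtree_path` is made up for this example.

```python
# Simplified sketch of Pairtree identifier-to-path mapping.
# Assumption: only the single-character substitutions are applied here;
# the spec's hex-escaping of unusual characters is deliberately left out.
_SUBSTITUTIONS = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_path(identifier: str) -> str:
    """Return the pair path (two characters per directory level)."""
    cleaned = identifier.translate(_SUBSTITUTIONS)
    # Split into chunks of two characters; a trailing odd character
    # becomes a one-character directory name.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs) + "/"


if __name__ == "__main__":
    print(pairtree_path("abcd"))               # ab/cd/
    print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3/
```

The object itself (for example a Dflat directory) would then live below the final path segment.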
NOSQL (Key/Value Stores)
Read the spreadsheet for full overview:
As of early 2010, the codebases I've heard the best things about are:
- mongodb (best document store) - more mature and with fewer issues than e.g. couchdb
- voldemort (best engineered)
- cassandra (slow to gain traction but getting there)
- python wrapper with good intro to concepts: http://github.com/digg/lazyboy
- http://wiki.apache.org/cassandra/DataModel
Misc unformatted notes on NoSQL stores
Mongo has a 2 or 2.5 GB limit on 32-bit hardware since it keeps the database mmap'd. OTOH it supports deep indexing of documents, which is crucial for storing an RDF/JSON serialisation in the database.
Cassandra: rigid schema, k/v store, not document oriented. This means it's hard to follow the same pattern for storing RDF, since the value (the object part of statements) has to be a list, ideally of dictionaries. Could store the objects serialised, but then there are no indexes, so no good.
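To make the "list, ideally of dictionaries" shape concrete, here is a minimal sketch that groups triples into an RDF/JSON-style nested document: subject, then predicate, then a list of object dictionaries. The helper name `triples_to_rdfjson` and the `example.org` URIs are made up for illustration; this only shows the document shape and does not talk to any store.

```python
from collections import defaultdict


def triples_to_rdfjson(triples):
    """Group (subject, predicate, value, type) tuples into a nested dict.

    Result shape: {subject: {predicate: [{"value": ..., "type": ...}, ...]}}
    """
    doc = defaultdict(lambda: defaultdict(list))
    for subject, predicate, value, obj_type in triples:
        # Each object is a small dictionary; several objects for the same
        # subject/predicate pair accumulate in one list.
        doc[subject][predicate].append({"value": value, "type": obj_type})
    # Convert the defaultdicts to plain dicts for serialisation.
    return {s: dict(preds) for s, preds in doc.items()}


if __name__ == "__main__":
    title = "http://purl.org/dc/elements/1.1/title"
    triples = [
        ("http://example.org/book/1", title, "A Title", "literal"),
        ("http://example.org/book/1", title, "Un Titre", "literal"),
    ]
    print(triples_to_rdfjson(triples))
```

Being able to index on the fields inside those inner dictionaries is the "deep indexing" the Mongo note refers to; a store that treats the whole value as an opaque blob cannot do that.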
CouchDB and Riak could do the same as Mongo, but they don't have indexes in the same way: any searching is done with map/reduce. Without an index this means touching all objects in the database, which probably performs badly.
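The cost difference this note worries about can be illustrated generically. The sketch below is plain Python over an in-memory list and is not how any of these stores is implemented; it only counts how many documents a full scan touches compared with a lookup through a prebuilt index. All function names are made up for the example.

```python
def scan_lookup(docs, field, value):
    """Full scan: every document is examined once."""
    touched = 0
    matches = []
    for doc in docs:
        touched += 1
        if doc.get(field) == value:
            matches.append(doc)
    return matches, touched


def build_index(docs, field):
    """Map each field value to the positions of the documents holding it."""
    index = {}
    for position, doc in enumerate(docs):
        index.setdefault(doc.get(field), []).append(position)
    return index


def indexed_lookup(docs, index, value):
    """Index lookup: only the matching documents are fetched."""
    positions = index.get(value, [])
    return [docs[p] for p in positions], len(positions)


if __name__ == "__main__":
    docs = [{"id": i, "colour": "red" if i % 100 == 0 else "blue"}
            for i in range(1000)]
    _, scanned = scan_lookup(docs, "colour", "red")
    _, fetched = indexed_lookup(docs, build_index(docs, "colour"), "red")
    print(scanned, fetched)  # 1000 10
```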
4store on 32-bit hardware has a limit as well, since it uses mmap heavily. Generally, if the ptree files grow too large it will blow up. The limit in practice will be higher than Mongo's, but it is still there.
Could work around the Mongo limit by running multiple instances of the back-end and using sharding. Usually this is done on a cluster, but it would work, I think, on a single host. Each shard would have the 2 GB limit. Not sure how well it would work.
RDF
- Versioning: Not as yet (see e.g. this SO question)
- Web app framework: not as yet
RDFLib
- Has several backends (in-memory, Berkeley DB, MySQL/Postgres)
- Can put it online: http://projects.bigasterisk.com/sparqlhttp/
- Performance not great with large datasets, particularly on joins
- Standard Python API objects, particularly Node types (URIRef, Literal, Namespace, etc.)
- In-memory RDF Graph useful for (post) processing query results, small sets of data, etc.
- Python
4store
- Backend with clustering support
- Very fast
- Uses Redland parser for RDF (raptor) and SPARQL (rasqal)
- SPARQL HTTP endpoint is the standard interface
- Python bindings (use RDFLib) for talking directly to the store (see http://github.org/wwaites/py4s)
- Only one direct connection to the store possible at a time due to locking issues (support for multiple connections exists in multiclient branch at http://github.org/wwaites/4store)
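Since the SPARQL HTTP endpoint is the standard interface, a client only needs to send the query text as the `query` parameter of an HTTP request, as the SPARQL protocol describes. The sketch below builds such a request with the Python standard library but does not send it. The endpoint URL is a placeholder assumption, not a documented 4store default, and `build_sparql_request` is a made-up helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: placeholder address; substitute wherever the endpoint runs.
ENDPOINT = "http://localhost:8000/sparql/"


def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL XML results format.
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


if __name__ == "__main__":
    req = build_sparql_request(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 10")
    print(req.get_method(), req.full_url)
    # To execute it: urllib.request.urlopen(req).read()
```

Going through HTTP this way also sidesteps the single-direct-connection restriction noted above, since clients never open the store directly.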
Sesame
- Very mature and heavily used (and integrated into other systems)
- Java
Associated Tools
OLAP
Cubulus
- http://sourceforge.net/projects/cubulus/
- Python
- Basic support for MDX
- Not recently updated (last update 2007 AFAICT)
pentaho
- Open source BI suite; bills itself as "The World's Most Popular Open Source BI Suite"
- Provides a full spectrum of business intelligence (BI) capabilities including query and reporting, interactive analysis, dashboards, data integration/ETL, data mining, and a BI platform.
Data Processing (ETL)
ETL = Extract, Transform, Load
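As a minimal sketch of the three stages, using only the Python standard library: extract rows from CSV text, transform them (trim whitespace, drop rows with a non-numeric amount), and load the result into an in-memory SQLite table. The sample data, the `payments` table, and the function names are all made up for illustration and are unrelated to the tools listed below.

```python
import csv
import io
import sqlite3

# Hypothetical input data; the "bob" row has an unparseable amount.
RAW_CSV = "name,amount\nalice, 10 \nbob,abc\ncarol,32\n"


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names and keep only rows with integer amounts."""
    clean = []
    for row in rows:
        try:
            amount = int(row["amount"].strip())
        except ValueError:
            continue  # skip rows whose amount is not a number
        clean.append((row["name"].strip().title(), amount))
    return clean


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (name TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


def run_etl(text):
    """Run all three stages against a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


if __name__ == "__main__":
    db = run_etl(RAW_CSV)
    print(db.execute("SELECT name, amount FROM payments ORDER BY name").fetchall())
```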
snaplogic
- SnapLogic is an Open Source Data Integration framework that combines the power of dynamic programming languages with standard Web interfaces to solve today's most pressing problems in data integration.
- Python-based