Projects/Where Does My Money Go/Data/Data-specification

Overview
This document is a work in progress, in which we specify how we would like to model the data available. We have indicated where the uncertainties are, and where more detailed investigation is needed.

Specifications refer to documents held in the Data summary unless otherwise noted.

Estimated workload

 * 4 days data processing
 * 4 days vector graphic processing
 * 2 days manual spreadsheet processing

Model
Spending in all spreadsheets can be defined as parcels. All spending parcels have:


 * Value: how much
 * Date: over an applicable time period
 * Location: regional area, national or overseas spending
 * Spending agency: who spends the money
 * Spending item: spending details of how, and on what

That is:
 * how much?
 * spent when?
 * spent where?
 * spent by whom?
 * spent on what?

These are our basic elements. From these elements we can create aggregations and slices using any combination of date, location, agency, item.
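As a minimal sketch of the parcel model, assuming a flat record per parcel: the class name `Parcel`, the field names, the `aggregate` helper and all sample values below are illustrative only and do not come from the source spreadsheets.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Parcel:
    """One spending parcel; field names are assumptions for illustration."""
    value: float    # how much
    period: str     # spent when (e.g. a financial-year label)
    location: str   # spent where
    agency: str     # spent by whom
    item: str       # spent on what


def aggregate(parcels, *dimensions):
    """Sum parcel values grouped by any combination of dimensions."""
    totals = defaultdict(float)
    for parcel in parcels:
        key = tuple(getattr(parcel, name) for name in dimensions)
        totals[key] += parcel.value
    return dict(totals)


# Made-up sample data, not real spending figures.
parcels = [
    Parcel(100.0, "2007-08", "London", "Agency A", "Item X"),
    Parcel(50.0, "2007-08", "London", "Agency B", "Item Y"),
    Parcel(25.0, "2008-09", "North East", "Agency A", "Item X"),
]

by_agency = aggregate(parcels, "agency")
by_period_and_location = aggregate(parcels, "period", "location")
```

Grouping on a tuple of attribute names keeps every combination of date, location, agency and item available without writing a separate function per slice.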

Spending items also exist within one or more classifications. Each source spreadsheet defines its own classification structure: items are grouped, and these groups may form a local hierarchy. Each low-level item is grouped with others, forming an aggregation, and so on.

These classification structures exist separately from the spending parcels. They are used to describe the way that parcels are combined into spending classifications.

If a classification is in use, each item points to its immediate parent. Each parent is also a spending item. Parents can point upward to other parents, and so on to the top of the hierarchy.

Each item can have more than one parent, allowing it to be linked to more than one spending hierarchy. This allows us to handle a local hierarchy, and to link to another one at the same time.
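One possible way to hold these parent links is a mapping from each item to a list of its immediate parents. The sketch below assumes that representation; the item names and the `ancestors` helper are hypothetical and are not taken from any source spreadsheet.

```python
# Each item maps to its immediate parents. An item may have several parents,
# one per hierarchy it belongs to. All names here are hypothetical.
parents = {
    "Item X": ["Local group 1", "Other group A"],
    "Local group 1": ["Local top"],
    "Other group A": ["Other top"],
}


def ancestors(item, parents):
    """Return every item reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(item, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


item_x_ancestors = ancestors("Item X", parents)
```

Here "Item X" sits in a local hierarchy and in a second one at the same time, so walking upward reaches the top of both.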

Spending items are further allocated within one or more top-level COFOG function/subfunction groups. Where a spending item is linked to more than one COFOG subfunction, program spending within each function is separately identified.
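A rough sketch of how a split across COFOG subfunctions might be recorded, assuming amounts are stored per (item, subfunction) pair and that subfunction codes take the form "NN.N" with the top-level function before the dot. The items, codes and amounts are made up for illustration.

```python
from collections import defaultdict

# (spending item, COFOG subfunction code) -> amount.
# An item linked to two subfunctions gets two entries, so the spending
# within each function stays separately identified. Values are made up.
allocations = {
    ("Item X", "07.3"): 60.0,
    ("Item X", "10.1"): 40.0,
    ("Item Y", "07.1"): 10.0,
}


def totals_by_function(allocations):
    """Roll subfunction amounts up to their top-level COFOG function."""
    totals = defaultdict(float)
    for (_item, subfunction), amount in allocations.items():
        function = subfunction.split(".")[0]
        totals[function] += amount
    return dict(totals)


function_totals = totals_by_function(allocations)
```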

Spending can be further subdivided into capital and current spending, but this split is not always defined.

We have a number of different hierarchies in operation:
 * spending authorities form a hierarchy;
 * locations form a hierarchy and are linked to the spending authorities;
 * spending programs are defined within hierarchies.

We can have different hierarchies at the same time. For instance, health authorities and government authorities don't use the same boundaries.

We will need to manually map spending functions to top-level COFOG functions in all local spending spreadsheets. We estimate 2 days' work for one person.

Mapping data
Detailed vector-graphic maps are available at http://www.statistics.gov.uk/geography/maps.asp; the same page gives an idea of how often boundaries have changed over the last 10 years.

Processing will be time-consuming and costly unless we can find a reliable pre-processed source.

Regional boundaries change over time, and sometimes names change too. (This can show up as discontinuities in spreadsheets.) If we want to do this in detail, we will need to process all 5 Administrative Geography maps and all 7 Health Geography maps.

Administrative geography
Tiers:
 * Government office regions
 * Tier 1 councils: metropolitan authorities, county councils
 * Tier 2 councils: district and borough councils

England has a total of 9 Government Office Regions (GORs), 150 Tier 1 authorities, and 248 Tier 2 authorities.

There are 5 Administrative Geography maps.

Health geography
This is a special case. Spending maps onto:


 * Strategic Health Authorities
 * NHS Primary Care Trusts

The England health map has 10 Strategic Health Authorities (SHAs) and 348 NHS Primary Care Trusts.

There are 7 Health Geography maps.

Spending structures are not well understood. There is a risk of double counting across SHAs and NHS trusts. We will need to do more research before we understand how to handle this data.

Statistical data
We have some very useful statistical data, specifically the documents in HM Revenue and Customs: Income tax statistics and distributions. This may allow us to make some quite detailed assumptions when estimating people's personal tax statements, and possibly also estimates of tax revenues.

Further investigation is needed to establish exactly what we have, and how we can use it in this domain, but we should aim for something like this:


 * tax rates against income
 * regional demographics: population, age, income, gender
 * details of taxpayers by GOR
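To illustrate how "tax rates against income" might feed a personal tax estimate, here is a minimal sketch of a banded calculation. The allowance, band limits and rates are hypothetical placeholders, not figures from the HM Revenue and Customs documents, and the function name is an assumption.

```python
def estimate_income_tax(income, allowance, bands):
    """Estimate tax from ascending (upper_limit, rate) bands.

    Band limits apply to taxable income (income minus allowance).
    An upper_limit of None marks the open-ended top band.
    """
    taxable = max(0.0, income - allowance)
    tax = 0.0
    lower = 0.0
    for upper, rate in bands:
        if upper is None or taxable < upper:
            tax += (taxable - lower) * rate
            break
        tax += (upper - lower) * rate
        lower = upper
    return tax


# Hypothetical values for illustration only; real rates would come from
# the statistical data once we have established what it contains.
HYPOTHETICAL_ALLOWANCE = 5000.0
HYPOTHETICAL_BANDS = [(30000.0, 0.20), (None, 0.40)]

estimate = estimate_income_tax(45000.0, HYPOTHETICAL_ALLOWANCE, HYPOTHETICAL_BANDS)
```

The regional demographics and taxpayer details by GOR would be needed to move from a single person's estimate to an estimate of tax revenues.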