Access; Compute; Reuse. How can we help people make better data-informed decisions?
To develop credible answers to complex questions, we must combine data from many sources.
Whether in the public or private realms, we need better ways of working. Given that everything from policy to finance to individual behaviour is now ‘codified’, this will require better combinations of processes and tools. It’s not a ‘tech’ problem.
1. Make data more than just ‘computable’
‘Data’ is not only a much broader term than ‘code’; it should also not be seen as a ‘technology’. It can come loaded with opinions about how, why and when it should be used.
There is more than just the numbers at play. We need to know:
- The raw data
- The circumstances of its collection (how and from where)
- How should it be used?
- What are the models for processing the data?
- Are the models valid?
- What methods and assumptions are used? (in words and equations)
- What is the significance of the assumptions being made and the limitations of the method(s)?
- What are the results of the processing?
Transparent and auditable disclosure (including peer review) requires the codification and publishing of the entire process under open or shared licences.
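As a minimal sketch of what codifying the items listed above might look like, the record below bundles the raw data with its context so that gaps in disclosure become visible. The class name `DataDisclosure`, its field names and the example values are all illustrative assumptions, not an established schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DataDisclosure:
    """Hypothetical record bundling a dataset with the context needed to review it."""
    raw_data: list                 # the raw observations
    collected_how: str             # circumstances of collection: method
    collected_from: str            # circumstances of collection: source
    intended_use: str              # how the data should be used
    model_description: str         # the model used to process the data
    assumptions: list = field(default_factory=list)
    limitations: list = field(default_factory=list)
    results: dict = field(default_factory=dict)
    licence: str = "unspecified"   # open or shared licence terms

    def missing_fields(self):
        """Return the names of fields left empty, so disclosure gaps are visible."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self):
        """Serialise the whole record so it can be published and reviewed."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only; none of this is real data.
record = DataDisclosure(
    raw_data=[12.1, 11.8, 12.4],
    collected_how="manual meter reading",
    collected_from="illustrative example site",
    intended_use="illustration only",
    model_description="arithmetic mean of readings",
    assumptions=["readings are independent"],
)
print(record.missing_fields())  # ['limitations', 'results']
```

Here the record flags that no limitations or results have been disclosed yet, which is the kind of check a reviewer would otherwise have to make by hand.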
Peer review itself needs to be transformed, and scaled up dramatically, now that technology has opened up potential usage to mass audiences. Especially in a public-policy context, governments and organisations could benefit greatly if the scrutiny and trust they seek were supported by both open data and open access to shared data on all of the above.
2. Create processes that make data useful
We have lots of prior art to build upon. Let’s consider six foundational pillars:
1. Collect: aggregate raw data, track and log its sources.
2. Enhance: embed provenance and add measures of the reliability and credibility of the source(s). Add taxonomies and semantic links that enable the data to be joined up.
3. Discover: make both machine-discovery and human-accessible tools so that it is easier to know what exists and in what form.
4. Repeatable Quality: measure and automate quality-control processes around the underlying data (akin to Six Sigma). Use both machine-scale and human-scale tests to avoid systemic errors.
5. Computable and Auditable: enable the models to be run and calculations to be performed. Build clear and usable tools that stimulate engagement. Maintain an audit history of both inputs and calculations to ensure repeatability.
6. Enable interoperability: use tools (e.g. open APIs) to enable systems to interconnect in a comprehensive manner. Technology is only a small part of the solution: the systems needed for data-sharing across silos must also manage a range of issues, from intellectual property and rights, consent and liability, to business and legal rules, security and policy.
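To make the ideas of logging sources and keeping an audit history of inputs and calculations more concrete, here is a minimal sketch. The helper names `fingerprint` and `run_audited`, the toy model `scaled_total` and the log layout are hypothetical, chosen only for illustration; a real system would also need to store the inputs themselves and version the model code.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(obj):
    """SHA-256 of a canonical JSON encoding, so equal content gives equal hashes."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_audited(model, inputs, source, audit_log):
    """Run `model` on `inputs` and append an audit entry describing the run."""
    result = model(inputs)
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,                    # provenance of the inputs
        "model": model.__name__,
        "input_hash": fingerprint(inputs),
        "result_hash": fingerprint(result),
    })
    return result


def scaled_total(inputs):
    """Hypothetical model: a value multiplied by a factor."""
    return {"total": inputs["value"] * inputs["factor"]}


log = []
first = run_audited(scaled_total, {"value": 100, "factor": 2}, "example survey", log)
second = run_audited(scaled_total, {"factor": 2, "value": 100}, "example survey", log)

# The same inputs in a different key order give the same hashes, which is
# what lets a reviewer confirm that a calculation was repeated exactly.
repeatable = (log[0]["input_hash"] == log[1]["input_hash"]
              and log[0]["result_hash"] == log[1]["result_hash"])
print(repeatable)  # True
```

Hashing a canonical encoding rather than the raw text is a design choice: it means cosmetic differences in how inputs are written do not look like a different calculation.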
3. Build tools that enable broad engagement
Codifying data, models and methodologies is only one part of the jigsaw. Being able to use them and compute with them should not require a degree in data science. We need to move beyond storing and re-broadcasting data and help people perform calculations based on their own inputs to models. And, given the complexities, we need tools that can take us far beyond spreadsheets and into a better understanding of the whole system.
Providing access to raw data is vital, in a useful, repeatable and traceable form under either open or shared licensing. Equally, releasing the code behind the models is important for review, but it is also dangerous if everyone just re-runs the same code with the same baked-in implicit and explicit models, assumptions and errors. The code and models should be independently rewritten from scratch as many times as possible, to reduce the chance that any one implementation affected the results.
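One way to picture the value of independent rewrites is a cross-check that runs separately written implementations of the same model on the same inputs and reports whether they agree. The sketch below uses a deliberately trivial model (a mean); the function names and the tolerance are assumptions made for illustration.

```python
import math
import statistics


def mean_running(values):
    """Implementation A: incremental (running) mean."""
    mean = 0.0
    for count, value in enumerate(values, start=1):
        mean += (value - mean) / count
    return mean


def mean_library(values):
    """Implementation B: written separately, using the standard library."""
    return statistics.fmean(values)


def cross_check(implementations, inputs, rel_tol=1e-9):
    """Run every implementation on the same inputs and report whether they agree."""
    results = {fn.__name__: fn(inputs) for fn in implementations}
    reference = next(iter(results.values()))
    agree = all(math.isclose(r, reference, rel_tol=rel_tol)
                for r in results.values())
    return agree, results


agree, results = cross_check([mean_running, mean_library], [2.0, 4.0, 6.0, 8.0])
print(agree)  # True
```

Agreement between implementations does not prove the model is valid, but a disagreement is a cheap and early signal that at least one of them carries an error or a hidden assumption.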
If we don’t build credible ‘stacks’ that address data-sharing and translation, with both a public and private sector focus, we will increase risks, lose trust, miss opportunities and lose the potential for innovation. We have to comprehensively join the dots to see what is real and what is not, and build foundational credibility into our data infrastructure.