As I mentioned previously, I’ve spent most of my career working with other people’s data. Building on some of the common challenges, I will use this post to share my experience-based perspective on the ‘state of the art’ in making data useful.
My journey to here has been guided by a series of past projects, some of which I now view as failures. The largest failure and missed opportunity was a portal for environmental observation data called the Water and Environmental hub (WEHUB). WEHUB was in some ways ahead of its time. It indexed and provided a single point of entry to thousands of local and federated sources of water and environmental data from government, private, and academic sources. It also:
- supported various OGC standards.
- integrated the CUAHSI HIS semantic model, which enabled referencing of observed properties between datasets.
- enabled private, group, or public access control for all datasets, which provided a way to share and collaborate on data.
- had a relatively sophisticated API that enabled the discovery and access of datasets, both those stored locally and those reached by translating other external APIs into a common format.
In many ways WEHUB was a representation of the state of the art, at least in an academic sense. There was just one problem… few really used it. I’ve spent many nights trying to understand why this project was ultimately doomed to join the graveyard of Open Data portals. I came to the conclusion that the data was our biggest flaw. While we were able to broker access relatively effectively, we had little control over the data itself and were unable to affect the reliability, predictability, and ultimately usability of the data we provided via WEHUB. We had something for everybody, but the catalogue did not target or focus on any particular user type.
Since then, two projects have directly benefitted from the lessons of WEHUB. The first was the Provincial Growth and Yield Initiative Plot Sharing App. PGYI (or ‘piggy’, as it was lovingly called) was created to enable 16 government and industrial members of the Forest Growth Organization of Western Canada (FGrOW) to seamlessly share forest plot measurement data with each other and know that the data will be interoperable and up to quality assurance specifications. The specifications were a little complicated, and our team’s initial attempt at implementing a Python backend failed to handle the sprawling permutations of how CSV data can vary. Structural issues, dialect issues, logical issues, encoding issues, and data size issues compounded into what seemed like infinite exceptions and edge cases, and we felt there was no end in sight. We were failing.
Yves Richard and I joined the project to see what we could do to help. We thought we could get things back on track within 3 months (wishful thinking on the 3 months part) with some new thinking, leveraging a distributed systems approach and the capabilities of Amazon Web Services. We understood the challenge to be primarily one of separation of concerns. Each data issue would have to be detected and dealt with in a critical path, as completing the validation in one step was a nightmare.
We set off to design a new architecture which would systematically deal with all things CSV. We had been following Max Ogden and his efforts on Dat for a while, and through his GitHub activity we became familiar with the Open Knowledge Foundation and a set of standards which have since become the Frictionless Data initiative.
In particular, the Data Package Standard and JSON Table Schema appeared to be the silver bullet we were looking for: one standard for packaging CSV data with its metadata, and another for describing its expected contents in a detailed and extensible way.
Leveraging these standards, we decided we would import data by first verifying the encoding, structure, and dialect of the CSV. If that passed, we would verify its content against the JSON Table Schemas. Finally, knowing we had valid files and valid contents, we could import the data into PostgreSQL and leverage stored procedures to accurately validate complex business rules relating to biological validation (trees don’t grow backwards, etc.).
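As a rough illustration of that staged approach, here is a minimal Python sketch. It is not the code we ran (that was built on Awk, as described next); the schema, field names, and function names are invented for the example, and it assumes comma-delimited UTF-8 input rather than detecting the dialect.

```python
import csv
import io

# Hypothetical schema in the JSON Table Schema style; field names are invented.
SCHEMA = {
    "fields": [
        {"name": "plot_id", "type": "string", "constraints": {"required": True}},
        {"name": "tree_id", "type": "integer", "constraints": {"required": True}},
        {"name": "height_m", "type": "number", "constraints": {"minimum": 0}},
    ]
}

# Only the three types used above are handled in this sketch.
CASTS = {"string": str, "integer": int, "number": float}


def check_encoding(raw):
    """Stage 1: the bytes must decode as UTF-8."""
    try:
        return raw.decode("utf-8"), []
    except UnicodeDecodeError as exc:
        return None, [f"encoding: {exc.reason} at byte {exc.start}"]


def check_structure(text, schema):
    """Stage 2: the header must match the schema and every row must have the same width."""
    expected = [field["name"] for field in schema["fields"]]
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != expected:
        return None, [f"structure: expected header {expected}"]
    errors = [
        f"structure: row {line_no} has {len(row)} columns, expected {len(expected)}"
        for line_no, row in enumerate(rows[1:], start=2)
        if len(row) != len(expected)
    ]
    return (None if errors else rows[1:]), errors


def check_content(rows, schema):
    """Stage 3: cast each value and apply the 'required' and 'minimum' constraints."""
    errors = []
    for line_no, row in enumerate(rows, start=2):
        for field, value in zip(schema["fields"], row):
            name, constraints = field["name"], field.get("constraints", {})
            if value == "":
                if constraints.get("required"):
                    errors.append(f"content: line {line_no}: {name} is required")
                continue
            try:
                cast = CASTS[field["type"]](value)
            except ValueError:
                errors.append(f"content: line {line_no}: {name}={value!r} is not {field['type']}")
                continue
            if "minimum" in constraints and cast < constraints["minimum"]:
                errors.append(f"content: line {line_no}: {name}={value} is below {constraints['minimum']}")
    return errors


def validate(raw, schema=SCHEMA):
    """Run the stages in order and stop at the first stage that reports errors."""
    text, errors = check_encoding(raw)
    if errors:
        return errors
    rows, errors = check_structure(text, schema)
    if errors:
        return errors
    return check_content(rows, schema)
```

The point of the ordering is that each stage can assume the previous one passed, so the content checks never have to worry about undecodable bytes or ragged rows.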
We found that higher-level languages were less reliable in terms of handling the sprawling permutations of CSV data we were seeing. We ended up creating bawlk, leveraging Awk (designed for text processing), a lower-level Linux goodie that enabled us to efficiently and reliably validate and sanitize massive amounts of CSV data. We created a simple data pipeline to deploy the output scripts. Essentially, CSV data would land in an S3 bucket, where it would be picked up by datapackage-validator-awslambda and validated. This gave us a distributed systems approach for handling many consecutive operations. Imports would either pass as valid or write out specific and detailed validation errors into the S3 bucket, where our API could readily access them when requested by the web-client.
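The shape of that pipeline can be sketched in a few lines of Python. This is an illustration only, not the datapackage-validator-awslambda code: `read_object`, `write_object`, and `validate` are hypothetical callables standing in for the real S3 calls and the Awk-based validation, and the `.errors.json` suffix is an invented naming convention. The only thing taken as given is the layout of an S3 event notification.

```python
import json
from urllib.parse import unquote_plus


def handle_s3_event(event, read_object, write_object, validate):
    """Validate each uploaded CSV named in an S3 event notification.

    read_object(bucket, key) -> bytes and write_object(bucket, key, body)
    are hypothetical stand-ins for real storage calls; validate(raw) returns
    a list of error strings (empty when the file is valid).
    """
    results = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        errors = validate(read_object(bucket, key))
        if errors:
            # Store the report next to the upload so an API can serve it later.
            report = json.dumps(errors).encode("utf-8")
            write_object(bucket, key + ".errors.json", report)
        results.append({"key": key, "valid": not errors})
    return results
```

In a real deployment the error reports would need to go under a prefix (or a separate bucket) that does not trigger the function again.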
With this basic level of structural and content assurance in place we could reliably load the data into an RDS PostgreSQL store where further logical validation could be performed without the likelihood of the database choking on structural and content issues.
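To give a flavour of what a logical rule looks like once the structure and content can be trusted, here is the ‘trees don’t grow backwards’ check sketched in Python. In PGYI this kind of rule lived in PostgreSQL stored procedures; the function below only illustrates the logic, and the record keys (`tree_id`, `year`, `height_m`) are assumptions made for the example.

```python
from collections import defaultdict


def find_shrinking_trees(measurements):
    """Return (tree_id, earlier_year, later_year) wherever a height decreased.

    measurements: iterable of dicts with the hypothetical keys
    'tree_id', 'year', and 'height_m'.
    """
    by_tree = defaultdict(list)
    for m in measurements:
        by_tree[m["tree_id"]].append((m["year"], m["height_m"]))

    violations = []
    for tree_id, series in by_tree.items():
        series.sort()  # order each tree's measurements by year
        # Compare every measurement with the one that follows it.
        for (year0, height0), (year1, height1) in zip(series, series[1:]):
            if height1 < height0:
                violations.append((tree_id, year0, year1))
    return violations
```

Because the earlier stages guarantee that `year` and `height_m` are present and numeric, the rule itself stays short and free of defensive parsing.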
We had solved half the problem. We had the back-end in place to ensure that the data we were sharing was reliable. It was either error free or the errors that existed were systematically identified. PGYI was primarily oriented towards power users who would export the data into their modelling and analytics systems. As such, we didn’t take it much further than some basic summary and reporting functions. From their perspective, having a common plot data store which enabled contribution by member companies according to a common specification was the win. This dataset, which met their standards, could now be used for modelling and forestry analytics applications without each individual taking on the quality assurance required to synthesize the data. From my perspective, what was most striking about this project was that this shared contribution model enabled potentially competitive companies to create a common dataset which provided common ROI to all involved.
Taking it Further
When we started a project with the Gordon Foundation and the Government of Northwest Territories, later to become the MacKenzie Datastream, we were keen to take this concept further. DataStream’s mission is to promote knowledge sharing and advance collaborative and evidence-based decision-making throughout the Basin. The MacKenzie basin is extremely large, measuring 1.8 million square kilometres, and as such monitoring is a large challenge. To overcome this challenge, water quality monitoring is carried out by a variety of partners which include communities and Aboriginal, territorial and federal governments. With multiple parties collecting and sharing information, MacKenzie DataStream had to overcome challenges of trust and interoperability. A community-based monitoring system would be no use if each dataset varied and trust and interoperability couldn’t be established.
We understood that reliable data was the foundation of usable and interoperable data, and we were keen to apply the methods we discovered with the PGYI project. We applied bawlk and the related AWS technology within a simpler user-facing validation workflow. We developed a method of generating CSV templates from the JSON Table Schemas that a Data Steward could download to prepare their data for import. Using the template, the Data Steward could access a simple user interface and upload the data. They would receive warning messages for any content issues and could readily download and fix any validation errors.
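Generating a template from a schema is about as simple as it sounds. The sketch below shows the idea in Python, assuming the template is just a header row in schema field order; the function name is invented for illustration and the real templates may have carried more than this.

```python
import csv
import io


def csv_template(schema):
    """Return a header-only CSV whose columns follow the schema's field order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([field["name"] for field in schema["fields"]])
    return buffer.getvalue()
```

Going through the `csv` module rather than joining names with commas means a field name that itself contains a comma or a quote is still written correctly.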
To make the concept of JSON Table Schemas more interpretable we created an interface for it that we referred to as a Data Theme. The Data Steward view of a Data Theme was essentially a friendly schema definition defining the fields required, the validation constraints, and the values expected within.
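One way to picture a Data Theme is as a plain-language rendering of the schema. The following Python sketch is an assumption about how such a rendering could work, not the interface we built; it reads only a handful of JSON Table Schema constraints (`required`, `minimum`, `maximum`, `enum`) and treats a field with no declared type as a string.

```python
def describe_field(field):
    """Turn one schema field into a short plain-language description."""
    constraints = field.get("constraints", {})
    parts = [f"{field['name']} ({field.get('type', 'string')})"]
    parts.append("required" if constraints.get("required") else "optional")
    if "minimum" in constraints:
        parts.append(f"at least {constraints['minimum']}")
    if "maximum" in constraints:
        parts.append(f"at most {constraints['maximum']}")
    if "enum" in constraints:
        parts.append("one of: " + ", ".join(str(v) for v in constraints["enum"]))
    return ", ".join(parts)


def describe_theme(schema):
    """Describe every field in the schema, in field order."""
    return [describe_field(field) for field in schema["fields"]]
```

For a hypothetical `ph` field declared as a required number between 0 and 14, this yields the line `ph (number), required, at least 0, at most 14`.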
Data Themes also empowered what we think of as predictable data. By ensuring conformity to a well-thought-out structure, we can do more with the data in the system. Perhaps more importantly, this predictable data enables users of the Datastream data to understand what they are getting and decide what additional quality assurance they might need to apply for their purpose. Scientists and consultants, for instance, can be sure that a numeric field will not have a random field note comment (perhaps a string containing special characters) that might have caused us havoc in the past.
For data to be useful it has to be analyzed or interpreted. We struggled here in the past, where the large variety of datasets of unknown structure in platforms like WEHUB made meaningful interpretation difficult. Every dataset needed to be thought of differently. This flexibility pushes the responsibility for visualization and interpretation to the end-user, who may or may not know how to make meaningful interpretations. In the case of Datastream, we leveraged the predictable data to generate a set of standard visualizations for each dataset. We connected the data API to Plot.ly and were able to produce meaningful visualizations without difficulty. Plot.ly is a comprehensive and well-supported visualization platform. It handles the basic chart types you’d expect but also provides us the foundation for more scientific chart types like box plots, which will inevitably make the interpretation of water quality data easier.
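Because every dataset has the same shape, one small function can produce the same chart for all of them. The Python sketch below builds a figure description in the `data`/`layout` form that Plot.ly figures use, with one box trace per site. How the figure was actually handed to Plot.ly in Datastream is not shown here, and the observation keys (`site`, `parameter`, `value`) are assumptions for the example.

```python
from collections import defaultdict


def box_plot_figure(observations, parameter):
    """Build a Plotly-style figure dict with one box trace per site.

    observations: iterable of dicts with the hypothetical keys
    'site', 'parameter', and 'value'.
    """
    by_site = defaultdict(list)
    for obs in observations:
        if obs["parameter"] == parameter:
            by_site[obs["site"]].append(obs["value"])
    return {
        "data": [
            {"type": "box", "name": site, "y": values}
            for site, values in sorted(by_site.items())
        ],
        "layout": {
            "title": {"text": parameter},
            "yaxis": {"title": {"text": parameter}},
        },
    }
```

The resulting dict can be serialized with `json.dumps` and passed to whichever Plot.ly client is in use.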
With the visualizations in place the data was starting to feel useful. We had moved beyond the point where in the past we would start running into significant friction. We took this further and created a component for simple statistical summaries which displayed for each parameter at whichever location the user had selected. While hardly advanced analytics it enabled us to prove the concept that predictability was also a good foundation for analytics and the platform may extend in that direction in the future.
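The summaries themselves need nothing more than the standard library once the values are known to be numeric. A minimal Python sketch, again assuming hypothetical observation keys (`site`, `parameter`, `value`) and an invented function name:

```python
import statistics
from collections import defaultdict


def summarize(observations):
    """Return basic statistics for each (site, parameter) pair."""
    groups = defaultdict(list)
    for obs in observations:
        groups[(obs["site"], obs["parameter"])].append(obs["value"])
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for key, values in groups.items()
    }
```

Looking up `summarize(observations)[(site, parameter)]` then gives the figures to display for whichever location and parameter the user has selected.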
Datastream had solved a significant portion of the problem that often remained in Open Data catalogues. We had developed a foundation and the quality assurance technology to ensure the foundation worked as intended. From there we demonstrated some basic visualization and interpretation capabilities and provided a platform for expansion. Perhaps, most importantly, the data was open, really open.
The partners involved in Datastream committed to Open Data. Not the kind of open data that we discussed previously and that makes data users sad, but the real deal. The Data Policy on Datastream identifies making data broadly available without restriction (Open Access). Open Access is a growing movement worldwide. It enables leading organizations to contribute to the data commons, which in turn enables scientists to better understand and model our world. The principles guiding Datastream’s data policy are as follows:
1. Ethically Open Access
4. Security and Sustainability
Make Data Useful
We continue to advance Datastream and other efforts to make data useful at Tesera. We certainly don’t have it figured out but I believe the fundamentals of reliable, predictable, useful, and open data are on the right track as the projects we are developing are gaining traction. Datastream is expanding in the Northwest Territories and two other significant groups are actively working on creating their own regional Datastreams. The PGYI tool has consolidated a lot of industry data and is producing quite the dataset. The efforts are helping us engage with new industry partners who are looking to gain more business value out of their data.
The Tesera team is working to advance the microservices infrastructure that drives Datastream within AWS. We are also exploring integrating technology including the powerful Elastic database to provide more robust and flexible handling of observation data. We are running a few internal experiments including the development of a product we call Lightshed which makes our company’s business data more transparent to team members. We look forward to working with our team and our clients to make more data more useful!