INSIGHTS

Stories of project adventures and learnings told by our team data experts.

Make Data Useful (Part 2)

As I mentioned previously, I've spent most of my career working with other people's data. Building on some of those common challenges, I'll use this post to share my experience-based perspective on the 'state of the art' of making data useful.

Failures

My journey to this point has been guided by a series of past projects, some of which I now view as failures. The largest failure and missed opportunity was a portal for environmental observation data called the Water and Environmental Hub (WEHUB). WEHUB was in some ways ahead of its time. It indexed and provided a single point of entry to thousands of local and federated sources of water and environmental data from government, private, and academic sources.

WEHUB:

  • supported various OGC standards.
  • integrated the CUAHSI HIS semantic model, which enabled referencing of observed properties between datasets.
  • enabled private, group, or public access control for all datasets, providing a way to share and collaborate on data.
  • had a relatively sophisticated API that enabled discovery and access of datasets stored locally as well as in external systems, by translating other APIs into a common format.

In many ways WEHUB was a representation of the state of the art, at least in an academic sense. There was just one problem… few really used it. I've spent many nights trying to understand why this project was ultimately doomed to join the graveyard of Open Data portals. I came to the conclusion that the data was our biggest flaw. While we were able to broker access relatively effectively, we had little control over the data itself and were unable to affect the reliability, predictability, and ultimately the usability of the data we provided via WEHUB. We had something for everybody, but the catalogue did not target or focus on any particular user type.

The Beginning

Since then, two projects have directly benefitted from the learnings of WEHUB. The first was the Provincial Growth and Yield Initiative Plot Sharing App. PGYI, or "piggy" as it was lovingly called, was created to enable 16 government and industrial members of the Forest Growth Organization of Western Canada (FGrOW) to seamlessly share forest plot measurement data with each other and know that the data would be interoperable and up to quality assurance specifications. The specifications were a little complicated, and our team's initial attempt at implementing a Python backend failed to handle the sprawling permutations of how CSV data can vary. Structural issues, dialect issues, logical issues, encoding issues, and data size issues compounded into what seemed like infinite exceptions and edge cases, and we felt there was no end in sight. We were failing.

Yves Richard and I joined the project to see what we could do to help. We thought we could get things back on track within 3 months (wishful thinking on the 3 months part) with some new thinking, leveraging a distributed systems approach and the capabilities of Amazon Web Services. We understood the challenge to be primarily one of separation of concerns. Each class of data issue would have to be detected and dealt with at its own point in a critical path, as completing the validation in one step was a nightmare.

We set off to design a new architecture which would systematically deal with all things CSV. We had been following Max Ogden and his efforts on Dat for a while, and through his GitHub activity we became familiar with the Open Knowledge Foundation and a set of standards which have since become the Frictionless Data initiative.

In particular, the Data Package standard and JSON Table Schema appeared to be the silver bullet we were looking for: one standard for packaging CSV data with its metadata, and another to describe a file's expected contents in a detailed and extensible way.
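To make that concrete, here is a minimal, hypothetical Table Schema for a plot measurement file, written as a Python dict. The field names and constraints are made up for illustration and are not the actual PGYI specification.

# A hypothetical Table Schema for a CSV of tree measurements (illustrative only).
tree_measurement_schema = {
    "fields": [
        {"name": "plot_id", "type": "string", "constraints": {"required": True}},
        {"name": "measurement_year", "type": "integer",
         "constraints": {"required": True, "minimum": 1950, "maximum": 2100}},
        {"name": "species", "type": "string",
         "constraints": {"enum": ["PL", "SW", "AW"]}},  # allowed species codes
        {"name": "height_m", "type": "number",
         "constraints": {"minimum": 0}},                # trees don't grow backwards
    ],
    "primaryKey": ["plot_id", "measurement_year"],
}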

Leveraging these standards, we decided we would import data by first verifying the encoding, structure, and dialect of the CSV. If that passed, we would verify its content against the JSON Table Schemas. Finally, knowing we had valid files and valid contents, we could import the data into PostgreSQL and leverage stored procedures to accurately validate complex business rules relating to biological validation (trees don't grow backwards, etc.).
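A minimal sketch of the first two stages, using only Python's standard library, might look like the following. This is not our production implementation (which, as described below, ended up in Awk), and the schema argument is assumed to be a dict like the hypothetical one above.

import csv
import io

def check_structure(raw_bytes, field_names):
    """Stage 1: encoding, dialect, and structural checks before touching content."""
    try:
        text = raw_bytes.decode("utf-8")            # encoding check
    except UnicodeDecodeError as exc:
        return [], [f"encoding error: {exc}"]
    dialect = csv.Sniffer().sniff(text[:4096])      # dialect check (delimiter, quoting)
    rows = list(csv.reader(io.StringIO(text), dialect))
    errors = []
    if rows and rows[0] != field_names:
        errors.append(f"unexpected header: {rows[0]}")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(field_names):            # ragged rows
            errors.append(f"line {line_no}: expected {len(field_names)} columns, got {len(row)}")
    return rows, errors

def check_content(rows, schema):
    """Stage 2: field-level checks driven by a Table Schema-style dict (see above)."""
    errors = []
    for line_no, row in enumerate(rows[1:], start=2):
        for value, field in zip(row, schema["fields"]):
            constraints = field.get("constraints", {})
            if constraints.get("required") and value == "":
                errors.append(f"line {line_no}: {field['name']} is required")
                continue
            if field["type"] in ("integer", "number") and value != "":
                try:
                    number = float(value)
                except ValueError:
                    errors.append(f"line {line_no}: {field['name']} is not numeric")
                    continue
                if "minimum" in constraints and number < constraints["minimum"]:
                    errors.append(f"line {line_no}: {field['name']} is below the minimum")
                if "maximum" in constraints and number > constraints["maximum"]:
                    errors.append(f"line {line_no}: {field['name']} is above the maximum")
            if "enum" in constraints and value != "" and value not in constraints["enum"]:
                errors.append(f"line {line_no}: {field['name']} is not an allowed value")
    return errors

Only once both stages pass cleanly does the data move on to the database, where stored procedures handle the biological business rules.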

Reliable Data

We found that higher-level languages were less reliable at handling the sprawling permutations of CSV data we were seeing. We ended up creating bawlk, leveraging Awk (designed for text processing), a lower-level Unix goodie that enabled us to efficiently and reliably validate and sanitize massive amounts of CSV data. We created a simple data pipeline to deploy the generated scripts. Essentially, CSV data would land in an S3 bucket, where it would be picked up by datapackage-validator-awslambda and validated. This gave us a distributed systems approach for handling many concurrent operations. Imports would either pass as valid, or specific and detailed validation errors would be written to the S3 bucket, where our API could readily access them when requested by the web client.
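The shape of that pipeline is sketched below. This is not the actual datapackage-validator-awslambda code; the event handling follows the standard S3-triggered Lambda pattern, but the bucket layout, key naming, and the call out to a generated Awk script are assumptions for illustration.

import subprocess
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Hypothetical Lambda handler: validate a CSV dropped into S3 and write results back."""
    record = event["Records"][0]["s3"]          # standard S3 event structure
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    local_path = "/tmp/" + key.split("/")[-1]   # Lambda's writable scratch space
    s3.download_file(bucket, key, local_path)

    # Run a generated Awk validation script (of the kind bawlk produces) over the file.
    result = subprocess.run(["awk", "-f", "/tmp/validate.awk", local_path],
                            capture_output=True, text=True)

    if result.returncode == 0 and not result.stdout:
        s3.put_object(Bucket=bucket, Key=key + ".valid", Body=b"ok")
    else:
        # Detailed violations land beside the upload, where the API and web client can fetch them.
        s3.put_object(Bucket=bucket, Key=key + ".errors.csv",
                      Body=result.stdout.encode("utf-8"))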

The Import Violation Process in the Provincial Growth and Yield Initiative Plot Sharing App.

With this basic level of structural and content assurance in place we could reliably load the data into an RDS PostgreSQL store where further logical validation could be performed without the likelihood of the database choking on structural and content issues.

We had solved half the problem. We had the back-end in place to ensure that the data we were sharing was reliable: it was either error-free, or the errors that existed were systematically identified. PGYI was primarily oriented towards power users who would export the data into their own modelling and analytics systems. As such, we didn't take it much further than some basic summary and reporting functions. From their perspective, having a common plot data store that enabled contribution by member companies according to a common specification was the win. This dataset, which met their standards, could now be used for modelling and forestry analytics applications without each individual taking on the quality assurance required to synthesize the data. From my perspective, what was most striking about this project was that this shared contribution model enabled potentially competitive companies to create a common dataset which provided common ROI to all involved.

Taking it Further

When we started a project with the Gordon Foundation and the Government of the Northwest Territories, which would later become the MacKenzie DataStream, we were keen to take this concept further. DataStream's mission is to promote knowledge sharing and advance collaborative and evidence-based decision-making throughout the Basin. The MacKenzie basin is extremely large, measuring 1.8 million square kilometres, and as such monitoring is a large challenge. To overcome this challenge, water quality monitoring is carried out by a variety of partners, including communities and Aboriginal, territorial, and federal governments. With multiple parties collecting and sharing information, MacKenzie DataStream had to overcome challenges of trust and interoperability. A community-based monitoring system would be of no use if each dataset varied and trust and interoperability couldn't be established.

We understood that reliable data was the foundation of usable and interoperable data, and we were keen to apply the methods we discovered on the PGYI project. We applied bawlk and the related AWS technology within a simpler, user-facing validation workflow. We developed a method of generating CSV templates from the JSON Table Schemas, which a Data Steward could download to prepare their data for import. Using the template, the Data Steward could access a simple user interface and upload the data. They would receive warning messages for any content issues and could readily download and fix any validation errors.
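Generating a template from a Table Schema is a small piece of code. A minimal sketch, assuming a schema dict like the hypothetical example earlier, might look like this:

import csv

def write_csv_template(schema, path):
    """Write a header-only CSV template whose columns match the Table Schema field order."""
    field_names = [field["name"] for field in schema["fields"]]
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(field_names)

# e.g. write_csv_template(tree_measurement_schema, "upload_template.csv")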

Using the BAWLK pipeline to validate uploads.

To make the concept of JSON Table Schemas more interpretable we created an interface for it that we referred to as a Data Theme. The Data Steward view of a Data Theme was essentially a friendly schema definition defining the fields required, the validation constraints, and the values expected within.

Data theme user interface (User Facing JSON Table Schema)

Predictable Data

Data Themes also empowered what we think of as predictable data. By ensuring conformity to a well-thought-out structure, we can do more with the data in the system. Perhaps more importantly, this predictability enables users of DataStream data to understand what they are getting and decide what additional quality assurance they might need to apply for their purpose. Scientists and consultants, for instance, can be sure that a numeric field will not contain a random field-note comment (perhaps a string containing special characters) of the kind that caused us havoc in the past.

Useful Data

For data to be useful it has to be analyzed or interpreted. We struggled here in the past, where the large variety of datasets of unknown structure in platforms like WEHUB made meaningful interpretation difficult. Every dataset needed to be thought of differently, and that flexibility pushes the responsibility for visualization and interpretation onto the end user, who may or may not know how to make meaningful interpretations. In the case of DataStream, we leveraged the predictable data to generate a set of standard visualizations for each dataset. We connected the data API to Plot.ly and were able to produce meaningful visualizations without difficulty. Plot.ly is a comprehensive and well-supported visualization platform. It handles the basic chart types you'd expect but also provides the foundation for more scientific chart types, like box plots, which will inevitably make the interpretation of water quality data easier.

Using Plot.ly API to generate a timeline graph for temperature at multiple locations.
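As a rough illustration of the kind of chart that caption describes, the Plotly Python library can produce a multi-location temperature timeline in a few lines. This is not DataStream's actual integration, and the column names and values are made up.

import pandas as pd
import plotly.graph_objects as go

# Hypothetical observations in the predictable shape a Data Theme guarantees.
obs = pd.DataFrame({
    "location": ["Fort Smith", "Fort Smith", "Hay River", "Hay River"],
    "date": pd.to_datetime(["2016-06-01", "2016-07-01", "2016-06-01", "2016-07-01"]),
    "temperature_c": [12.1, 16.4, 11.3, 15.8],
})

fig = go.Figure()
for location, group in obs.groupby("location"):
    fig.add_trace(go.Scatter(x=group["date"], y=group["temperature_c"],
                             mode="lines+markers", name=location))
fig.update_layout(title="Water temperature by location",
                  xaxis_title="Date", yaxis_title="Temperature (°C)")
fig.show()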

With the visualizations in place, the data was starting to feel useful. We had moved beyond the point where, in the past, we would start running into significant friction. We took this further and created a component for simple statistical summaries, displayed for each parameter at whichever location the user had selected. While hardly advanced analytics, it let us prove the concept that predictability is also a good foundation for analytics, and the platform may extend in that direction in the future.

Simple analytics on predictable data
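On predictable data, that kind of summary reduces to very little code. A hedged pandas sketch, again with made-up column names:

import pandas as pd

# Hypothetical long-format observations: one row per parameter reading at a location.
obs = pd.DataFrame({
    "location": ["Fort Smith"] * 4 + ["Hay River"] * 2,
    "parameter": ["temperature_c", "temperature_c", "ph", "ph", "temperature_c", "ph"],
    "value": [12.1, 16.4, 7.8, 7.9, 11.3, 8.0],
})

# Summary statistics for each parameter at the location the user selected.
selected = obs[obs["location"] == "Fort Smith"]
print(selected.groupby("parameter")["value"].describe())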

Datastream had solved a significant portion of the problem that often remained in Open Data catalogues. We had developed a foundation and the quality assurance technology to ensure the foundation worked as intended. From there we demonstrated some basic visualization and interpretation capabilities and provided a platform for expansion. Perhaps, most importantly, the data was open, really open.

Open Data

The partners involved in DataStream committed to Open Data. Not the kind of open data that we discussed previously and that makes data users sad, but the real deal. The Data Policy on DataStream calls for making data broadly available without restriction (Open Access). Open Access is a growing movement worldwide. It enables leading organizations to contribute to the data commons, which in turn enables scientists to better understand and model our world. The principles guiding DataStream's data policy are as follows:

1. Ethically Open Access

2. Quality

3. Interoperability

4. Security and Sustainability

Make Data Useful

We continue to advance DataStream and other efforts to make data useful at Tesera. We certainly don't have it all figured out, but I believe the fundamentals of reliable, predictable, useful, and open data are on the right track, as the projects we are developing are gaining traction. DataStream is expanding in the Northwest Territories, and two other significant groups are actively working on creating their own regional DataStreams. The PGYI tool has consolidated a lot of industry data and is producing quite the dataset. These efforts are helping us engage with new industry partners who are looking to gain more business value out of their data.

The Tesera team is working to advance the microservices infrastructure that drives DataStream within AWS. We are also exploring integrating technologies such as Elasticsearch to provide more robust and flexible handling of observation data. We are running a few internal experiments, including the development of a product we call Lightshed, which makes our company's business data more transparent to team members. We look forward to working with our team and our clients to make more data more useful!



Ranunculus lapponicus — explore this dataset TODAY!

Ranunculus lapponicus is extremely fun to say, and it is also the scientific name for Lapland buttercup. I know this because I'm an insider who helped the Alberta Biodiversity Monitoring Institute (ABMI) build their mapping portal.

What is the map portal for?

Launched TODAY, the mapping portal connects users to thousands of ABMI’s datasets and provides the ability to easily compare, share, print, and get detailed statistical summaries.

THE TEAM

For this project, Team Awesome consisted of 7 skilled and passionate team members.

All work was remote and the group that brought the portal to life looked like this:

ABMI

Tara Narwani — Project Consultant
Katherine Maxcy — Project Coordinator
Joan Fang — Systems Analyst

Tesera

Lori Logan — Product Manager
Spencer Cox — Technical Advisor
Yves Richard — Senior Developer
April Damaso — UX Designer

BUILT REMOTELY

Our team has been working remotely for over a decade. When we collaborate with clients, we onboard them and teach them about our communication norms, our remote-focused project tools, and our version of agile.

Slack for team communication

Each project gets its own dedicated "channel" which simply acts as the one place for team communication. All Tesera projects/channels are open to our entire team, regardless of whether members are officially assigned to the project. We do this to foster open and transparent communication and to maximize shared learnings.

Trello for task management

We love Trello, and each project also gets its own dedicated board. All tasks are managed on the board and everything moves from left to right. We even have our own feel-good mantra for this — #ttr (to the right). Work is scheduled and prioritized for the week (a Sprint). Product Owners sign off on prioritized cards at the beginning of the week, and at the end of the week, work is either accepted or moved back for further iteration.

Google Drive for document storage and collaboration

Projects are organized under a main folder simply named Collaboration Space. Beneath this we have standard folders but we keep structure flexible and try not to nest folders within folders — we use search instead! 🙂

CodeShip for continuous integration

We use CodeShip for continuous integration so that changes can be seen as they are made. The minute a developer commits to GitHub, our client can see the latest and greatest version of the app. This enables continuous feedback, and direction can be provided significantly sooner. This ultimately saves time and cost.

LEARN FAST

This project had a moderate budget and a tight timeline. We decided that a concurrent design and development approach made the most sense. Designs started as low fidelity (paper) and progressed to InVision for a high-fidelity experience, while at the same time our developer built the basic map viewer features and primary statistical summary functionality.

As part of our transparent development process, we use continuous integration so our clients see their product evolve. With ABMI, we used weekly meetings to review progress and demo the current functionality. This was key to rapid development, in that changes to feature requirements became obvious once users got a chance to interact with the product.

SHARE WHAT YOU LEARN

This was the first project on which we were able to use our revised retrospective process. It was an opportunity to live our values around communication, accountability, community, personal development, and relationships.

It was truly an eye-opening, valuable learning experience. A structured retrospective process provides the space for the team to discuss and understand different perspectives and experiences, and to share learnings that contribute to personal and team growth. From the Retrospective Prime Directive:

At the end of a project everyone knows so much more. Naturally we will discover decisions and actions we wish we could do over.

This is wisdom to be celebrated, not judgement used to embarrass.

If you haven’t yet committed to this process — start on your next project. Here are a couple resources to get you going…

  1. 5 Steps To Better Agile Retrospectives
  2. Trello Agile Series: Roadmaps & Retrospectives

PARTNERSHIPS

Our team is appreciative of the opportunity to work with this special group of ABMI professionals and their best-in-class data. We share the excitement of today's official launch and this new way for users to interact with and explore ABMI's vast datasets.

Go explore the mapping portal, learn something new about a dataset in your world, make a custom map and share it with someone else. There's even a built-in share tool, so it is literally one click.



Callbacks — A must-have tool for data scientists

Collaborate on and maintain data scientific code like a “PRO”grammer.

From a data science perspective, callbacks are a great pattern to reduce redundancy in code.

But that's not how I was introduced to them. For me, callback brings to mind JavaScript, which is where I first heard the word.

Callbacks in JavaScript are gently introduced like this:

$("button").click(function(){
$("p").hide("slow", function(){
alert("The paragraph is now hidden");
});
});

which quickly turns into what is known as ‘callback hell’. Google that at your own risk.

For this article, if it's rolling around your grey matter, get JavaScript out of your mind! Not all callbacks are asynchronous or have the (error, result) function signature.

So let’s define callbacks more generally:

a function that is passed as an argument to another function and executed later.

I used callbacks for years and never knew about the term ‘callback’.

R and Python (pandas) have a common use for callbacks that you’ve probably used before — the apply function.

Python

import numpy as np
import pandas as pd

d = pd.DataFrame({'a': range(1, 11), 'b': range(11, 21)})
d.apply(np.mean)
a     5.5
b    15.5
dtype: float64

R

l <- list(a = 1:10, b = 11:20)
sapply(l, mean)
   a    b
 5.5 15.5

When would you want to implement this pattern in your work?

Consider this motivating example: without apply, you'd have to write a new function every time you wanted to run a different operation over a dimension of a data frame.

For example (simplified):

def my_apply_mean(df):
    res = []
    for i in range(len(df)):
        res.append(df.iloc[i].mean())    # row-wise mean
    return res

def my_apply_median(df):
    res = []
    for i in range(len(df)):
        res.append(df.iloc[i].median())  # row-wise median
    return res

Instead, apply accepts an arbitrary function as an argument and calls it on the data frame elements, like so:

def my_apply(df, func):
    res = []
    for i in range(len(df)):
        res.append(func(df.iloc[i]))     # e.g. my_apply(d, np.mean) or my_apply(d, np.median)
    return res

The symptom to watch out for here is several functions with repeated structure and very little difference, except for a call to a different function at the same place within that structure.

An alarm bell is multiple function calls with similar signatures inside another function.

def simulate(years, **kwargs):
    result = []
    for year in range(years):
        property_a = estimate_property_a(year, kwargs['other_property_a'])
        property_b = estimate_property_b(year, kwargs['input_property_b'])
        property_c = estimate_property_c(year, kwargs['some_property_c'])
        # repeated several more times...
        result.append({
            'property_a': property_a,
            'property_b': property_b,
            'property_c': property_c,
            # repeated several more times
        })
    return result

That's not verbatim, but you get the idea. Refactored to accept the estimators as callbacks, it might look something like this:

def do_yearly(years, funcs, **kwargs):
    output = []
    for year in range(years):
        for fun in funcs:
            result = fun(year, ???)  # the tricky part (see below)
            output.append(result)
    return output

def simulate(years, **kwargs):
    return do_yearly(years,
                     funcs=[estimate_property_a,
                            estimate_property_b,
                            estimate_property_c],
                     **kwargs)

The tricky part is getting the other arguments into each of the property estimators.

If each property estimator takes arbitrary keyword arguments, and there aren't conflicts, then the example above is sufficient. Instead of kwargs, another option is to pass a class instance that holds the required state. There are other design patterns that could be used, such as action/executor. At this stage, that looks like overkill to me.
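For the class-instance option, one possible shape (hypothetical names, and very much a sketch, since as noted below this refactor hasn't actually been done) is:

class SimulationInputs:
    """Hypothetical container for the state the estimators need."""
    def __init__(self, other_property_a, input_property_b, some_property_c):
        self.other_property_a = other_property_a
        self.input_property_b = input_property_b
        self.some_property_c = some_property_c

# Every estimator shares one signature: (year, inputs).
def estimate_property_a(year, inputs):
    return inputs.other_property_a * year    # placeholder calculation

def estimate_property_b(year, inputs):
    return inputs.input_property_b + year    # placeholder calculation

def do_yearly(years, funcs, inputs):
    return [{fun.__name__: fun(year, inputs) for fun in funcs}
            for year in range(years)]

results = do_yearly(3, [estimate_property_a, estimate_property_b],
                    SimulationInputs(2.0, 10.0, 0.5))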

At this point, this is all reflection as I haven’t done this refactor. If something has worked well for you in this situation, let me know!

Thanks to Spencer Cox and Lori Logan for improving this post. Special thanks to Yves Richard for the spirited discussion about callbacks, async, and function composition.

