What’s the Impact of Dirty Data?

Three NYC data experts weigh in on the importance of data observability for a pristine data flow.

Written by Tyler Holmes
Published on Nov. 23, 2021

Lake Erie has been a major pollution problem for the Midwest since the 1960s. Plagued by a heavy industrial presence and the staggering 11.6 million people living around its shores, it has been one of the region's most difficult bodies of water to keep clean. The reason? Even with cleanup efforts in effect throughout the Great Lakes Basin, tributaries like the Cuyahoga River in Ohio and the Detroit River in Michigan keep pollutants flowing in, undoing much of that progress.

While the tech industry may not depend on waterways like Lake Erie, companies are becoming increasingly reliant on centralized repositories of raw data, or “data lakes,” to keep complex tech stacks running. And like the Great Lakes, data lakes can easily become contaminated without preventative measures in place to monitor the sources that feed into them. But how can data scientists and engineers keep an eye on so much data while simultaneously warding off obstructions and much-dreaded downtime?

It starts with data observability.

By implementing automated data monitoring and system alerts that fire when an issue arises, teams are better prepared to combat data flow errors while building reliability into their processes. That’s why Built In NYC caught up with three data experts to learn more about their data observability best practices, the biggest challenges they’ve been up against recently and all the tools helping them keep their data so fresh and clean.

 

Nick Hardy
Director of Data • Ro

 

What is one of the most critical best practices your team follows when it comes to data observability, and why?

Like many organizations, Ro consolidates data from a variety of upstream sources into a unified data lake that our teams rely on for reporting, analysis, model building and more. The last thing we want is to discover that we have a pipeline issue because a dashboard is broken or an automated report returned nonsensical results. We’ve relied on a few key principles to avoid these situations.

First, we’ve endeavored to instrument monitoring and logging on every step of our pipeline. That way, if we run into issues, we can pinpoint exactly where things are breaking and remediate them more efficiently. It’s crucial to have a robust picture of your pipeline so you can quickly get to the “why” in those moments.

For anything automated, we also implement “completeness” checks that prevent faulty outputs from being displayed when there’s an upstream issue. This practice has opened up opportunities for us to create a more comprehensive shared understanding of our pipeline mechanics across teams.

Lastly, we’re committed to iterative improvement. If something breaks, we will review it, identify the underlying issues, improve our approach and extend those improvements across the system.
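To make the idea concrete, here is a minimal Python sketch of the kind of “completeness” gate described above; the thresholds, field names and alerting step are hypothetical illustrations, not Ro’s actual implementation.

```python
# Hypothetical sketch of a "completeness" gate: block a downstream refresh
# when an upstream load looks incomplete, instead of publishing bad numbers.
from dataclasses import dataclass


@dataclass
class CompletenessResult:
    passed: bool
    reason: str = ""


def check_completeness(row_count: int, expected_min_rows: int,
                       null_rate: float, max_null_rate: float = 0.05) -> CompletenessResult:
    """Return a pass/fail verdict for an upstream load."""
    if row_count < expected_min_rows:
        return CompletenessResult(False, f"only {row_count} rows, expected >= {expected_min_rows}")
    if null_rate > max_null_rate:
        return CompletenessResult(False, f"null rate {null_rate:.1%} exceeds {max_null_rate:.1%}")
    return CompletenessResult(True)


# Example: skip the dashboard refresh and alert instead of displaying nonsense.
result = check_completeness(row_count=12, expected_min_rows=10_000, null_rate=0.02)
if not result.passed:
    print(f"Blocking refresh: {result.reason}")  # in practice, page on-call or post to a channel
```

The point of a gate like this is that the pipeline fails loudly at the step where the data went wrong, rather than letting a broken dashboard be the first signal.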

 

What are some of the tools your team is using to streamline and automate data observability and monitoring? And what made you decide to use these tools over other options on the market?

As a healthcare company, everything that we build is built through the lens of delivering the safest, highest quality care and best experience for our patients. As a result, there are many instances where we’ve decided to build our own solutions to address some of our unique — but absolutely critical — patient safety and care quality problems.

The most notable of these is our medical alerting framework, which was built to support our amazing team of providers and pharmacists. This framework does two things: It checks that the data our providers use is always complete and up-to-date, and it runs a series of checks to ensure we’re treating a patient with the highest quality care, such as alerting our team to allergies or drug-to-drug interactions when our pharmacy dispenses a prescription.
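Purely as a toy illustration of how a rule-based safety check works in the abstract (the drug pairs, inputs and function below are invented for this example and are not Ro’s medical alerting framework), such an alert might be sketched like this:

```python
# Toy illustration of a rule-based safety alert. The drug pairs, inputs and
# function below are invented for this example and are NOT Ro's framework.
KNOWN_INTERACTIONS = {
    frozenset({"warfarin", "ibuprofen"}),        # anticoagulant + NSAID
    frozenset({"sildenafil", "nitroglycerin"}),  # PDE5 inhibitor + nitrate
}


def interaction_alerts(prescribed: str, current_medications: list[str],
                       allergies: list[str]) -> list[str]:
    """Return human-readable alerts for a pharmacist to review before dispensing."""
    alerts = []
    if prescribed in allergies:
        alerts.append(f"Patient has a recorded allergy to {prescribed}")
    for med in current_medications:
        if frozenset({prescribed, med}) in KNOWN_INTERACTIONS:
            alerts.append(f"Possible interaction: {prescribed} + {med}")
    return alerts


print(interaction_alerts("ibuprofen", ["warfarin"], allergies=[]))
# ['Possible interaction: ibuprofen + warfarin']
```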

We’re extremely proud of this work, as it augments the deep expertise of our providers and pharmacists in service of keeping our patients safe and healthy. It also allows us to gracefully scale that high-quality care to as many patients as possible.

The last thing we want is to discover that we have a pipeline issue because an automated report returned nonsensical results.”

 

What’s the biggest challenge your team has faced with data observability?

The hardest part of any monitoring and observability exercise is typically finding the right balance between signal and noise. You want to catch issues quickly, but at the same time, you need to be wary of alert fatigue. We’ve found success combating it through a few avenues.

In instances where we’re relying on thresholds or more rigid heuristics to sound alarms, we’re constantly revisiting these rules and tweaking them as our business evolves.

Where possible, we’re building in dynamic anomaly detection over these sorts of heuristics. This can be a challenging endeavor to take on in-house, both in terms of having the necessary data to build a good model, and the expertise to do so. But many of the third-party solutions in this space offer off-the-shelf anomaly detection as an excellent starting point.

We’ve found dimensionality to be an excellent lever for combating noise. While it’s always important to measure and monitor your top- and bottom-line numbers, breaking things down into their atomic units is often a very effective strategy for isolating the real issues.
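For illustration, here is a rough Python sketch contrasting a rigid threshold with a simple rolling z-score check, computed per dimension; the metric names, cutoff and sample data are assumptions, and a production system would more likely lean on an off-the-shelf anomaly detector, as noted above.

```python
# Illustrative sketch: a static threshold vs. a simple rolling z-score check,
# computed per dimension (e.g. per region) so real issues aren't averaged away.
import statistics

STATIC_MAX_ERROR_RATE = 0.10  # a rigid heuristic that needs periodic re-tuning


def static_alert(error_rate: float) -> bool:
    return error_rate > STATIC_MAX_ERROR_RATE


def zscore_alert(history: list[float], today: float, cutoff: float = 3.0) -> bool:
    """Flag today's value if it sits far outside the recent distribution."""
    if len(history) < 5:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1e-9  # guard against a perfectly flat history
    return abs(today - mean) / stdev > cutoff


# Hypothetical per-region error rates; the last entry in each list is "today."
daily_error_rates = {
    "us-east": [0.02, 0.03, 0.02, 0.02, 0.03, 0.02, 0.02],
    "eu-west": [0.02, 0.02, 0.02, 0.03, 0.02, 0.02, 0.09],
}
for region, series in daily_error_rates.items():
    today = series[-1]
    if static_alert(today) or zscore_alert(series[:-1], today):
        print(f"Investigate {region}: error rate {today:.1%}")
```

In this toy example the eu-west spike stays under the static threshold but is caught by the dynamic check, and slicing by region pinpoints where to look.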

 

 

Ashish Singh
Data Engineering Manager • Tipico - North America

 

What is one of the most critical best practices your team follows when it comes to data observability, and why?

Because we cater to online betting customers, having a 360-degree view of customer data is an utmost priority for the company. Every bet leaves a footprint across our entire ecosystem, generating information such as trends, history and logs.

Maintaining this information securely is the team’s primary focus. We achieve this by adopting industry-standard tools and treating every piece of information coming into the system with integrity. Passive monitoring, clear team communication and secure data transfer take precedence over other priorities in the company.

 

What are some of the tools your team is using to streamline and automate data observability and monitoring? And what made you decide to use these tools over other options on the market?

We use a variety of tools in our data platform: NiFi, Talend, Redshift, Airflow and Domo, to name a few. We chose this stack primarily for its flexibility, scalability and capability to create dynamic workflows.

Our business runs 24 hours a day, seven days a week, which leaves us almost no time to stop and troubleshoot.”

 

What’s the biggest challenge your team has faced with data observability?

One of our biggest challenges is avoiding network delays and data loss during business hours. Our business runs 24 hours a day, seven days a week, which leaves us almost no time to stop and troubleshoot. To overcome this, our in-place reconciliation process catches data anomalies at the earliest stages of ingestion, and our alerting system helps us address them in time to avoid further impact.
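As a loose illustration of a reconciliation check of this kind (the counts, tolerance and alerting step here are hypothetical, not Tipico’s pipeline), the core idea can be sketched in a few lines of Python:

```python
# Hypothetical sketch of a source-vs-destination reconciliation check that
# flags possible data loss shortly after ingestion, before downstream jobs run.
def reconcile(source_count: int, destination_count: int,
              tolerance: float = 0.001) -> bool:
    """Return True when the row counts match within the allowed tolerance."""
    if source_count == 0:
        return destination_count == 0
    drift = abs(source_count - destination_count) / source_count
    return drift <= tolerance


# Example numbers are invented; a real check would query both systems.
if not reconcile(source_count=1_000_000, destination_count=998_200):
    # In practice this would page an on-call engineer or post to an alert channel.
    print("Reconciliation failed: possible data loss between source and warehouse")
```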

 

 

Romain Debordeaux
Director of Analytics • TheGuarantors

 

What is one of the most critical best practices your team follows when it comes to data observability, and why?

Our goal is to make our processes as transparent as possible, not only to the data team but to everyone at TheGuarantors, so we have implemented a few tools to help us get there.

First, we offer a data catalogue for anyone using our self-serve reporting as a resource to better understand where data comes from, where it is used and how features are designed. Second, all of our job logs are monitored in a public Slack channel, which allows the entire organization to know exactly when a process is down. Finally, we rely heavily on tests in our transformation jobs to verify that the ingested data has not changed since the last run; we mainly look for unexpected schema and accepted-value changes.
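As a rough illustration of those two checks (the column names and accepted values below are invented, and dbt users would normally declare such tests in the tool’s own configuration rather than hand-rolling them), a Python sketch might look like this:

```python
# Illustrative sketch of the two checks described above: verify that the
# ingested schema has not drifted and that a column only holds accepted values.
EXPECTED_SCHEMA = {"lease_id": "string", "status": "string", "premium": "double"}
ACCEPTED_STATUSES = {"active", "pending", "cancelled"}


def check_schema(actual_schema: dict[str, str]) -> list[str]:
    """Return a list of schema problems (missing columns or changed types)."""
    problems = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        actual_type = actual_schema.get(column)
        if actual_type is None:
            problems.append(f"missing column: {column}")
        elif actual_type != expected_type:
            problems.append(f"{column}: expected {expected_type}, got {actual_type}")
    return problems


def check_accepted_values(observed_statuses: set[str]) -> set[str]:
    """Return any values that fall outside the accepted set."""
    return observed_statuses - ACCEPTED_STATUSES


# Example run against a hypothetical freshly ingested batch.
problems = check_schema({"lease_id": "string", "status": "string", "premium": "string"})
unexpected = check_accepted_values({"active", "pending", "refunded"})
if problems or unexpected:
    print("Failing the transformation job:", problems, unexpected)
```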

 

What are some of the tools your team is using to streamline and automate data observability and monitoring? And what made you decide to use these tools over other options on the market?

Our transformation tools are dbt and Databricks; every pipeline built on those platforms contains some sort of test that ensures we are only processing accepted values. Our data catalogue is Castor, a startup operating out of France with a good track record of feature releases and a responsive team. One of the benefits of Castor is that it allows all users to improve field descriptions; as their team put it to us: “There is rarely a problem of too much documentation.”

While not all of the tools I mentioned were initially designed for data observability, all of them have developed some strong features that provide the team with a holistic picture.

Our goal is to make our processes as transparent as possible, not only to the data team but to everyone at TheGuarantors.”

 

What’s the biggest challenge your team has faced with data observability?

Being consistent with our documentation and making sure all projects that go into production include adequate monitoring tests and have received proper unit testing. 

As our team has grown over the past year, we have added third-party tools to assist with this process. The goal is to give our engineers easily implemented solutions that do not distract from the core project objectives.

 

Responses have been edited for length and clarity. Photography provided by associated companies and Shutterstock.