Reality can turn on its head in an instant. If 2020 proved anything, it’s that circumstances change quickly, and businesses increasingly need access to clean data in real time. And for companies that use machine learning, bad data in a pipeline is like sugar in the gas tank of an otherwise well-oiled machine.
The International Data Corporation forecast that more than 59 zettabytes of data would be generated globally in 2020. Meanwhile, Forbes estimated that 2.5 quintillion bytes of data are created each day. Due to this sheer volume, data environments are becoming more complex to manage, yet organizations need higher data velocity and reliability to power machine learning initiatives.
And the stakes are high. Gartner research has found that organizations believe poor data quality to be responsible for an average of $15 million per year in losses. How can organizations ensure they have clean data, quickly?
Enter DataOps. This discipline borrows Agile principles to align data scientists and analysts with developers and operations teams. The result is improved data flow throughout an organization, and ultimately, better outcomes for machine learning initiatives. The DataOps methodology is here to stay, and it is expected to transform big data in the same manner that DevOps transformed software development. Companies that streamline pipelines now are set up for success as data continues to explode.
Mulberry is a New York City-based company building consumer confidence in e-commerce through embedded product protection plans. Built In checked in with Mulberry to understand how the company successfully developed and implemented a DataOps strategy, and how that strategy has helped support their machine learning efforts.
What first prompted your team to adopt a DataOps strategy for machine learning? What were you hoping to gain or improve?
We generate and maintain a lot of raw data that powers our machine learning algorithms. All this data has to live somewhere, and we store ours across a variety of locations, such as relational databases (for structured data); NoSQL databases (for non-relational or graph-type data); and object storage (for semi-structured and unstructured data). And that’s only the raw data!
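To make that layout concrete, here is a minimal sketch of what pulling raw data from those three kinds of stores might look like. The connection details, table, collection and bucket names below are purely illustrative assumptions, not Mulberry’s actual setup.

```python
import boto3                      # object storage (S3)
import psycopg2                   # relational database (Postgres)
from pymongo import MongoClient   # NoSQL / document store

# Structured data: e.g. rows in a relational table (hypothetical table and credentials)
pg = psycopg2.connect(host="db.internal", dbname="warehouse", user="etl", password="...")
with pg.cursor() as cur:
    cur.execute("SELECT id, plan_type, price FROM protection_plans LIMIT 100")
    structured_rows = cur.fetchall()

# Non-relational data: e.g. event documents in a NoSQL store (hypothetical collection)
mongo = MongoClient("mongodb://nosql.internal:27017")
event_docs = list(mongo["analytics"]["click_events"].find(limit=100))

# Semi-structured and unstructured data: e.g. raw JSON logs in object storage (hypothetical bucket/key)
s3 = boto3.client("s3")
raw_log = s3.get_object(Bucket="raw-data-lake", Key="logs/2020/11/01/events.json")["Body"].read()
```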
Since machine learning algorithms are only as good as the data that feeds them, we have to convert all that raw data into something clean and usable that our machine learning systems can consume. Our original ETL process was slow, inefficient, ad hoc and generally inconvenient. On top of that, the raw data kept growing, and it was becoming increasingly daunting to turn it into analytic results or machine learning solutions. As the time needed to prep data for machine learning model training ballooned (and the appetite for machine learning-driven automation within the business grew), we had to look for ways to level up our overall data strategy.
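The conversion step itself can be as simple as a transform function that drops bad records, normalizes types and derives model-ready features. The sketch below is a generic illustration of that idea in pandas; the column names, file paths and feature choices are assumptions made for the example, not Mulberry’s actual pipeline.

```python
import pandas as pd

def transform_raw_claims(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn raw records into a clean, model-ready feature table (illustrative schema)."""
    df = raw.copy()
    # Clean: drop rows missing the fields the model depends on
    df = df.dropna(subset=["purchase_price", "product_category", "claim_filed"])
    # Conform types and normalize categorical values
    df["purchase_price"] = df["purchase_price"].astype(float)
    df["product_category"] = df["product_category"].str.lower().str.strip()
    # Derive features: one-hot encode the category, keep the label for training
    features = pd.get_dummies(df[["purchase_price", "product_category"]],
                              columns=["product_category"])
    features["label"] = df["claim_filed"].astype(int)
    return features

# Extract from raw storage, transform, then load the result somewhere models can reach it
raw = pd.read_parquet("raw_claims.parquet")            # hypothetical raw extract
transform_raw_claims(raw).to_parquet("claims_features.parquet")
```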
“We had to design a data strategy that would solve for today and also provide scalability for the future.”
What were the key considerations or steps you took to implement a DataOps model?
When we started the company, I had to consider whether DataOps would be overkill for the amount of data we expected to generate. Although we didn’t anticipate having “big data” within the first 12 months or so, we designed our applications, data pipelines, tooling and instrumentation with big data in mind. Despite the forethought, we were still surprised by the volume, velocity, and variety of data that we ultimately ended up having to work with. We had to design a data strategy that would solve for today and also provide scalability for the future.
Here are some considerations we kept top-of-mind with the new strategy:
- In the same way DevOps increases feature velocity, we wanted DataOps to increase our data consumption velocity.
- We treated DataOps like a factory: With lots of people and tools, how do we solve for high throughput and low error rates?
- In the execution step, it is critical to get the ETL and data pipelines right (a minimal sketch of the kind of automated quality gate this implies follows this list).
- The ultimate outcome is extreme automation of how data moves and gets prepared.
- Our definition of success was simple: Has data throughput increased, and has it provided a meaningful lift to data consumption? (e.g., Can we train more machine learning models in less time?)
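One concrete way to read the “factory” framing of throughput and error rates is as automated quality gates between pipeline steps. The sketch below is a generic example of such a gate, with a hypothetical threshold and columns rather than Mulberry’s actual rules: clean rows flow through automatically, and a batch with too many bad rows is rejected before it can reach model training.

```python
import pandas as pd

MAX_ERROR_RATE = 0.01  # hypothetical threshold: reject a batch if more than 1% of rows are bad

def validate_batch(batch: pd.DataFrame) -> pd.DataFrame:
    """Quality gate between pipeline steps: pass clean rows through, halt on high error rates."""
    bad = (
        batch["purchase_price"].isna()
        | (batch["purchase_price"] < 0)
        | batch["product_category"].isna()
    )
    error_rate = bad.mean()
    if error_rate > MAX_ERROR_RATE:
        # Fail loudly instead of letting bad data flow downstream into model training
        raise ValueError(f"Batch rejected: {error_rate:.2%} of rows failed validation")
    return batch[~bad]
```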
What’s the biggest efficiency or benefit your team has seen as a result of adopting DataOps?
The purpose of DataOps is to increase the availability of clean, readily consumable data. More specifically, it is to increase analytic and data consumption velocity, and ultimately to deliver faster outcomes for consumers of latent insights and AI-driven automation. Since a big subset of our data is now properly pipelined and available at the push of a button, we’ve seen an appreciable impact on our machine learning model development timelines, especially in the places where we had stagnated under the colossal amount of raw data and the lack of a scalable, long-term data strategy. In the end, building data ecosystems is hard, but they are well worth investing in.