The Best Practices for Managing Your Data Lakes, According to This NYC Company

Written by Taylor Karg
Published on Dec. 15, 2020

As data continues to grow and evolve — estimates of all of the world’s data are in the billions of terabytes —  the need to store both relational and non-relational data from sources such as mobile apps, IoT devices and social media has also grown and evolved. 

But storing that data can get messy. Hundreds of terabytes of raw data in a data lake can feel insurmountable to sort through, and if data isn’t organized, it can hurt the bottom line: According to a survey conducted by Aberdeen, companies that optimize both data warehouses and data lakes outperform similar companies that don’t by nine percent.

Built In NYC caught up with Matthew Meen, senior engineering manager at Crossix, to learn how the healthtech company optimizes its data lake.

 

Matthew Meen
Senior Engineering Manager • Veeva

Crossix is a health-focused tech company that helps clients advance their marketing techniques through analytics, planning, targeting, optimization and measurement solutions. In order to better organize and manage the data sent by publishers and partners, Crossix has begun the process of building its own processing tools, Senior Engineering Manager Matthew Meen said.  

 

Where does the data in your company’s data lake come from and what technology do you use to store it? 

We currently have approximately 150 terabytes of non-health data in our managed data lake, with a majority of it being ad exposures. This data comes from a wide variety of sources including tagged media, cable TV, over-the-top (OTT) video, site trafficking, ad servers and direct publisher feeds. The largest single dataset is around 45 billion records.

Inbound data is either dropped into S3 by partners or routed there via AWS Transfer, AWS SES and direct API integrations. Processed data follows a ‘lake-house’ topology, with the ‘hottest’ data (around 50 terabytes) residing in Redshift cluster storage and the remainder offloaded to S3 as partitioned Parquet datasets, which are accessible via Spectrum and EMR/Spark.
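For illustration, here is a minimal sketch of how partitioned Parquet data offloaded to S3 is typically read back in an EMR/Spark job. The bucket path, dataset and column names are hypothetical, not Crossix’s actual schema.

```python
# Minimal sketch: reading offloaded, partitioned Parquet from S3 in a Spark job.
# The path and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offloaded-exposure-read").getOrCreate()

# Colder data lives in S3 as partitioned Parquet; partition pruning means only
# the date ranges the analysis needs are actually scanned.
exposures = spark.read.parquet("s3://example-data-lake/ad_exposures/")  # hypothetical path

recent = exposures.where("exposure_date >= '2020-11-01'")  # hypothetical partition column
recent.groupBy("source").count().show()
```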
 

“We’re productizing a data request layer which will make it simpler for data consumers to pull everything that’s relevant to their analysis.”


How does your team manage, organize and extract business value from all that unstructured data?

As we move toward a cookieless future, we are seeing a large uptick in the number of publishers and other partners who send us data directly, rather than via a tagging partner. The increased number of integrations has made it more attractive to begin building our own processing tools that dynamically generate ETL/ELT DAGs (for Airflow to execute on EMR) based upon a metadata model, rather than have data engineers use off-the-shelf tools to build each pipeline individually.
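As a rough illustration of that metadata-driven pattern, the sketch below generates one Airflow DAG per feed from a small metadata dictionary. The feed definitions, step logic and Airflow 2-style imports are assumptions for the example, not Crossix’s actual model.

```python
# Minimal sketch of metadata-driven DAG generation; feeds and step logic are
# hypothetical placeholders, and real pipelines would submit EMR/Spark steps.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical metadata model; in practice this might come from a database or config store.
FEEDS = {
    "publisher_feed_a": {"schedule": "@daily", "steps": ["extract", "transform", "load"]},
    "ott_video_feed": {"schedule": "@hourly", "steps": ["extract", "load"]},
}

def run_step(feed, step, **_):
    # Placeholder task body: a real implementation might submit an EMR step here.
    print(f"Running {step} for {feed}")

# Generate one DAG per feed from the metadata, rather than hand-writing each pipeline.
for feed_name, meta in FEEDS.items():
    dag = DAG(
        dag_id=f"etl_{feed_name}",
        schedule_interval=meta["schedule"],
        start_date=datetime(2020, 1, 1),
        catchup=False,
    )
    previous = None
    for step in meta["steps"]:
        task = PythonOperator(
            task_id=step,
            python_callable=run_step,
            op_kwargs={"feed": feed_name, "step": step},
            dag=dag,
        )
        if previous:
            previous >> task
        previous = task
    globals()[dag.dag_id] = dag  # register the generated DAG with Airflow
```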

 

More on Data Lakes: Data Lake vs. Data Warehouse: Will the Already-Blurred Line Between Them Disappear?

 

The volume and complexity of data an analyst or data scientist needs to pull before they can even start their work is also increasing. To address this, we’re productizing a data request layer, which will make it simpler for data consumers to pull everything relevant to their analysis. The relevant data is determined using the same metadata model used during ingestion, through a single API call integrated into their existing tools to produce privacy-safe health insights.
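Purely as an illustration of what such a single-call request layer can look like from the data consumer’s side, here is a hypothetical sketch; the endpoint, parameters and response shape are invented for the example and are not Crossix’s actual API.

```python
# Hypothetical data-request call; endpoint, fields and response keys are illustrative only.
import requests

resp = requests.post(
    "https://example.internal/api/data-requests",  # hypothetical endpoint
    json={
        "analysis_id": "campaign-123",  # hypothetical parameters
        "sources": ["tagged_media", "ott_video"],
        "date_range": {"start": "2020-01-01", "end": "2020-06-30"},
    },
    timeout=30,
)
resp.raise_for_status()
datasets = resp.json()["datasets"]  # locations of the extracts relevant to the analysis
```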

Responses have been edited for length and clarity. Headshot provided by Crossix. Header image by spainter_vfx from Shutterstock.