Kang Cao, a software and data engineer at home insurance marketplace Young Alfred, has an important message for budding data engineers: “Understand that not all data is equal.”
As businesses become increasingly data-driven, data engineers are an integral part of their success. According to the 2020 Tech Job Report by DICE, “data engineer” was the fastest-growing role of 2019, with 50 percent year-over-year growth. This trend signals that more businesses in industries outside of traditional tech and finance are creating their own data engineering positions to unlock the power of their data.
But according to Cao, industries new to data engineering often harbor disorganized datasets.
Cao likened working with sparse or unmined data in rapidly changing industries like home insurance to “discovering a new world.”
Below, Cao told Built In NYC about his data mining process and how he builds models from nontraditional datasets to produce new tools, such as the recently launched home insurance calculator.
Kang Cao, describe a typical day for you.
When working on a data science project, we adapt the general data mining methodology CRISP-DM (cross-industry standard process for data mining) into a spiral development process. The six stages of that process are as follows: business understanding, data understanding, data preparation, modeling, evaluation and deployment.
After defining the scope and brainstorming, data engineers need to collect data for exploration. Data sources are varied and scattered across countless corners of production environments, to say nothing of metadata. Engineers can then work on data integration, including data standardization, data categorization and persistence. Standardizing data is hard: decisions about table naming, label categorization, table usage and whether to load data incrementally may all require assistance from business intelligence. Once we have figured out what our data warehouse looks like, it becomes straightforward to transform and serve data to an application.
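To make the standardization step concrete, here is a minimal Python sketch that maps a source feed with its own naming conventions onto one canonical schema and collapses free-text labels into categories. The column names, mappings and the standardize helper are hypothetical, not Young Alfred's actual warehouse code.

```python
# Minimal sketch of a standardization step: raw records from a scattered
# source are renamed onto one canonical schema and labels are collapsed
# into a fixed category set. All names here are illustrative.
import pandas as pd

CANONICAL_COLUMNS = ["property_id", "state", "roof_type", "year_built"]

ROOF_CATEGORIES = {  # label categorization: collapse free-text variants
    "asphalt shingle": "asphalt",
    "asphalt": "asphalt",
    "metal roof": "metal",
    "metal": "metal",
}

def standardize(raw: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    """Rename source columns to the canonical schema and clean labels."""
    df = raw.rename(columns=column_map)
    df["roof_type"] = (
        df["roof_type"].str.strip().str.lower().map(ROOF_CATEGORIES).fillna("other")
    )
    return df[CANONICAL_COLUMNS]

# One source with its own naming conventions, unified before persistence.
source_a = pd.DataFrame(
    {"prop_id": [1], "st": ["NY"], "roof": ["Asphalt Shingle"], "built": [1999]}
)
print(standardize(
    source_a,
    {"prop_id": "property_id", "st": "state", "roof": "roof_type", "built": "year_built"},
))
```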
From there, we ask ourselves whether we should provide a RESTful API or a Kafka stream. The choice stays flexible based on the needs of our analyst teammates. When building the platform, we value performance, clean code and well-crafted queries.
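As a rough illustration of that flexibility, the sketch below serves the same stand-in warehouse read two ways: a pull-style Flask endpoint and a push-style Kafka producer. The endpoint path, topic name, broker address and fetch_quotes helper are all assumptions for the example.

```python
# Sketch: one warehouse read exposed either as a RESTful API or a Kafka
# stream, depending on what downstream analysts need. All names are made up.
import json
from flask import Flask, jsonify
from kafka import KafkaProducer

def fetch_quotes(zip_code: str) -> list:
    # Stand-in for the real warehouse query.
    return [{"zip": zip_code, "premium": 1450.0}]

# Option 1: pull-style RESTful API for ad hoc lookups.
app = Flask(__name__)

@app.route("/quotes/<zip_code>")
def quotes(zip_code):
    return jsonify(fetch_quotes(zip_code))

# Option 2: push-style Kafka stream for continuous consumers.
def publish_quotes(zip_code: str) -> None:
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for row in fetch_quotes(zip_code):
        producer.send("quotes", value=row)
    producer.flush()
```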
When it comes to deployment, engineers have an important role in converting the model into a product. We might need to package the model into a binary when the research and production environments use different languages. If the model does not itself offer a way to extract and process data at scale, we might need extra layers, such as wrapping Python functions in a cron or streaming job backed by a cache. In most cases, monitoring mechanisms are required to locate problems overlooked in the development stage and to detect covariate shift, meaning any change in the distribution of the running model's source data.
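One standard way to build such a monitor, sketched here with invented feature values and an assumed alert threshold, is a two-sample Kolmogorov-Smirnov test comparing a live input feature against its training-time snapshot.

```python
# Sketch of a covariate-shift check: flag a feature whose live distribution
# has drifted from the training snapshot. Threshold and data are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def covariate_shift_alert(train_values, live_values, alpha: float = 0.01) -> bool:
    """Return True when the live feature distribution has drifted."""
    _stat, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha

rng = np.random.default_rng(0)
train = rng.normal(loc=1990, scale=20, size=5000)  # e.g. year_built at training time
live = rng.normal(loc=1975, scale=20, size=5000)   # live traffic skews toward older homes
print(covariate_shift_alert(train, live))  # True: the distribution has shifted
```

In production, a check like this would run on a schedule over each model input, surfacing drift before it silently degrades predictions.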
How Young Alfred Uses Data
Tell us about a project you’re working on right now that you’re excited about.
Last month, we released Young Alfred’s home insurance calculator to estimate a home insurance premium from a limited number of inputs. A comprehensive home insurance calculation requires over 100 factors, but people often don’t want to enter 100 pieces of information just to get a ballpark figure. Adding to the complexity, those 100 factors can vary over time and from one ZIP code to another. There are more than 42,000 ZIP codes in America, and each one can have its own custom calculator for estimating home insurance premiums.
Rather than starting with a clustering algorithm, we chose a simpler model as a baseline that we improve on over time. We built a simple model framework with 42,000 sets of model coefficients.
We settled on a model that still performs when it is not guaranteed to receive all 10 input factors, while minimizing squared error. While we evaluated over a dozen types of models to arrive at our first implementation, it’s fun to know that the one we implemented was grounded in simplicity and still leaves room for improvement. As a starting point, it is still more powerful than anything else out there.
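A minimal sketch of that kind of framework, with invented coefficients and factor names: one set of linear coefficients per ZIP code, and a ZIP-level mean imputed for any factor the user leaves out, so the estimate degrades gracefully instead of failing.

```python
# Toy version of a per-ZIP linear framework: each ZIP code gets its own
# coefficients, and missing inputs fall back to that ZIP's factor means.
# Coefficients, means and factor names are made up for illustration.
COEFFS = {
    "10001": {"intercept": 600.0, "square_feet": 0.35, "year_built": -0.10},
    # ... one entry per ZIP code, roughly 42,000 in total
}
FACTOR_MEANS = {
    "10001": {"square_feet": 1800.0, "year_built": 1965.0},
}

def estimate_premium(zip_code: str, factors: dict) -> float:
    """Estimate an annual premium from whatever inputs the user provided."""
    model = COEFFS[zip_code]
    means = FACTOR_MEANS[zip_code]
    premium = model["intercept"]
    for name, coef in model.items():
        if name == "intercept":
            continue
        # Impute the ZIP-level mean when the caller omitted a factor.
        premium += coef * factors.get(name, means[name])
    return premium

print(estimate_premium("10001", {"square_feet": 2200}))  # year_built imputed
```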
“Drawing insights and predictions from this unmined data is similar to discovering a whole new world.”
What’s one thing that might surprise people about your role as a data engineer at your company?
Understand that not all data is equal. Datasets are as diverse as people. Most established tech and finance companies work with beautifully cleaned and normalized data because those industries are mature and the value of a basis point in them is so high. However, many industries are filled with disorganized data that is highly categorical and incredibly sparse. Drawing insights and predictions from this unmined data is similar to discovering a whole new world.
More than data engineers, we are data explorers. We handle the data everyone else said was too hard to parse or model, and that is what excites the best data engineers.