4 Data Engineering Pros Unpack Why They Enjoy the Field

Leveraging a variety of technical skills to bring about impactful solutions is all in a day’s work.

Written by Stephen Ostrowski
Published on Feb. 22, 2022

The imprint of one’s work isn’t always visible overnight. But Vikram Thirumalai hasn’t had to wait around for the import of his efforts as a data engineer to become evident.

“I’ve seen the impact of my work since day one,” the Newsela team member said.  

So what does it take to land a role where one’s effect on the business is so immediately apparent? Data engineering pros from four tech companies pointed to adroitness with a variety of languages, tools and more as key to on-the-job success. As Esha Bhide, platform engineering associate at iCapital, said, “A data engineer needs to master a wide array of skills.”

Rafael Pacas, a data engineer at Justworks, noted that a broader, business-focused view also helps in the role.

“In order to effectively execute the job, data engineers must increase their breadth of knowledge about the business domain, so that we become more familiar with the business side of things and interact with more stakeholders across all business functions,” Pacas said. 

Collaboration plays a similar role for data engineers. Jack Tsai, data engineer at Meetup, alluded to that component of a data integration project he’s currently working on. 

“The first stage of the project involves a lot of exploration and discussion,” Tsai said. “I enjoy the process of working with colleagues across the organization and providing data-driven solutions to our team.”

Want to know more about what it’s like to be a data engineer? Below, four professionals provide a glimpse into how they got into the field, the skills they frequently flex, and the stimulating work to which they’re currently applying their expertise and experience. 

 

Esha Bhide
Platform Engineering Associate • iCapital

 

What led you to data engineering?

I started my career working as an implementation engineer for a large-scale enterprise resource planning software provider. At the outset, I realized the importance of reliable, scalable and maintainable systems. Later, when I pursued my master’s degree at NYU, I studied distributed systems, big data and machine learning systems, and data analytics. This fueled my fire to pursue a career as a data engineer. The primary challenge of most modern applications is not computing capacity or network latency but how the underlying data is managed, stored and optimized. Data engineering addresses these issues methodically.

 

How does data engineering differ from more traditional software development, and what are the key technical skills that you use most often during your workday? 

Unlike traditional software engineering where one is expected to master a single skill, data engineering involves wearing many hats and being proficient in multiple disciplines. In order to meaningfully run analytics on an online transaction processing system, you will need to design and implement a data warehouse. You will be required to use workflow management tools such as Airflow and Argo to facilitate extract, transform and load. You will be required to work with various cloud computing services, such as those offered by Amazon Web Services. Strong proficiency in Linux is required — writing Bash scripts, data lookups, and finding and analyzing patterns are some of my daily tasks.
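As a rough illustration of the extract, transform and load pattern Bhide mentions, here is a minimal plain-Python sketch. SQLite stands in for the warehouse, and every table and field name is invented for the example, not taken from iCapital’s systems.

```python
import sqlite3

# Extract: hypothetical rows pulled from a transactional (OLTP) system.
def extract():
    return [
        {"order_id": 1, "amount_cents": 1250, "status": "paid"},
        {"order_id": 2, "amount_cents": 900, "status": "refunded"},
    ]

# Transform: keep only paid orders and convert cents to dollars.
def transform(rows):
    return [
        (r["order_id"], r["amount_cents"] / 100)
        for r in rows
        if r["status"] == "paid"
    ]

# Load: write the cleaned rows into a warehouse-style fact table.
def load(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
```

In practice, an orchestrator such as Airflow would schedule each of these steps as a task and handle retries and dependencies between them.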

 

Data engineering involves wearing many hats and being proficient in multiple disciplines.”

 

Tell me about a project you’re working on right now, including the goal, your approach to the problem and challenges you’ve encountered.

I am currently working on a project in which we are integrating our data warehouse with Salesforce and Metabase, a business intelligence reporting tool. It involves identifying both key data objects and relationship mappings and converting them into Star or Snowflake schemas as prescribed by the data warehouse. We are also building an in-house architecture to reach Salesforce through REST, which will allow us to scale at will and remove dependency on any external tools for connections. The goal of the project is to unify data and make it available to key stakeholders. In my role, I get the opportunity to collaborate with DevOps, application developers and data scientists. It’s interesting and exciting to be at that intersection.
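The object-to-schema mapping Bhide describes can be sketched in miniature: denormalized CRM-style records are split into a deduplicated dimension table and a fact table that references it by surrogate key. The record fields and key scheme below are hypothetical, not iCapital’s actual model.

```python
# Hypothetical denormalized records pulled from a CRM object.
records = [
    {"opportunity_id": "opp-1", "account": "Acme", "region": "US", "amount": 5000},
    {"opportunity_id": "opp-2", "account": "Acme", "region": "US", "amount": 7500},
    {"opportunity_id": "opp-3", "account": "Globex", "region": "EU", "amount": 3000},
]

# Build a dimension of unique accounts and a fact table keyed into it.
dim_account, fact_opportunity = {}, []
for rec in records:
    key = (rec["account"], rec["region"])
    if key not in dim_account:
        dim_account[key] = len(dim_account) + 1  # assign a surrogate key
    fact_opportunity.append({
        "opportunity_id": rec["opportunity_id"],
        "account_key": dim_account[key],
        "amount": rec["amount"],
    })
```

In a star schema, the fact table keeps only measures and foreign keys, while descriptive attributes live once in the dimension.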

 

 

Jack Tsai
Data Engineer • Meetup

 

What led you to a career as a data engineer?

Funny enough, it started with online shopping. I love to find the best price when I shop online, so I wrote my own scraper to collect data from different websites in the days before Google Shopping existed. This side hustle involved collecting structured and unstructured data, cleaning the data, transforming it, and turning it into a report that would alert me to the best price. 

What started as a hobby to find the best deal on AirPods turned into something really satisfying. I believe building and designing data infrastructures can bring value and make an impact on different industries.
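A price-alert pipeline like the one Tsai describes might start with something as small as this sketch, which parses prices out of already-fetched HTML using only Python’s standard library. The markup and class names are invented for the example; a real scraper would first fetch pages over HTTP.

```python
from html.parser import HTMLParser

# Minimal parser that collects prices from hypothetical product pages.
class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Assumes prices are marked up as <span class="price">.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data.strip().lstrip("$")))
            self.in_price = False

def best_price(pages):
    parser = PriceParser()
    for html in pages:
        parser.feed(html)
    return min(parser.prices)

pages = [
    '<span class="price">$129.00</span>',
    '<span class="price">$119.99</span>',
]
```

From here, the pipeline Tsai outlines adds cleaning, transformation and a report that flags the lowest price.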
 

How does data engineering differ from more traditional software development, and what are the key technical skills that you use most often during your workday? 

It is hard to separate data engineering and traditional software development, because there is some overlap between the two fields and it can vary company by company. In my experience, data engineering requires both programming skills and knowledge of complex data processing systems that can handle large amounts of data.

Data engineers build the pipelines for storage while software engineers are responsible for capturing data across the product. Data engineers are also responsible at times for the cleanliness of the data and adding business logic to make data easier to manipulate.

I use Python and Scala as my programming languages. For workflow management, I use platforms like Airflow. Also, I spend most of my time on Amazon Web Services, including EMR, S3, Redshift, EC2, ECS, Glue, SQS, RDS, Rekognition, CloudWatch and Lambda. For large-scale data, I use Spark, the data warehouse Redshift and column-oriented storage solutions like Parquet.

 

Building and designing data infrastructures can bring value and make an impact on different industries.”
 

Describe a project you’re working on right now, including the goal, your approach to the problem and challenges you’ve encountered.

Google Analytics provides a lot of useful information for any business. Integrating GA data with our internal data sources has helped us glean profound insights.

The project I am working on involves integrating GA data with our data warehouse so that staff can easily find more insights from it. The high-level solution will be extracting, transforming and loading data from Google BigQuery to our data warehouse using workflow management platforms, data storage services and data warehouse services.

The challenge of this project will be defining what data we need and then figuring out how to extract it from Google Analytics using Google BigQuery. Instead of dumping all the data from GA, I need to work with the business and product teams at Meetup to understand what questions they want to answer with data and how we can find those answers in GA.
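The “extract only what answers the question” approach Tsai describes could look like this in miniature. The event names are hypothetical stand-ins, not Meetup’s actual Google Analytics schema.

```python
# Hypothetical raw GA-style event rows; only a few event types answer the
# business and product teams' questions, so extract just those.
raw_events = [
    {"event_name": "page_view", "user_id": "u1"},
    {"event_name": "rsvp_submitted", "user_id": "u1"},
    {"event_name": "scroll", "user_id": "u2"},
    {"event_name": "rsvp_submitted", "user_id": "u2"},
]

# The set of wanted events would be defined with business stakeholders.
WANTED = {"rsvp_submitted"}

def extract_needed(events, wanted=WANTED):
    return [e for e in events if e["event_name"] in wanted]
```

The same filter would be expressed as a WHERE clause in the BigQuery extraction query, so that unneeded data is never pulled into the warehouse at all.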

 

 

 

 


 

Rafael Pacas
Data Engineer • Justworks

 

What led you to a career as a data engineer?

Becoming a data engineer was a combination of what I enjoy most as an engineer, the type of problems faced in this domain and the modern set of mature tools in the space. When presented with a problem, I always research and work to understand all of the nuances before creating a solution. 

For example, when moving data from a source to a warehouse, an engineer faces many challenges hidden beneath the mental map of the problem. This is common with third-party API sources, which frequently limit your number of requests per day or per second. In this situation, scalability and robustness are priorities. Regarding scalability, the pipeline must be able to quickly move hundreds of millions or billions of tuples without dropping any of them while still respecting the service limits. Regarding robustness, it must handle network disconnections, request throttling and retrying failed payloads. My best value proposition is delivering quality data to our end users as soon as possible. This is why researching what tools are available in an infrastructure stack and determining how to piece them together into a robust, reliable, scalable and timely pipeline is what I enjoy most.
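The robustness Pacas describes for rate-limited third-party APIs can be sketched as a retry wrapper with exponential backoff. The flaky source below is simulated, and the delays are shortened for the example.

```python
import time

# Retry a flaky extraction call with exponential backoff.
def with_retries(call, max_attempts=4, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(base_delay * 2 ** attempt)  # back off before retrying

# Simulated source that fails twice (e.g. throttled) before succeeding.
attempts = {"n": 0}
def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("throttled")
    return {"rows": 42}

result = with_retries(flaky_fetch)
```

Production pipelines typically add jitter to the delay and cap the total wait so that retries themselves don’t violate the provider’s rate limits.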

 

How does data engineering differ from more traditional software development, and what are the key technical skills that you use most often during your workday? 

There are a few key differences, but for the most part, your traditional engineering experience is fully transferable. It all depends on the maturity of a data services team. At early-stage organizations where the team is a single data engineer who acts as their own product manager, infrastructure architect and QA engineer, more coding is required to develop and maintain data pipelines. In this situation, independent, strong coders who love communicative code will be extremely valuable to the team and business.

As a team begins to grow, the differences become easier to spot. At this stage, data engineers need to match their technical skills with their communication skills and become specialists in transpiling business talk into data infrastructure and modeling decisions. We also attend more meetings!

 

Data engineers must increase their breadth of knowledge about the business domain.”

 

Describe a project you’re working on right now, including the goal, your approach to the problem and challenges you’ve encountered.

My current project is refactoring a five-year-old custom Ruby extract, transform and load pipeline. Justworks depends on this data for reporting and analysis of our sales funnel and more. Our current approach raises three main issues.

First, we replace Salesforce object data every three days, going against the best practice of treating raw data as immutable. Second, refreshing object data may propagate column deletions, removing historical data that could be serving a purpose in metrics, dashboards or analysis downstream; a column should only be removed or hidden as part of the transformation layer. Lastly, the legacy script extracts, transforms and loads data for every table in Salesforce into our warehouse, which means that maintaining anything in this script risks affecting each aspect of the pipeline. Splitting concerns helps our team maintain the pipeline by reducing risk and resource use.

The goal of the refactor is to address these three issues while adopting modern tools and instituting more industry-standard processes and patterns.
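The immutable-raw-data practice Pacas mentions can be illustrated with an append-only raw layer, where each sync adds a dated snapshot instead of overwriting the previous one. This is a toy sketch, not the Justworks pipeline; the row shapes are invented.

```python
import datetime

# Append-only raw layer: each load adds rows tagged with a load date,
# so earlier snapshots are never mutated or deleted.
raw_layer = []

def load_snapshot(rows, load_date):
    raw_layer.extend({**row, "_loaded_on": load_date} for row in rows)

# Downstream transformations pick the view they need, e.g. the latest rows.
def latest(raw):
    newest = max(r["_loaded_on"] for r in raw)
    return [r for r in raw if r["_loaded_on"] == newest]

load_snapshot([{"id": 1, "stage": "lead"}], datetime.date(2022, 2, 1))
load_snapshot([{"id": 1, "stage": "won"}], datetime.date(2022, 2, 4))
```

Because history is preserved, a dropped or renamed column in the source can be handled in the transformation layer rather than silently erasing data downstream.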

 

 

 



 

Vikram Thirumalai
Data Engineer • Newsela

 

What led you to a career as a data engineer?

Before Newsela, I had the opportunity to be an advanced analytics consultant. I really enjoyed analyzing the moving parts of a data ecosystem to understand the patterns within. That aspect, combined with an affinity for software engineering, prompted me to explore data engineering, the best of both worlds. 

One of the big selling points was the opportunity to work with all types of data and analysis, while also being more social and collaborative in your work. You have stakeholders outside of your own team who use what you’re building, and you have to work closely with them to understand how to build a solution to their problems. I also love that it’s an opportunity for me to get closer to data science.

 

How does data engineering differ from more traditional software development, and what are the key technical skills that you use most often during your workday?

Data engineering is different because your impact is immediately tangible and direct. Newsela’s data engineering team is responsible for provisioning the data that feeds into the reports and dashboards that our executive, product and go-to-market teams use to make critical business decisions. It’s really rewarding to build things that my colleagues use on a daily basis and to be able to help drive our company forward through data. 

With data engineering, there’s a big emphasis on understanding the data you’re working with, as well as knowing the end users and the finished product. At the same time, all the coding fun and challenges of software engineering remain. Understanding data quality is also crucial: you not only have to run integration tests and database regression tests, you also have to test whether your code works and how it affects the end product. Experience with a workflow orchestration tool is also important because it simplifies the visualization and execution of our data pipelines and ETL processes, enabling us to create effective and efficient pipelines.

 

It’s really rewarding to build things that my colleagues use on a daily basis and help drive our company forward through data.”

 

Describe a project you’re working on right now, including the goal, your approach to the problem and challenges you’ve encountered.

We wanted to create and ingest daily snapshots of our transactional databases without having to store a full copy every day. The goal was to deliver a reliable, scalable, extensible and incremental data engineering solution. We started by identifying the source datasets whose history we needed to preserve. Then we focused on understanding the downstream use cases and testing rigorously for data quality to make sure there were no downstream impacts. 

Two challenges we faced were making sure the data quality testing was up to par and understanding how our data is used downstream, which we resolved through a lot of conversations with the data warehousing and business intelligence teams. I really enjoy the idea that we took something that was originally a third-party solution and replaced it with a custom-built, in-house solution. I also loved being able to pull in our entire MySQL database and work with such a large dataset. It’s fascinating to be able to touch the origins of our data and understand it a bit more, while also seeing how it gets transformed. I loved having a more intimate look at what’s behind the analytics products that we use every day.
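The incremental-snapshot idea Thirumalai outlines, storing only what changed rather than a full daily copy, can be sketched as a diff between yesterday’s snapshot and today’s source. The row shapes here are invented for illustration.

```python
# Keep only rows that are new or changed since the previous snapshot,
# instead of storing a full copy of the source every day.
def changed_rows(previous, current):
    prev_by_id = {r["id"]: r for r in previous}
    return [r for r in current if prev_by_id.get(r["id"]) != r]

yesterday = [{"id": 1, "plan": "basic"}, {"id": 2, "plan": "pro"}]
today = [
    {"id": 1, "plan": "basic"},       # unchanged: not stored again
    {"id": 2, "plan": "enterprise"},  # changed: stored
    {"id": 3, "plan": "basic"},       # new: stored
]

delta = changed_rows(yesterday, today)
```

Storing only the delta, together with the date each version became valid, lets any historical snapshot be reconstructed on demand without a full copy per day.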

 

 

Responses have been edited for length and clarity. Header image by WindAwake via Shutterstock. All other images via listed companies.