What Motivates Companies to Build Site Reliability Teams

Why do companies need SRE and what happens once that team is built? These tech leaders have the answers.

Written by Michael Hines
Published on Feb. 03, 2022

Following Google’s lead is a pretty good call for most tech companies, at least when the goal is to build a product or service used by billions. This manifests in a variety of ways, with one of the most recent being the increased adoption of site reliability engineering (SRE). 

Another wise move, this one for engineering leaders, is to ask why a company needs a site reliability team in the first place and how that team would be built. While reliability is a universal goal, it can mean different things not only to different companies but also at different stages of a company’s growth. Yes, the skills required to be an SRE overlap considerably with those of developers and DevOps engineers. But the role itself is a bit of a grab bag of tasks. It includes, but is not limited to, putting out fires, creating automation tools and working across teams and departments to ensure that reliability isn’t being sacrificed for the sake of speed.

How do you know if SRE is worth investing in? We spoke with tech leaders at three local companies that have created site reliability teams to answer that question. We asked each about what drove the decision to create an SRE team along with the benefits and trade-offs that come with taking this path and what they look for when hiring SREs.

 


 

Karim Ali
Director of Production Engineering • Galaxy Digital

What prompted you to create a site reliability team within your engineering organization?

There are two key reasons why an organization needs a site reliability team: to ensure that services have a high degree of availability and to enable developers to ship code as efficiently as possible. These factors have become increasingly important as Galaxy grows both our product offerings and engineering team.

Crypto is a 24/7 market and downtime is not acceptable. Additionally, as we increase things like the number of users or the number of supported exchanges, platform scalability becomes the top priority. As we hire more engineers and our book of work expands, a world without CI/CD becomes untenable. With the site reliability team, the engineering team can meet the needs of the business and in turn our clients.
 

 

What advantages or improvements have you seen since implementing site reliability engineering, and how do you deal with any potential drawbacks of the approach?

Our site reliability team is instrumental in allowing the business to be agile. Crypto is a rapidly evolving landscape that requires constant reevaluation of priorities. Our SRE team has made significant efforts to move toward infrastructure as code, which allows us to change direction and tackle new priorities quickly. It makes it easy to do things like spin up new services by streamlining processes and putting in place consistent frameworks for managing our infrastructure. 

While this expedites development where we are using existing patterns, introducing new aspects of our infrastructure or making significant changes can be slightly more expensive than putting together a one-off solution, because you can’t just click a button to change something. Every change has to go through the full development lifecycle. However, this is a trade-off we are willing to accept in favor of a more maintainable and scalable system.

 

How is your site reliability team structured, and what are the skills you look for in a good SRE?

Our team is relatively flat, as is our entire organization. The majority of the team handles project work and roughly a quarter of the team is focused on day-to-day asks from developers, like standing up instances and security group changes. When building my team, I look for three key skills: first, a fundamental understanding of how Linux and containerization work; second, expert knowledge of at least one programming language, preferably Python; and third, a fundamental understanding of infrastructure as code.

That said, skills are teachable if the underlying abilities are there. It’s important to understand how people think and how they problem-solve and get through blockers. Having static skills is not enough. They need to be able to learn and grow with the team and evolve their skill set with the ever-changing landscape of technology.

 

 

Todd Resudek
Staff Engineer • Jackpocket

What prompted you to create a site reliability team within your engineering organization?

As our product continued to grow, the need for expertise around optimizing our infrastructure became more apparent. Building a team with both the technical background and company mandate to scale, manage and innovate our infrastructure became important to our long-term health as an organization.
 


What advantages or improvements have you seen since implementing site reliability engineering, and how do you deal with any potential drawbacks of the approach?

Not only did our SRE team improve reliability, they also gave us much greater insight into our systems, which created actionable feedback for our application developers. One of our challenges is finding the sweet spot where application engineers and SREs meet. It is common for developers to throw the work over the proverbial wall to the infrastructure team, but we really wanted to make sure the teams were working together to get the best outcomes. Giving our SREs knowledge of the application code, and vice versa, is the only way to do that.

 

How is your site reliability team structured, and what are the skills you look for in a good SRE?

The SREs work together to tackle large-scale initiatives and organization-wide changes. But each SRE is also embedded in our application teams while their service is in active development. This was done so decisions that affect infrastructure can be adjusted at all phases, not just at the end of the delivery cycle.

A good SRE should not only be technically proficient but also a constant learner. As in much of engineering, the technology changes rapidly, so what we are doing today will likely not be the best solution next year. They also need to have strong communication skills. The role requires interacting with a lot of teams and individuals, and soft skills go a long way in being successful.

 

 

Daniel Seravalli
Engineering Manager • Holler

What prompted you to create a site reliability team within your engineering organization?

The biggest benefit of having dedicated operations engineers is that they allow other software engineers in the company to focus more on product development. And over time, the tools and practices that an SRE team produces speed up product development and increase reliability. For us, we had also arrived at an inflection point where we needed to rapidly transition our cloud resources from scrappy startup mode to well-structured best practices, and we knew it would require professionals to pull it off.
 

 

What advantages or improvements have you seen since implementing site reliability engineering, and how do you deal with any potential drawbacks of the approach?

We’ve seen observability and monitoring become first-class citizens in our engineering organization and have been able to stand up cloud architectures that would have been very difficult and time-consuming previously. We’ve also greatly reduced our exposure to security vulnerabilities as well as single points of failure.

 

How is your site reliability team structured, and what are the skills you look for in a good SRE?

Besides me, the team is made up of two engineers: a lead DevOps engineer and a senior SRE. When hiring for a small ops team, it was important for all team members to have deep, senior-level knowledge in areas such as AWS, Kubernetes, networking and Linux. More broadly, Holler engineering looks for self-starters with a mentality of “we’ll figure it out.” The drive to learn new things and find solutions to problems without hand-holding is key.