.tech Podcast - Role of a platform engineer

Ben Cordero from Snyk discusses where we draw the line between platform and product. Then, he explains the Kubernetes operator pattern and why it makes for a good developer experience. Finally, he touches on what a good on-call rotation looks like and how to structure it without burning out engineers.

Ben is an experienced SRE with a long track record of building and testing systems. Here are some of his key highlights on building and running platforms, as well as his experiences as a platform engineer.

Drawing the line between platform and software

Companies should focus on hiring engineers to build product, so early-stage startups often decide not to have a platform team. As the company grows from 10 to about 100 engineers, it begins to hire platform engineers to enable faster delivery.

There are lots of aspects a platform engineer can focus on: pipelines, build systems, testing or observability stacks. Having someone dedicated to these things can help engineers deliver more efficiently.

  • The platform team can take care of anything that falls between teams or that teams share. However, product teams should be involved in the design and development of these shared pieces of functionality.
  • Communication is key to make sure that the platform team is able to support teams in delivering as fast as possible.

Having a consistent way to deliver workloads allows teams to achieve better efficiency.

  • At Form3, our teams get Terraform workspaces that they are responsible for maintaining, while the platform team is responsible for keeping the underlying clusters running.
  • At Snyk, the Helm chart itself is the hand-off point between platform and product teams. Any namespaced Kubernetes resource can be opened up as the platform API.

The Kubernetes operator pattern

The Kubernetes operator pattern is a good approach for application specific infrastructure. It gives us the following breakdown:

  • From the developer point of view, you have a small YAML file that describes the resources and workloads.
  • Outside of that module, you have the URL you need to connect to, automatically provisioned credentials, metrics, dashboards, alerts, etc., all hidden behind the API abstraction.
  • The implementation of that abstraction is done by the in-house platform team. They can do upgrades and migrations without any changes to the API or disruption to the product teams.
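
The breakdown above can be sketched as a reconcile loop, the core of any operator: compare the desired state from the developer's small YAML file with what actually exists, and compute the actions needed to converge. This is a minimal, dependency-free illustration with hypothetical resource names; a real operator would use a framework such as kopf or controller-runtime and talk to the Kubernetes API.

```python
from dataclasses import dataclass, field

@dataclass
class DatabaseSpec:
    """What the product team declares in their small YAML file."""
    name: str
    storage_gb: int

@dataclass
class ClusterState:
    """What the platform has actually provisioned (name -> storage_gb)."""
    databases: dict = field(default_factory=dict)

def reconcile(spec: DatabaseSpec, state: ClusterState) -> list:
    """Compare desired spec with current state and return the actions
    the operator would take to converge them."""
    actions = []
    current = state.databases.get(spec.name)
    if current is None:
        actions.append(f"create database {spec.name} ({spec.storage_gb}GB)")
        state.databases[spec.name] = spec.storage_gb
    elif current != spec.storage_gb:
        actions.append(f"resize database {spec.name} to {spec.storage_gb}GB")
        state.databases[spec.name] = spec.storage_gb
    # Already converged: nothing to do.
    return actions
```

Because the product team only ever touches the spec, the platform team is free to change how `reconcile` is implemented, e.g. swapping the backing database engine, without the API changing at all.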

Stateful Kubernetes workloads are trickier to update. You could even go so far as to form specialised platform teams that rebuild them on different schedules than the stateless workloads.

Serverless

The industry trend seems to be towards serverless. We have Amazon EKS on AWS Fargate and even Amazon Aurora Serverless. This seems to diminish the need for a dedicated platform team, as we push the platform responsibility to the cloud provider.

The cost efficiency of going serverless is useful for companies trying to optimise infrastructure spend for their product. Another optimisation is single-tenant deployments, for data locality or for risk-averse customers.

There is a case to be made that startups should go serverless from the ground up today, instead of reaching for Kubernetes immediately. However, serverless principles are still relatively new, so you might have a hard time finding engineers with the in-depth knowledge required to build fully serverless. It’s therefore not straightforward to start with serverless.

Reducing the friction of deploying to production

Focus on new starters. New joiners to the company won’t know how to use your tools, solve incidents and run software in production. Make the new starter experience great. This will also make it easier for existing engineers to switch to a new codebase at your company and contribute immediately.

Platform teams should add the pain points of new starters to their roadmaps. They are product teams whose features are developer velocity and on-call incident diagnosis.

Running on-call

Most engineers are expected to be on-call and run platforms that are available 24/7. Keeping the product alive and your customers happy is where the focus and attention should be.

The Google SRE book, Site Reliability Engineering: How Google Runs Production Systems, describes SLOs and alerting on symptoms that actually cause customers pain, as opposed to alerting when a server goes down or a specific CPU threshold is crossed.

Noisy or meaningless alerts burn out engineers.
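
One common way to make alerts track customer pain rather than machine noise is the error-budget burn-rate approach described in the Google SRE books. The sketch below is illustrative: the function names and the 14.4x threshold (which corresponds to consuming 2% of a 30-day budget in one hour) are one example policy, not a production recommendation.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being consumed: 1.0 means we are exactly on
    track to exhaust it at the end of the SLO window."""
    return error_ratio / error_budget(slo_target)

def should_page(error_ratio: float, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when customers are actually hurting: at 14.4x burn rate,
    2% of a 30-day budget disappears in a single hour."""
    return burn_rate(error_ratio, slo_target) >= threshold
```

With a 99.9% SLO, a 2% error ratio burns the budget 20x too fast and pages someone, while a 0.05% error ratio (burn rate 0.5) is within budget and stays quiet, exactly the kind of signal that keeps the pager meaningful.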

If you’re writing code, you should be the one taking the pager for it. Platform engineers can never have enough context on how to fix a product feature bug.

Platform teams can help when there is something fundamentally wrong with the underlying platform, but product teams should have contingencies for outages.

Interested in being a guest speaker?

If you enjoyed this episode and would like to be part of the podcast, then please fill in this form and we’ll be in touch. ✍️

by Adelina Simion, Technology Evangelist

Further resources

Here are some other resources that you might find interesting:

.tech Podcast - AWS Serverless Patterns

Join host Kevin Holditch and AWS Developer Advocate Marcia Villalba, who explains how to implement three common architectures using AWS Serverless technologies.

.tech Podcast - Running Containers on AWS with Aidan Grace

In this episode, our host Kevin Holditch is joined by Aidan Grace, Solutions Architect at AWS, to discuss running containers in AWS. By drawing on Form3's own experience of utilising containers on AWS, Aidan and Kevin discuss the advantages and capabilities of different container set ups.

.tech Podcast - All about Terraform with Anton Babenko

This time Anton Babenko joins our host Kevin Holditch for an episode on Terraform. Terraform is one of the leading tools designed to make managing infrastructure as code across public clouds easier for developers.