Ben Cordero from Snyk discusses where to draw the line between platform and product. Then he explains the Kubernetes operator pattern and why it makes for a good developer experience. Finally, he touches on what makes a good on-call rotation and how to structure one without burning out engineers.
Ben is an experienced SRE with a long track record of building and testing systems. Here are some of his key highlights on building and running platforms, as well as on his experiences as a platform engineer.
Early on, companies should focus on hiring engineers to build product, and early-stage startups often decide not to have a platform team at all. As the company grows from 10 to about 100 engineers, it begins to hire platform engineers to enable faster delivery.
There are lots of aspects a platform engineer can focus on: pipelines, build systems, testing or observability stacks. Having someone dedicated to these things can help engineers deliver more efficiently.
Having a consistent way to deliver workloads allows teams to achieve better efficiency.
The Kubernetes operator pattern is a good approach for application-specific infrastructure.
Stateful Kubernetes workloads are trickier to update. You could even go so far as to form specialised platform teams that rebuild them on a different schedule from the stateless workloads.
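At its core, an operator runs a reconcile loop: it compares the desired state declared in a custom resource with what is actually running, then takes whatever steps close the gap. The following Go sketch is a toy illustration under stated assumptions: DatabaseSpec, DatabaseStatus and Reconcile are hypothetical names, and the loop only plans steps instead of calling the Kubernetes API, as a real operator built with a framework such as controller-runtime would. It also hints at why stateful workloads benefit from an operator: upgrades are rolled through replicas one at a time rather than all at once.

```go
package main

import "fmt"

// DatabaseSpec is a hypothetical custom resource spec: the state the user declares.
type DatabaseSpec struct {
	Replicas int
	Version  string
}

// DatabaseStatus is the hypothetical observed state of the running workload.
type DatabaseStatus struct {
	Replicas int
	Version  string
}

// Reconcile compares desired and observed state and plans the steps needed to
// converge them. A real operator would act through the Kubernetes API; this
// toy version only returns a plan so it can be run and tested in isolation.
func Reconcile(spec DatabaseSpec, status DatabaseStatus) []string {
	var plan []string

	// Scale up: create the missing replicas directly at the desired version.
	for i := status.Replicas; i < spec.Replicas; i++ {
		plan = append(plan, fmt.Sprintf("create replica %d at version %s", i, spec.Version))
	}

	// Scale down: remove the highest-numbered replicas first.
	for i := status.Replicas - 1; i >= spec.Replicas; i-- {
		plan = append(plan, fmt.Sprintf("delete replica %d", i))
	}

	// Upgrade the existing replicas that will remain, one at a time: the kind of
	// ordered, application-aware step that stateful workloads need.
	if status.Version != spec.Version {
		keep := status.Replicas
		if spec.Replicas < keep {
			keep = spec.Replicas
		}
		for i := 0; i < keep; i++ {
			plan = append(plan, fmt.Sprintf("upgrade replica %d to %s", i, spec.Version))
		}
	}
	return plan
}

func main() {
	// Hypothetical example: two replicas running 15.3, three wanted on 15.4.
	spec := DatabaseSpec{Replicas: 3, Version: "15.4"}
	status := DatabaseStatus{Replicas: 2, Version: "15.3"}
	for _, step := range Reconcile(spec, status) {
		fmt.Println(step)
	}
}
```

Once the observed status matches the spec, Reconcile returns an empty plan, so the loop can run repeatedly without side effects; that idempotence is what lets an operator react safely to every change it observes.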
The industry trend seems to be towards serverless. We have Amazon EKS Serverless, AWS Fargate and even Amazon Aurora Serverless. This seems to diminish the need for a dedicated platform team, as we push the platform responsibility onto the cloud provider.
The cost efficiency of going serverless is useful for companies trying to optimise the infrastructure cost of their product. Another optimisation is single-tenant deployment, for customers who need data locality or are risk-averse.
There is a case to be made that startups should go serverless from the ground up today, instead of reaching for Kubernetes immediately. However, serverless is still relatively new, so you might have a hard time hiring engineers with the in-depth knowledge required to build fully serverless systems. Starting with serverless is therefore not straightforward.
Focus on new starters. New joiners to the company won’t know how to use your tools, solve incidents and run software in production. Make the new starter experience great. This will also make it easier for existing engineers to switch to a new codebase at your company and contribute immediately.
Platform teams should add the pain points of new starters to their roadmaps. In effect, they are product teams whose features are developer velocity and on-call incident diagnosis.
Most engineers are expected to be on-call and run platforms that are available 24/7. Keeping the product alive and your customers happy is where the focus and attention should be.
The Google SRE book, Site Reliability Engineering: How Google Runs Production Systems, describes SLOs and alerting on symptoms that actually cause customers pain, as opposed to alerting when a server goes down or when CPU usage crosses a specific threshold.
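One way to alert on customer-facing symptoms is to page on how fast the SLO's error budget is burning, rather than on server or CPU metrics. The following Go sketch assumes a 99.9% request-success SLO over 30 days and follows the multiwindow burn-rate approach described in Google's companion SRE Workbook. Window, BurnRate and ShouldPage are hypothetical names, and the request counts are made-up illustrative numbers, not real data. A threshold of 14.4 means 2% of a 30-day (720-hour) budget is consumed in one hour.

```go
package main

import "fmt"

// Window holds request counts observed over a time window
// (a hypothetical stand-in for whatever the metrics system reports).
type Window struct {
	Total  int
	Failed int
}

// BurnRate returns how fast the error budget is being consumed relative to the SLO:
// 1.0 means the budget would be used up exactly at the end of the SLO period.
func BurnRate(w Window, slo float64) float64 {
	if w.Total == 0 {
		return 0 // no traffic, so no customer pain to report
	}
	errorRatio := float64(w.Failed) / float64(w.Total)
	return errorRatio / (1 - slo)
}

// ShouldPage fires only when both the long and the short window burn fast:
// customers have been seeing errors for a while and are still seeing them now.
func ShouldPage(long, short Window, slo, threshold float64) bool {
	return BurnRate(long, slo) >= threshold && BurnRate(short, slo) >= threshold
}

func main() {
	const slo = 0.999      // assumed: 99.9% of requests succeed over 30 days
	const threshold = 14.4 // 14.4 * 1h / 720h = 2% of the monthly budget per hour

	long := Window{Total: 100000, Failed: 2000} // illustrative last hour: 2% errors
	short := Window{Total: 8000, Failed: 200}   // illustrative last 5 minutes: 2.5% errors

	fmt.Println("page on-call:", ShouldPage(long, short, slo, threshold))
}
```

Requiring both windows to exceed the threshold ties the page to sustained customer pain while letting it clear quickly once errors stop, which cuts down on pages that need no action.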
Noisy or meaningless alerts burn out engineers.
If you’re writing code, you should be the one taking the pager for it. Platform engineers can never have enough context on how to fix a product feature bug.
Platform teams can help when there is something fundamentally wrong with the underlying platform, but product teams should have contingencies for outages.
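This ownership split can be encoded directly in alert routing: alerts for product services page the team that owns the code, and only alerts for shared platform components page the platform team. The following Go sketch is illustrative only. Alert, Route, the rotation names and the service-to-team map are hypothetical, and a real setup would usually express the same rules in its alerting tool's routing configuration.

```go
package main

import "fmt"

// Alert is a simplified, hypothetical alert carrying the service that fired it.
type Alert struct {
	Name    string
	Service string
}

// Route picks the on-call rotation to page. Shared platform components page the
// platform team; product services page the team that wrote and owns the code.
func Route(a Alert, owners map[string]string, platformServices map[string]bool) string {
	if platformServices[a.Service] {
		return "platform-oncall"
	}
	if team, ok := owners[a.Service]; ok {
		return team + "-oncall"
	}
	// No recorded owner: send to a triage rotation so the alert is not dropped.
	return "unowned-triage"
}

func main() {
	owners := map[string]string{"payments-api": "payments"} // hypothetical ownership map
	platform := map[string]bool{"ingress": true}            // hypothetical platform components

	fmt.Println(Route(Alert{Name: "HighErrorRate", Service: "payments-api"}, owners, platform))
	fmt.Println(Route(Alert{Name: "IngressDown", Service: "ingress"}, owners, platform))
}
```

Keeping unowned services in a visible triage rotation, rather than quietly defaulting them to the platform team, makes missing ownership something to fix instead of a hidden on-call burden.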
Here are some other resources that you might find interesting:
Join host Kevin Holditch and AWS Developer Advocate Marcia Villalba, who explains how to implement three common architectures using AWS Serverless technologies.
In this episode, our host Kevin Holditch is joined by Aidan Grace, Solutions Architect at AWS, to discuss running containers in AWS. Drawing on Form3's own experience of using containers on AWS, Aidan and Kevin discuss the advantages and capabilities of different container setups.