After their six-month anniversary, our senior engineers take on the extra responsibility of joining the oncall rota. We know this is a crucial aspect of engineering life, and it is essential that everyone on the engineering team has good oncall experiences! This page gives you some detailed insight into our approach to oncall at Form3.
We operate under a true DevOps culture, so our engineers are responsible for building, maintaining and supporting our high-volume platform. We firmly believe that supporting a platform oncall is a great learning experience in how to debug and fix a complex system at scale.
In general, the responsibilities of the engineer oncall are:
Responding to incidents on our production and staging environments.
Identifying and fixing errors from the backlog to contribute to platform stability.
Supporting customer queries regarding setup or functionality.
Our engineers are in charge of maintaining the monitoring and alerting of their services, ensuring their services are not noisy and that their alerts are meaningful and actionable.
First off, let's cover some of the tooling we use for our oncall rotas. All of our engineers have company phones (currently iPhones). All necessary company software and accounts are available on these devices.
The main tools that engineers use oncall are:
We are always improving our documentation and tooling to ensure that they are up to the highest standards.
It's important to us that our engineers never feel alone, pressured or unsupported when they're oncall. We have dedicated incident managers to support the engineer oncall throughout the lifecycle of an incident.
Typically, our incident process consists of:
The incident manager and the relevant engineer oncall get paged. They start the initial investigation, assess customer impact and coordinate any customer communication that might be required.
The incident manager identifies next steps to take in the runbooks. With support from the incident manager, the engineer begins to investigate the issue. The incident manager is in charge of the required communications.
If needed, other engineers are brought in to support incident resolution. In particular, production fixes are only done using timeboxed credentials and in pairs, so the engineer oncall does not have to make changes in production alone.
Finally, once the incident is resolved, the postmortem process begins. A detailed, no-blame investigation of the incident's cause and the response to it is conducted. Changes to services, alerting and runbooks are identified and prioritised.
Our platform runs critical payments for our clients, so we provide round-the-clock oncall support. Everyone from the engineering group takes part in the rota, including our leads, heads of engineering and the executive team!
We divide oncall shifts between office hours and out of hours shifts. During office hours, the engineer oncall has the support of the entire team, but is the first point of contact in the case of an incident or error. Out of hours shifts run on a daily rotation, while office hours shifts can be longer, depending on the team.
As out of hours shifts are less convenient than office hours shifts, they are remunerated on top of the engineer's base salary. Engineers are paid for making themselves available for a shift, even if they are not paged. We also give our engineers time in lieu if they are called out for a longer time to fix a challenging issue.
The frequency of oncall shifts for each engineer varies with team size. Our heads of engineering constantly review schedules and teams to ensure that we strike the right balance between team size and oncall frequency. This is a constant work in progress.
Most engineers currently take on a shift every 2 weeks. Team members support each other to accommodate annual leave and personal schedules. Swapping shifts, or part of a shift, is normal and easy to do with PagerDuty.
We provide support and training for engineers going oncall. While the process varies across our teams, the oncall onboarding generally follows this structure:
New oncall engineers shadow their experienced colleagues once they finish onboarding to the team. This gives them the opportunity to get used to being oncall as well as to learn more about our platform.
After getting plenty of shadowing experience, the engineers usually join the oncall rota by taking on office hours oncall shifts. This gives them the experience of being the primary point of contact in the case of an incident, while still getting the support of their team members.
When they feel ready, the engineers begin to take on out of hours oncall shifts. However, as mentioned previously, engineers are never alone on their shifts.