Blogs · 6 min · December 15, 2022
The Form3 engineering department is divided into groups, internally called "business lines". Each business line is responsible for a separate part of engineering at Form3. Some lines are arranged around products, such as UK and Euro, and those teams build products on the platform. Other teams are responsible for cross-cutting concerns, such as the Platform Engineering teams, which provide a platform for other engineering teams to build on. The Platform Engineering teams also take care of many important aspects, such as providing a secure way to run workloads.
Form3 solve a problem that banks have, which is moving money between accounts at different banks. Looking at an example, let's say we need to move money between Bank A and Bank B. At a lower level, a couple of things have to happen:
The answer to all of these problems is a payment scheme. There are many types of payment schemes around the globe. In the UK, the two biggest ones are FPS and BACS. In Europe, there are the SEPA payment schemes.
These payment schemes solve the fundamental problem of moving money between banks, but they all do it in completely different ways. They differ in connection mode, message formats, error handling and more. Therefore, for Bank A to serve the UK market, it would have to build two separate integrations, one for each UK payment scheme, to allow customers to send and receive money. This results in a lot of infrastructure to maintain, as some payment schemes require it. There is also a maintenance burden, as payment schemes make changes to their APIs every year.
The Form3 proposition is a single, unified API that provides cloud-based and private connectivity options. All the scheme specifics are abstracted behind the API, allowing customers to integrate with the API once. Customers create a single resource that describes the intent of what they want to do and on which payment scheme. Internally, the Form3 platform then handles the mapping and all the other integration details with the desired payment schemes.
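As a rough illustration of this idea, the sketch below models a single payment resource that names its target scheme, with the scheme-specific mapping hidden behind one entry point. Every name here (`Payment`, `submit`, the adapter functions and the message shapes) is hypothetical and chosen for illustration only; it is not the Form3 API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Payment:
    """Hypothetical unified resource: the intent plus the target scheme."""
    scheme: str            # e.g. "FPS", "BACS" or "SEPA"
    amount: str            # decimal amount kept as a string to avoid float rounding
    currency: str
    debtor_account: str
    creditor_account: str


# Illustrative adapters: real scheme messages are far richer than these dicts.
def to_fps(p: Payment) -> dict:
    return {"format": "fps-message", "amt": p.amount, "ccy": p.currency,
            "from": p.debtor_account, "to": p.creditor_account}


def to_bacs(p: Payment) -> dict:
    return {"format": "bacs-record", "value": p.amount, "currency": p.currency,
            "originator": p.debtor_account, "destination": p.creditor_account}


def to_sepa(p: Payment) -> dict:
    return {"format": "sepa-message", "InstdAmt": p.amount, "Ccy": p.currency,
            "Dbtr": p.debtor_account, "Cdtr": p.creditor_account}


# One registry hides every scheme-specific mapping behind a single call.
ADAPTERS: Dict[str, Callable[[Payment], dict]] = {
    "FPS": to_fps,
    "BACS": to_bacs,
    "SEPA": to_sepa,
}


def submit(p: Payment) -> dict:
    """Map the unified resource to the message shape of the chosen scheme."""
    try:
        adapter = ADAPTERS[p.scheme]
    except KeyError:
        raise ValueError(f"unsupported scheme: {p.scheme}") from None
    return adapter(p)


if __name__ == "__main__":
    payment = Payment("FPS", "10.00", "GBP", "bank-a-account", "bank-b-account")
    print(submit(payment))
```

The caller only ever builds a `Payment` and calls `submit`; supporting another scheme in this sketch means adding one adapter to the registry, without changing the caller.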
When Form3 started with 4 engineers, they decided to build everything on the cloud, using infrastructure as code (IaC) and a containerised microservice architecture. This also allowed the early team to offload as much responsibility as possible to the cloud vendor, so the Form3 engineers could focus on building products.
The primary services they chose to build the platform on were:
This design really served Form3 well and allowed the team to scale.
However, when some of the bigger banks wanted to move to the platform, the regulators wanted them to provide an exit strategy so that they would not be coupled to any particular cloud vendor. The regulators' view is that an outage at a single cloud vendor should not affect the UK economy. The bigger banks then pushed this requirement onto Form3.
There were a couple of directions the team could have gone to solve this issue:
At a high level, the data storage technologies used by the multi-cloud platform use the Raft consensus algorithm under the covers. This means that the multi-cloud solution needs 3 places to store data to fulfil the requirement of not being dependent on any single cloud. Due to this, the team decided to build the platform on the 3 biggest vendors: AWS, GCP and Azure.
The three clouds are then networked together on a Form3 private network, with a Kubernetes cluster running in each. A NATS JetStream cluster spans the three clouds, and the CockroachDB database spans them as well. This gives the product teams a consistent architecture to build against, as they only have to integrate with the cloud-agnostic technologies. In fact, they don't even need to care about which cloud their workload is running in!
This multi-cloud solution means that the software is much easier to write and the team only need to maintain one version of their services.
The need for running in three clouds comes down to Form3's business requirement for high consistency: payments cannot be lost or duplicated. As previously discussed, CockroachDB uses Raft. The team have configured it so that two of the three clouds have to agree on a write for it to be consistent. With quorum-based consensus, the cluster should have an odd number of nodes, and a majority of them have to agree. Due to this configuration, the team have built the multi-cloud solution on three clouds to ensure that writes can continue in the case of an outage of one cloud or node.
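The arithmetic behind this can be sketched in a few lines. This is generic majority-quorum arithmetic rather than CockroachDB configuration, and it assumes one replica per cloud for simplicity; the function names are illustrative.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority is still reachable."""
    return replicas - quorum(replicas)


def can_commit(replicas: int, healthy: int) -> bool:
    """A write can be committed only while a majority of replicas is healthy."""
    return healthy >= quorum(replicas)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"replicas={n} quorum={quorum(n)} tolerated={tolerated_failures(n)}")
    # replicas=2 quorum=2 tolerated=0
    # replicas=3 quorum=2 tolerated=1
    # replicas=4 quorum=3 tolerated=1
    # replicas=5 quorum=3 tolerated=2
```

With two replicas, losing either one blocks writes, so two clouds are not enough. Three replicas tolerate one loss, and a fourth adds no extra tolerance, which is why odd replica counts are the usual choice.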
This is one of the most challenging aspects of running a multi-cloud architecture. The team had two high-level options. The first option was to connect through Internet-based connections, but it's not ideal for sensitive payment data to travel through a public network, and the internet gives no guarantees of network stability or latency. The better option was to use a private network between the clouds with guaranteed latency and SLAs. To facilitate that, the team made a connection from each of the clouds down to the data centers. The cloud vendors provide these possibilities out of the box.
With these connections in place, the team used private CIDR ranges and subdivided them between the clouds. The routers in the data centers, where the cables come in from the clouds, have routes set up to handle these ranges and send traffic to the correct cloud based on the range the destination address falls in.
On the cloud side, the connection comes into a gateway and then onto a router, which is set up to route the traffic on to the VPC inside the cloud. For example, if a pod in AWS wants to send a request to a pod in GCP, it sends the request down to the data center, the router in the data center sends it on to GCP, and the infrastructure inside GCP forwards the request on to the VPC as configured.
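The range-based routing decision described above can be sketched with Python's standard `ipaddress` module. The address ranges and the `route` helper are assumptions made for illustration; the actual ranges and router configuration used by Form3 are not given in this post.

```python
import ipaddress
from typing import Optional

# Hypothetical allocation: 10.0.0.0/8 split into four /10 blocks, with the
# first three assigned to the clouds and the fourth left unallocated.
_blocks = list(ipaddress.ip_network("10.0.0.0/8").subnets(prefixlen_diff=2))
CLOUD_RANGES = dict(zip(["aws", "gcp", "azure"], _blocks))


def route(destination: str) -> Optional[str]:
    """Return the cloud whose range contains the destination, if any."""
    addr = ipaddress.ip_address(destination)
    for cloud, network in CLOUD_RANGES.items():
        if addr in network:
            return cloud
    return None  # no route: the address is outside every allocated range


if __name__ == "__main__":
    for cloud, network in CLOUD_RANGES.items():
        print(cloud, network)
    # aws 10.0.0.0/10
    # gcp 10.64.0.0/10
    # azure 10.128.0.0/10
    print(route("10.70.1.5"))   # gcp
    print(route("10.200.0.1"))  # None
```

Because the ranges do not overlap, a destination address alone is enough for the data center router to pick the right cloud in this sketch.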
One final detail is that the Kubernetes clusters are set up so that the pods are allocated IP addresses within the cloud VPC. A typical Kubernetes setup puts the pod IPs in a different CIDR range from the host network. In the multi-cloud configuration, the pods are in the same CIDR range as the host VPC.
This setup allows pods to communicate with each other across the clouds and underpins the ability to run CockroachDB in the multi-cloud setup, where CockroachDB nodes need to be able to communicate to reach consensus.
The new multi-cloud platform has been designed for big customers, so performance is an important aspect of the solution. The performance numbers are orders of magnitude faster than the previous solution, even though it is running across multiple cloud vendors. In general, the latency between the clouds is comparable to the latency between two Availability Zones within a single cloud.
CockroachDB solves the problem of consistency for the Form3 platform. This does sacrifice a little performance, as every write requires an extra request to another cloud. For that reason, CockroachDB should only be used for the hot-path payment processing functionality of the platform. Reporting and other types of processing will be handled outside of CockroachDB.
Looking back at the project, Kevin identifies the biggest challenges that he and the platform team have overcome:
The complex multi-cloud project really pushes the boundaries of what's possible in engineering and is a testament to the great engineering talent at Form3. The project has been running for nearly two years and is due to go live in 2023.
Kevin encourages anyone who is interested in solving some of these problems to check the Form3 vacancies board.
Adelina is a polyglot engineer and developer relations professional, with a decade of technical experience at multiple startups in London. She started her career as a Java backend engineer, later converted to Go, and then transitioned to a full-time developer relations role. She has published multiple online courses about Go on the LinkedIn Learning platform, helping thousands of developers up-skill with Go. She has a passion for public speaking, having presented on cloud architectures at major European conferences. Adelina holds an MSc in Mathematical Modelling and Computing.