.tech Podcast - Building multi-cloud at Form3

Blogs · 6 min · December 15, 2022

Kevin Holditch is Head of Platform Engineering at Form3. He joins us to share the interesting problems the Platform Engineering team work on and how the Form3 payments platform was built. Then, he explains why the team decided to build a multi-cloud platform across three clouds and presents an overview of how the technologies behind this exciting project are configured.

Kevin leads and looks after the Platform Engineering teams, who have been working on the Form3 multi-cloud platform.

The platform team at Form3

The Form3 engineering department is divided into different groups, internally called "business lines". Each of these business lines is responsible for a separate part of engineering at Form3. Some lines are arranged around products, such as UK, Euro, etc., and those teams build products on the platform. Other teams are responsible for cross-cutting concerns, such as the Platform Engineering teams, which provide a platform for other engineering teams to build on. The Platform Engineering teams take care of many important aspects, such as providing a secure way to run workloads.

Form3 solve a problem that banks have, which is moving money between accounts at different banks. Looking at an example, let's say we need to move money between Bank A and Bank B. At a lower level, a couple of things have to happen:

  • First, the account holder at Bank A will need to instruct their bank to send money to an account holder at Bank B.
  • Next, Bank B will need to validate that they have such an account holder and accept that the money will come in.
  • Finally, once both parties have accepted, they need a way to transfer the money between each other.

The answer to all of these problems is a payment scheme. There are many types of payment schemes around the globe. In the UK, the two biggest ones are FPS and BACS. In Europe, there are the SEPA payment schemes.

These payment schemes solve the fundamental problem of moving money between banks, but they all do it in completely different ways. They differ in connection mode, message formats, error handling, etc. Therefore, for Bank A to serve the UK market, they would have to build two separate integrations to the UK payment schemes in order to allow customers to send and receive money. This results in a lot of infrastructure to maintain, since some payment schemes require dedicated infrastructure. There is also an ongoing maintenance burden, as payment schemes make changes to their APIs every year.

The Form3 proposition is a single, unified API that provides cloud-based and private connectivity options. All the scheme specifics are abstracted behind the API, allowing customers to integrate with the API once. Customers create a single resource that describes the intent of what they want to do and on which payment scheme. Internally, the Form3 platform then handles the mapping and all the other integration details with the desired payment schemes.
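To make the idea concrete, here is a minimal Python sketch of a single payment resource being mapped to scheme-specific integrations. It is not Form3's actual API: the `Payment` fields, the adapter functions and the placeholder message shapes are all hypothetical, chosen only to show how one resource that names its scheme can hide the scheme details from the customer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payment:
    """Hypothetical resource describing what the customer wants to do."""
    scheme: str            # which payment scheme to use, e.g. "FPS"
    amount_minor: int      # amount in minor units (pence, cents)
    currency: str
    debtor_account: str
    creditor_account: str


# Hypothetical adapters: each one hides a single scheme's specifics.
# The returned dicts are placeholders, not real scheme message formats.
def to_fps(p: Payment) -> dict:
    return {"scheme": "FPS", "amount": p.amount_minor, "to": p.creditor_account}


def to_bacs(p: Payment) -> dict:
    return {"scheme": "BACS", "amount": p.amount_minor, "to": p.creditor_account}


def to_sepa(p: Payment) -> dict:
    return {"scheme": "SEPA", "amount": p.amount_minor, "to": p.creditor_account}


ADAPTERS = {"FPS": to_fps, "BACS": to_bacs, "SEPA": to_sepa}


def submit(p: Payment) -> dict:
    """Single entry point: pick the adapter from the resource's scheme."""
    try:
        adapter = ADAPTERS[p.scheme]
    except KeyError:
        raise ValueError(f"unsupported payment scheme: {p.scheme}") from None
    return adapter(p)
```

The customer only ever builds a `Payment` and calls `submit`; adding a new scheme in this sketch means adding one adapter, without changing the customer-facing shape.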

A look back at the 2016 platform

When Form3 started with 4 engineers, they decided to build everything in the cloud, using infrastructure as code (IaC) and a containerised microservice architecture. This also allowed the early team to offload as much responsibility as possible to the cloud vendor, allowing the Form3 engineers to focus on building products.

The primary services they chose to build the platform on were:

  • AWS ECS for running Docker containers.
  • AWS SQS & AWS SNS for asynchronous processing of payments.
  • PostgreSQL running on AWS RDS, which manages backups and updates and keeps the database running.

This design really served Form3 well and allowed the team to scale.

Building multi-cloud

However, when some of the bigger banks wanted to move to the platform, the regulators wanted them to provide an exit strategy so that they are not coupled to any particular cloud vendor. The regulators' concern is that an outage at a single cloud vendor should not affect the UK economy. The bigger banks then pushed this requirement onto Form3.

There were a couple of directions the team could have gone to solve this issue:

  • They could have picked another cloud and all the proprietary technologies needed. For example, they could have picked Google and all the corresponding services that they provide. However, this would have required a full rewrite of the platform, maintaining two different versions of the services and a Big Bang migration of all the services before any single payment could be processed on the new solution.
  • They could have picked technologies that run on any cloud to replace the proprietary ones they were using: Kubernetes, NATS JetStream and CockroachDB. This option allowed the team to build a Form3 platform that works on any cloud and to deliver services on a rolling basis, avoiding the Big Bang migration required by the first approach.

Multi-cloud architecture

At a high level, the data storage technologies used by the multi-cloud platform use Raft under the covers. This means that the multi-cloud solution needs 3 places to store data to fulfill the requirement of not being dependent on any cloud. Due to this, the team decided to build the platform on the 3 biggest vendors: AWS, GCP and Azure.

The three clouds are then networked together on a Form3 private network, with a Kubernetes cluster running in each. A NATS JetStream cluster spans the three clouds, and CockroachDB spans the clouds in the same way. This gives the product teams a consistent architecture to build against, only having to integrate with the cloud-agnostic technologies. In fact, they don't even need to care about which cloud their workload is running in!

This multi-cloud solution means that the software is much easier to write and the team only need to maintain one version of their services.

The need for running in three clouds comes down to Form3's business requirement for high consistency: payments cannot be lost or duplicated. As previously discussed, CockroachDB uses Raft. The team have set it up so that two of the three clouds have to agree on a write for it to be consistent. Quorum-based consensus means that a majority of nodes have to agree, which is why an odd number of nodes is used. Due to this configuration, the team have built the multi-cloud solution on three clouds to ensure that writes can continue to happen during an outage of one cloud or node.
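The majority arithmetic behind this choice can be sketched in a few lines of Python. This is only an illustration of quorum counting, not CockroachDB's implementation, and the function names are made up for this example.

```python
def quorum(nodes: int) -> int:
    """Smallest number of voters that forms a majority of `nodes`."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return nodes - quorum(nodes)


def can_commit(acks: int, nodes: int) -> bool:
    """A write is accepted once a majority of voters has acknowledged it."""
    return acks >= quorum(nodes)
```

With three voters, `quorum(3)` is 2 and `tolerated_failures(3)` is 1, so writes continue when one cloud is down. With two voters the quorum is also 2, so losing either one stops writes, and a fourth voter raises the quorum to 3 without tolerating any additional failure, which is why an odd count is the usual choice.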

Networking in multi-cloud

This is one of the most challenging aspects of running a multi-cloud architecture. The team had two high-level options. The first option was to connect through Internet-based connections, but it's not ideal for sensitive payment data to travel through a public network, and the internet gives no guarantees of network stability or latency. The better option was to use a private network between the clouds with guaranteed latency and SLAs. In order to facilitate that, the team made a connection between each of the clouds down to the data centers. The cloud vendors provide these possibilities out of the box.

With these connections in place, the team used private CIDR ranges and subdivided them between the clouds. The routers in the data centers, where the cables come in from the clouds, have routes set up to handle these ranges and send traffic to the correct cloud based on the range the destination address falls in.

On the cloud side, the connection comes into a gateway and then onto a router, which is set up to route the traffic on to the VPC inside the cloud. For example, if a pod in AWS wants to send a request to a pod in GCP, it will send the request down to the data center, the router in the data center will send it on to GCP, and the infrastructure inside GCP will forward the request to the VPC as configured.
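The range-based routing decision described above can be sketched with Python's standard `ipaddress` module. The CIDR ranges below are invented for illustration and are not Form3's real allocation; `next_hop` is likewise a hypothetical name standing in for the data center router's lookup.

```python
import ipaddress

# Hypothetical allocation: one private range subdivided between the clouds.
CLOUD_RANGES = {
    "aws": ipaddress.ip_network("10.0.0.0/12"),
    "gcp": ipaddress.ip_network("10.16.0.0/12"),
    "azure": ipaddress.ip_network("10.32.0.0/12"),
}


def next_hop(destination: str) -> str:
    """Return the cloud whose range contains the destination address."""
    addr = ipaddress.ip_address(destination)
    for cloud, network in CLOUD_RANGES.items():
        if addr in network:
            return cloud
    raise LookupError(f"no route for {destination}")
```

In this sketch, a request from a pod in AWS to `10.17.4.9` resolves to `gcp`, mirroring the AWS-to-GCP path in the example, while an address outside all three ranges has no route.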

One final detail is that the Kubernetes clusters are set up so that the pods get allocated IP addresses within the cloud VPC. A typical Kubernetes setup sees the pod IPs in a different CIDR range to the host network. In the multi-cloud configuration, the pods are in the same CIDR range as the host VPC.
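As a small illustration of the difference, the check below tests whether a pod range sits inside a VPC range, again using Python's `ipaddress` module. The ranges and the function name are hypothetical and only show the property being described.

```python
import ipaddress


def pods_inside_vpc(vpc_cidr: str, pod_cidr: str) -> bool:
    """True when every pod address is also an address in the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    pods = ipaddress.ip_network(pod_cidr)
    return pods.subnet_of(vpc)
```

For example, `pods_inside_vpc("10.16.0.0/12", "10.20.0.0/16")` is true, so routes that cover the VPC range also reach the pods directly. A separate overlay range such as `100.64.0.0/16` fails the check, and pods in it would not be reachable through those routes without extra translation.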

The setup allows pods to communicate with each other across the clouds and underpins the ability to run CockroachDB in the multi-cloud setup, where CockroachDB nodes need to be able to communicate to reach consensus.

Performance & consistency

The new multi-cloud platform has been designed for big customers, so performance is an important aspect of the solution. The performance numbers are orders of magnitude faster than the previous solution, even though it is running across multiple cloud vendors. In general, the latency between the clouds is comparable to two Availability Zones within a single cloud.

CockroachDB solves the problem of consistency for the Form3 platform. This does sacrifice a little bit of performance, as an extra request to another cloud must be made on every write. CockroachDB should therefore only be used for the hot-path payment processing functionality of the platform. Reporting and other types of processing will be handled outside of CockroachDB.

Tackling the biggest challenges

Looking back at the project, Kevin identifies the biggest challenges that he and the platform team have overcome:

  • Networking across the clouds.
  • Operating multiple Kubernetes clusters across the clouds on the managed Kubernetes offering of each cloud vendor.
  • Cross-cloud service discovery with static IP addresses and exposed DNS.

The complex multi-cloud project really pushes the boundaries of what's possible in engineering and is a testament to the great engineering talent at Form3. The project has been running for nearly two years and is due to go live in 2023.

Kevin encourages anyone who is interested in solving some of these problems to check the Form3 vacancies board.

Written by

Adelina Simion, Technology Evangelist

Adelina is a polyglot engineer and developer relations professional, with a decade of technical experience at multiple startups in London. She started her career as a Java backend engineer, later converted to Go, and then transitioned to a full-time developer relations role. She has published multiple online courses about Go on the LinkedIn Learning platform, helping thousands of developers up-skill with Go. She has a passion for public speaking, having presented on cloud architectures at major European conferences. Adelina holds an MSc in Mathematical Modelling and Computing.