James Millar

View Original

Getting to Grips with CAP Theorem

If you’re a software developer working on any kind of cloud platform it’s highly likely that someone has already mentioned the CAP theorem.  If they haven’t already then, they will do soon for sure.

CAP theorem is also known as Brewers theorem after the computer scientist Eric Brewer.

Essentially CAP theorem is a law within computer science that states that it’s impossible for a  distributed data store to simultaneously provide more than two of the following three guarantees:

·       Consistency

·       Availability

·       Partition tolerance

Now a lot has been written about CAP theorem and we’re not going to go into too much depth but it’s important to have an understanding of what this theorem is as it will allow you to recognize some of the constraints and tradeoffs that you’re going to encounter when designing your cloud system.

So let’s break down those three guarantees.

Consistency

So the C in C-A-P stands for Consistency.  By consistency we mean that every read operation on the data store will return the most recent data. 

Imagine an airline application where a user checks in for their flight and changes their seat. In a consistent system, as soon as any other user requests the seat map details, even a few milliseconds later, they're going to see that seat as unavailable, they will see the most recent data. 

Now if you're familiar with ACID transactions in a relational database, there's some cross over here with this idea of consistency, though they're not exactly the same thing. But indeed, we might use database transactions to support the C in CAP theorem. 

Availability

The A in CAP stands for availability.  When we’re talking about availability within CAP theorem, we say that every request must receive a response.

Now we know we can never achieve perfect availability, but if we follow best practice then we can build availability into our system from the start and then we might hope to achieve a system where, for example, 99. 9% of the requests will receive a response.

Now with data stores in the cloud such as Azure SQL and Cosmos DB, high availability is built right in but the design of your architecture is still important.  What happens if a data center goes offline, is that data replicated to another region?

Partition Tolerance

The P in CAP stands for partition tolerance.  Now you see the word partition crop up a lot in computer science.  We’re not talking about a data partition here or a disk partition.  In CAP Theorem the partition is a network partition.  This is a partition between 2 parts of a system.  Something that splits the network and causes parts of the system to become unavailable.  In some ways this isn’t dissimilar to the idea of graceful degradation.  Expecting a system to remain responsive even when features we'd hope for aren't there. 

Imagine we have a system that’s composed of 4 components.  A,B C and D.

Let’s say that nodes A and B can communicate, and nodes C and D can communicate but there is a problem which is stopping A and B from talking to C and D.  Well, if we are partition tolerant then we will still allow A and B and C and D to continue to operate and do their work as normal.

While CAP theorem is specifically talking about network partitions, we’re not just talking about network level errors here.  We’re talking about any kind of failure that takes a component offline such as hardware failures or software maintenance.

Ok now what?

So now we know what the C A and P refer too but that’s only half of the story. 

You see, Eric Brewer, the computer scientist who came up with this, stated that a distributed shared data system can only have two of the three guarantees shown here.

 So the point of this C-A-P is that we're saying a system cannot be available and consistent and partition tolerant all of the time, and that in the design of a distributed system, you need to find an intersection of two of these three qualities, the one that you're most comfortable with. 

Now just to be clear, this doesn't mean we just give one of these features up, because if things are running successfully, we can continue to do this and have no issues. We’re more interested in what we do when there is a problem, what will we prioritize? 

Now technically one intersection between the two would be the one between consistency and availability, but this is an unlikely spot for cloud systems, because a lot of what we've been doing is introducing partition tolerance, we've been working with redundancy and queueing and caching and other techniques.

 So let's make things a little easier, because most typically with this CAP theorem and with cloud architectures, you will begin by assuming that partition tolerance is something that you want, and that if you need to pick two things, partition tolerance will be one of them, then you just need to prioritize either consistency or availability. 

So, if we decide that two of the three things that we're interested in, the intersection between partition tolerance and consistency is important to us, that says we want to maintain consistency and have partition tolerance, even if this means sacrificing availability. So, if there is a network partition, an issue with the network, then I am willing for the system, or at least parts of it, to stop being available and stop processing requests to make sure that the system stays in a consistent state. 

But then there's also another choice, the intersection of partition tolerance and availability, where we might want our system to be highly available and resilient to network issues, and we'd be willing to sacrifice consistency to get there. It's not that we give up consistency altogether. We can assume that most systems are eventually consistent. 

But as an example of this, if I update my profile on a social media site, maybe other users on other continents won't see my latest updates for seconds or minutes, or even hours. These systems are not immediately consistent. Now the system will probably be eventually consistent, and other users will eventually see my updates, but this might happen, for example, because we prioritize the system to be available and partition tolerant, and we're using geo-replication in our storage, but we decide not to use distributed transactions to make all updates instantly pushed out around the world, because distributed transactions can be expensive, and they could also make the system unavailable if there's any issues during a network partition.

So, with that kind of situation, we might have a database that will asynchronously replicate data across all the different regions, and eventually all that data will be consistent.

But we don't worry about up-to-the-second consistency as long as the different pieces of the system stay up and available. Now obviously the choice that you make here depends on your business. Does it need to prioritize consistency or does it prioritize availability. 

So, where an airline booking system might value consistency, a social media site might value availability. Now it's true this isn't always a simple choice, and it's also true it's not one choice for an entire system. You might have pieces of the system that are A-P, and other pieces that are C-P. If you have financial transactions in the system, you might need them to be consistent and partition tolerant, whereas the user profile part of the system could be available and partition tolerant. The CAP theorem just says that what you won't be able to do is build the entire system in a way that makes all three guarantees for availability and consistency and partition tolerant all of the time.