Using Promise Theory to solve the distributed consensus problem--Secure tools for sharing granular data between micro clients.

Whenever we try to fix something in one corner of IT, we seem to yank another corner out of place and create new problems, trading one conundrum for another. Microservices are a perfect example of that: the denormalization of data and decentralization of processes make teams less interdependent (solving a human issue), but create challenges for managing shared state. Breaking up singular data stores into (not completely) independent parts can easily upset the integrity of algorithms, data, user experiences, and hence business continuity. These headaches are now more keenly felt as these matters become regulated by law (think GDPR, DORA, etc.). The most prominent issue is the so-called consistency of shared data, because it underpins so much and involves technical issues that engineers can sympathize with.

Dare we ask?

The problem of data consistency remains one of those issues that continue to attract attention from both academics and practitioners. It’s famously muddled together with the notion of distributed consensus. The English meanings are basically the same, but the technical meanings are used differently. Confusingly, the latter is used as part of the usual solution to the former in standard algorithms like Paxos and Raft. Lesser known but influential results, like the FLP theorem, “prove” the impossibility of distributed consensus in an asynchronous environment and scare people away from thinking carefully. There’s also the infamous CAP conjecture whose popularization (and apparently endless revisions) added fuel to the mysticism in a generally unhelpful way.

The IT industry isn’t always good at asking probing questions–we like to trust leaders and influencers who can think for us. But those thinkers sometimes leave the problem half solved, and we end up with standard solutions that random software engineers interpret for us. Dare we question them?

Over my tenure in Computer Science, I’ve tried to clarify and even debunk the occasional myth about what can and can’t be done by developing clear models, many of which are summarised in my Treatise on Systems, often with the help of the increasingly ubiquitous Promise Theory. The data consistency issue is no different, and it turns out that there is a simple answer we’ve been missing that András Gerlits and I have been talking about quite a lot recently in connection with his Omniledger calibrator software. Of course, there may be several ways of “solving” a problem, depending on the semantics we desire. Part of the confusion about consistency lies in deciding what we consider to be an equitable solution.

Let’s apply the smallest amount of Promise Theory to show how intentional consistency of data can be engineered at the same time as we scale the scope and availability of a service.

Consistency: one of the skis is parallel!

By consistency, we obviously don’t mean whipping data into a smooth spongy texture for a dish best served fast! Consistency in IT refers to the pure undelayed homogeneity of facts throughout a system: of data values, or key-value pairs, that span several computers.

Consistency manifests as a business reliability (perhaps even security) issue for most of us. It’s now related to regulatory compliance issues, particularly in the EU, as well as matters of privacy and user experience. The related terms consensus and quorum are more subtly used to refer to how one reaches agreement about disputed facts, but in practice these all mean that we want parts of a system to reach the goal of being aligned in their promises of data.

On a ski slope, skiers are taught to keep their skis in alignment when making turns. When parallel, the skis’ directions “agree” or are consistent with each other. One might say they have reached a consensus about their direction of travel. Those who are less stylish in their parallelism sometimes joke: “well, one of the skis was parallel!”. In IT, we don’t want our systems to be split down the middle by misalignments.

We can’t actually stop it from happening, but we can prevent ourselves from ever seeing it so that it can make no difference in practice. Just how stringently this is prevented accounts for several differences in the discussions about data consistency.

Seeing is believing

If consensus is about winning an argument, then consistency is a problem about calibration of state. Any user or observer has an equal “right” or capability to measure the answers given by different independent agents and compare them to see if they are equal. That single point of measurement calibrates the outcomes of any two agents (see diagram below).

A and B are supposed to be in alignment. A claims the value is a, B claims it’s b. It’s up to C to receive both values and decide whether a = b.

C observes A and B and, using this information, it is able to make a conditional promise based on that information to any other agent of interest (including the original A and B) about whether or not a=b according to its own measurements and to the best of its capabilities. C is thus a calibrator–an independent adjudicator of truth. This is how law courts work: a judge compares A and B to resolve differences or make a choice. Consensus means that A, B, and C all agree about their value for the promise. This is easy if there is an authoritative source for correct change. This is what we need to preserve.
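The calibration step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Agent`, `report`, and `calibrate` are hypothetical and do not come from Promise Theory literature or from any particular product. The point is that C's verdict is conditional on its own measurements, so it has three possible outcomes rather than two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    """A hypothetical autonomous agent that promises a value when asked."""
    name: str
    value: Optional[int]

    def report(self) -> Optional[int]:
        # None models a promise that was not kept: no answer reached C.
        return self.value


def calibrate(a: Agent, b: Agent) -> Optional[bool]:
    """C's conditional promise about whether a = b.

    Returns True or False when both values were received, and None
    when C cannot decide because a measurement is missing.
    """
    va, vb = a.report(), b.report()
    if va is None or vb is None:
        return None
    return va == vb
```

Here `calibrate(Agent("A", 42), Agent("B", 42))` is `True`, while an agent that fails to answer leaves C unable to promise anything either way.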

Agreeing about different versions of a particular value is relatively easy as long as the values of a and b never change. But in a dynamically changing environment, like bumping over snow moguls, keeping a and b skis aligned depends on a race to change each. What if a changes while no one is looking and C hasn’t measured a in a while, so it still thinks that a=b, but A knows better? How can we know this? Clearly, we can’t because we already lost that race, but we need to ask whether this matters to anything that can happen. If a skier falls in a forest and no one is looking…should we care?
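The lost race can be made concrete with a small sketch. The `Source` and `Observer` classes below are hypothetical names chosen for illustration. The observer only knows what it last sampled, so its belief about a = b stays unchanged until it looks again.

```python
class Source:
    """A hypothetical agent holding a value that may change at any time."""

    def __init__(self, value):
        self.value = value

    def change(self, value):
        self.value = value


class Observer:
    """C: remembers only the values it sampled on its last look."""

    def __init__(self):
        self.seen = {}

    def measure(self, name, source):
        # Sampling is the only way new information reaches the observer.
        self.seen[name] = source.value

    def believes_equal(self, x, y):
        return self.seen[x] == self.seen[y]


a, b, c = Source(1), Source(1), Observer()
c.measure("a", a)
c.measure("b", b)
a.change(2)                                # a changes while C is not looking
stale_belief = c.believes_equal("a", "b")  # still based on the old sample
c.measure("a", a)                          # C opens its eyes again
fresh_belief = c.believes_equal("a", "b")
```

In this run `stale_belief` is `True` and `fresh_belief` is `False`: the misalignment existed in between, but nothing C promised to anyone depended on it until C measured again.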

Central services!?

A small but growing number of voices is challenging the authority of the FLP and CAP results, pointing out that their implications have been misunderstood. The impossibility claims for distributed consensus stem from an incorrect assumption about the universality of change. The standard assumption is that it has to apply to everyone “at the same time”. But what does “at the same time” mean? If you’ve ever watched a thunderstorm, you know that sensory data (what we see and what we hear) arrive at very different times even though the strike happens at a single moment and at a single location.

Availability (readiness to listen) and consensus (agreement) are not global constraints that imply rigorous temporal precision; they are only constraints on the local behaviour of observers and influencers (users). Promise Theory is about autonomous agents, which implies the causal independence of agents: changes promised by one agent cannot influence another without an explicit acceptance by the promisee. So Promise Theory can help us to clarify where incorrect assumptions about causality go wrong.
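A minimal sketch of that causal independence, assuming a hypothetical `AutonomousAgent` class (not an API from any Promise Theory library): an offered change has no effect on the receiving agent unless that agent has explicitly promised to accept data from the sender.

```python
class AutonomousAgent:
    """A hypothetical agent that controls what may influence its own state."""

    def __init__(self, name):
        self.name = name
        self.accepted = set()   # promisers this agent has agreed to listen to
        self.state = {}

    def accept_from(self, promiser):
        # The explicit acceptance promise, made by the receiver alone.
        self.accepted.add(promiser)

    def offer(self, promiser, key, value):
        """Return True if the offered change was taken up, False otherwise."""
        if promiser not in self.accepted:
            return False        # no acceptance, so no causal influence
        self.state[key] = value
        return True


b = AutonomousAgent("B")
ignored = b.offer("A", "x", 1)   # B has not accepted anything from A yet
b.accept_from("A")
taken = b.offer("A", "x", 1)
```

The first offer leaves `b.state` empty; only after `accept_from("A")` does the same offer change B's state.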

Alignment of data ultimately boils down to how different agents in an information system signal and observe change to one another. After all, it’s this change that measures the passage of time across the different partial processes. Observation is a crucial element, because we don’t notice changes (the passage of time) while we have our eyes closed. When we open them again, that’s when the change reaches us in a single tick. So everyone receives the information at their own behest: at their own pace, not when the lightning actually strikes.

If we apply this to think about how fragments in a system align their changing facts, the solution for maintaining alignment across multiple locations boils down to making sure we preserve the historical order of changes from the original source. If we smash a plate with a picture painted on it, we can reconstruct the picture later so that it’s consistent with the original, as long as we put the pieces back in the correct spacetime order.
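One way to picture order preservation is a replica that buffers changes arriving out of order and applies them strictly in the order the source issued them. This is a sketch under assumptions: the sequence numbers, the `Replica` class, and its `receive` method are illustrative inventions, not the mechanism of any specific product.

```python
class Replica:
    """A hypothetical copy that applies changes only in source order."""

    def __init__(self):
        self.state = {}
        self.next_seq = 0    # the next change this replica is allowed to apply
        self.pending = {}    # changes that arrived too early

    def receive(self, seq, key, value):
        # Hold every arrival, then apply as many as are now in order.
        self.pending[seq] = (key, value)
        while self.next_seq in self.pending:
            k, v = self.pending.pop(self.next_seq)
            self.state[k] = v
            self.next_seq += 1


replica = Replica()
replica.receive(2, "y", 5)   # arrives first, but must wait
replica.receive(1, "x", 2)   # also early
early_view = dict(replica.state)
replica.receive(0, "x", 1)   # the missing piece releases the rest
final_view = dict(replica.state)
```

Until change 0 arrives, `early_view` is empty, so an observer of this replica never sees a state the source did not pass through. Afterwards `final_view` holds `x = 2` and `y = 5`, matching the source's history.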

An obvious and simple way to assure consistency is to have just one answer to compare to. A single copy cannot be inconsistent, so we use singular sources as sources for truth and arbitration. For perfect parallel turns, use a snowboard!

The industry standard is to force a single “master” database and replicate it for redundancy. That way, you don’t have to worry about the “master” being consistent. We don’t look at the copies too often, so that’s one way of avoiding trouble. Centralization to a single master is presented as a control decision to assure an authoritative source, i.e. a single source of control. But when the master fails, we have to worry about whether any of the backups are consistent with the master. Since we’ve now lost the original, the meaning of consistent is now ambiguous. Aligned with what? Which of the skis is parallel?

Many also point out that a central service is also a possible bottleneck (depending on its relative capacity), but the true importance of centralisation in system design is really (you guessed it) for calibration–to act as a standard measuring stick for data. It turns out we can solve both the consistency problem and the bottleneck problem alongside the partial sharing (microservice) problem all at the same time by introducing a data calibrator, and just getting the plumbing right to preserve causal order across a system. It’s a bit like a shared clock, but in which the flow of data is the clock itself! In this way, all concerns about data or network partitions, unreliable connections etc, can be resolved for each individual client locally for best effort. To make everything happen “at the same time” we can simply stop time for those who wait without a central service!