Safety, liveness and fault tolerance—the consensus choices

This week, the Stellar network experienced a ledger fork that is related to a failure of the underlying Ripple/Stellar consensus system. We are completing our review of the impact, but early reports indicate that the impact was not major. We are reaching out to all the known gateways and exchanges to see what we can do to assist.

Because consensus systems are so novel, and because we believe we should be transparent about the technology, including its strengths and weaknesses, we want to provide some context for the ledger fork and how it happened in Stellar.

Issue 1: Sacrificing safety for liveness and fault tolerance, with potential for double spends

The Fischer, Lynch, and Paterson impossibility result (FLP) states that a deterministic asynchronous consensus system can have at most two of the following three properties: safety (results are valid and identical at all nodes), guaranteed termination or liveness (nodes that don’t fail always produce a result), and fault tolerance (the system can survive the failure of one node at any point). This is a proven result. Any distributed consensus system on the Internet must sacrifice one of these features.

The existing Ripple/Stellar consensus algorithm is implemented in a way that favors fault tolerance and termination over safety. This means it prioritizes ledger closes and availability over everyone actually agreeing on what the ledger is—thus opening up several potential risk scenarios. For instance, there may be situations in which nodes diverge on different ledgers and a person can’t be certain which ledger the network will ultimately decide to take. If the network switches to another ledger chain, then transactions from the discarded chain will be invalidated—this opens the network up to potential double-spend problems.
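To make the double-spend risk concrete, here is a minimal sketch in Python. It is a toy model, not the Ripple/Stellar ledger format: the `Payment` type, the `apply_chain` helper, and the account names are all hypothetical. It shows one account spending the same balance on two diverging chains. Each chain is valid on its own, but the two cannot both stand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payment:
    """A toy payment; real ledger transactions carry far more detail."""
    sender: str
    receiver: str
    amount: int


def apply_chain(balances, payments):
    """Apply payments in order to a copy of balances, rejecting overdrafts."""
    result = dict(balances)
    for p in payments:
        if result.get(p.sender, 0) < p.amount:
            raise ValueError(f"overdraft by {p.sender}")
        result[p.sender] -= p.amount
        result[p.receiver] = result.get(p.receiver, 0) + p.amount
    return result


# Hypothetical starting state shared by both forks.
genesis = {"alice": 100, "merchant_x": 0, "merchant_y": 0}

# After the fork, each chain records a different spend of the same funds.
chain_a = [Payment("alice", "merchant_x", 100)]
chain_b = [Payment("alice", "merchant_y", 100)]

state_a = apply_chain(genesis, chain_a)  # merchant_x is paid on chain A
state_b = apply_chain(genesis, chain_b)  # merchant_y is paid on chain B
```

If the network settles on chain B, the payment to `merchant_x` no longer exists, even though `merchant_x` saw it succeed on chain A. Applying both payments to a single chain fails with an overdraft, which is why the discarded chain's transaction cannot simply be carried over.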

Issue 2: Provable correctness

Prof. David Mazières, head of Stanford’s Secure Computing Group, reviewed the Ripple/Stellar consensus system and reached the conclusion that the existing algorithm was unlikely to be safe under all circumstances. Based on these findings, we decided to create a new consensus system with provable correctness. This effort, led by Prof. Mazières, is underway. His white paper and the accompanying code are expected to be released in a few months.

What happens when consensus is not reached: a fork in the ledger

Prof. Mazières’s research indicated some risk that consensus could fail, though we were not certain whether the circumstances required for such a failure were realistic. This week, we discovered the first instance of a consensus failure. On Tuesday night, the nodes on the network began to disagree and caused a fork of the ledger. The majority of the network was on ledger chain A. At some point, the network decided to switch to ledger chain B. This caused the rollback of a few hours of transactions that had only been recorded on chain A. We were able to replay most of these rolled-back transactions on chain B to minimize the impact. However, in cases where an account had already sent a transaction on chain B, the replay wasn’t possible.
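The replay rule described above can be sketched as follows. This is a simplified illustration in Python, not the tooling we actually used: the `replay` function and the dictionary fields `account` and `id` are hypothetical, and the only rule modeled is the one stated above, namely that a chain A transaction cannot be replayed if its account has already sent a transaction on chain B.

```python
def replay(chain_a_only, chain_b):
    """Split rolled-back chain A transactions into replayable and blocked.

    A transaction is blocked when its sending account already appears
    as a sender on chain B.
    """
    senders_on_b = {tx["account"] for tx in chain_b}
    replayed, blocked = [], []
    for tx in chain_a_only:
        if tx["account"] in senders_on_b:
            blocked.append(tx)
        else:
            replayed.append(tx)
    return replayed, blocked


# Hypothetical example data.
chain_a_only = [
    {"account": "alice", "id": "a1"},
    {"account": "bob", "id": "a2"},
    {"account": "carol", "id": "a3"},
]
chain_b = [{"account": "bob", "id": "b1"}]

replayed, blocked = replay(chain_a_only, chain_b)
# replayed holds a1 and a3; blocked holds a2, because bob
# had already sent a transaction on chain B.
```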

We are still investigating the triggers for this consensus failure, but we believe it was caused by the innate weaknesses of the Ripple/Stellar consensus system outlined above, compounded by the number of accounts in the network. Presently, we have approximately 140,000 active accounts a week and over 3 million total accounts, well in excess of the approximately 120,000 total accounts (active and inactive) this stack has previously supported.

Our monitoring of the network has made it clear that the underlying Ripple/Stellar consensus system is not performing adequately at this scale, which is still small relative to the global financial system. For such protocols to perform at real-world levels with the expected degree of safety, this number of accounts should not be a problem.

Future actions: steps to building a consensus algorithm that can withstand a meaningful level of activity

  • One validator node to ensure no ledger forks: This situation has led us to believe it is no longer safe to run the existing Ripple/Stellar consensus system with more than one validating node, because doing so would expose funds in the network to potential double spends and ledger forks. To ensure no ledger forks going forward in Stellar, we have decided to temporarily run only one validating node until the new consensus algorithm is live. Therefore, like the previous partial payments flag issue, this risk will no longer exist in Stellar.
  • Prioritization of development resources: Given this real-world occurrence of the consensus system’s previously theoretical risks, it is clear that we must prioritize the development of the new Stellar consensus algorithm and move away from the legacy consensus system to increase safety. The new Stellar consensus algorithm will not only be provably correct but will also prioritize safety and fault tolerance over guaranteed termination. We believe this is a better choice since it is preferable for the system to pause than to enter divergent and contradictory states. You can keep abreast of progress on this via our GitHub.
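The difference between the two priorities can be illustrated with a small sketch in Python. It is not the new Stellar consensus algorithm, nor the existing one: the function names, the vote format, and the two-thirds threshold are assumptions chosen only to show the trade-off. A liveness-first node always closes a ledger, while a safety-first node pauses when agreement is too weak.

```python
from collections import Counter


def close_liveness_first(votes):
    """Always close: pick the most common candidate ledger.

    Assumes at least one vote. Never pauses, so the network may
    proceed on a ledger that a large minority disagrees with.
    """
    return Counter(votes.values()).most_common(1)[0][0]


def close_safety_first(votes, total_validators):
    """Close only with agreement from more than two thirds of validators.

    Returns None (pause) otherwise. The two-thirds threshold is an
    assumption made for this illustration.
    """
    if not votes:
        return None
    ledger, count = Counter(votes.values()).most_common(1)[0]
    return ledger if 3 * count > 2 * total_validators else None


# Hypothetical round: five validators split 3 to 2 between ledgers A and B.
split_votes = {"n1": "A", "n2": "A", "n3": "B", "n4": "B", "n5": "A"}

liveness_choice = close_liveness_first(split_votes)   # closes on "A"
safety_choice = close_safety_first(split_votes, 5)    # None: pause
```

With a single validating node there is only one vote, so there is nothing to disagree with and no fork can arise, which is the reasoning behind the temporary measure in the first item above.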

If you have questions about what this means for your implementation, please feel free to email us at [email protected]
