This week, the Stellar network experienced a ledger fork that is related to a failure of the underlying Ripple/Stellar consensus system. We are completing our review of the impact, but early reports indicate that the impact was not major. We are reaching out to all the known gateways and exchanges to see what we can do to assist.
Given how novel consensus systems are and our belief that we should be transparent about the technology, which includes its strengths and weaknesses, we wanted to provide some context for the ledger fork and how it happened in Stellar.
Issue 1: Sacrificing safety for liveness and fault tolerance (potential for double spends)
The Fischer-Lynch-Paterson impossibility result (FLP) states that a deterministic asynchronous consensus system can have at most two of the following three properties: safety (results are valid and identical at all nodes), guaranteed termination or liveness (nodes that don’t fail always produce a result), and fault tolerance (the system can survive the failure of one node at any point). This is a proven result. Any distributed consensus system on the Internet must sacrifice one of these properties.
The existing Ripple/Stellar consensus algorithm is implemented in a way that favors fault tolerance and termination over safety. This means it prioritizes ledger closes and availability over everyone actually agreeing on what the ledger is—thus opening up several potential risk scenarios. For instance, there may be situations in which nodes diverge on different ledgers and a person can’t be certain which ledger the network will ultimately decide to take. If the network switches to another ledger chain, then transactions from the discarded chain will be invalidated—this opens the network up to potential double-spend problems.
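To make the double-spend risk concrete, here is a minimal Python sketch of two diverging ledger chains. It is an illustration only, not the Ripple/Stellar implementation: the `Payment` type, the `apply_chain` helper, and the account names and balances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payment:
    sender: str
    receiver: str
    amount: int


def apply_chain(balances, chain):
    """Return the balances after applying each payment in order.

    Payments that would overdraw the sender are skipped. This is a toy
    validity rule assumed for the illustration.
    """
    result = dict(balances)
    for p in chain:
        if result.get(p.sender, 0) >= p.amount:
            result[p.sender] -= p.amount
            result[p.receiver] = result.get(p.receiver, 0) + p.amount
    return result


# Hypothetical state at the point where the two chains diverge.
start = {"alice": 100, "merchant": 0, "exchange": 0}

# The same 100 units are spent differently on each chain.
chain_a = [Payment("alice", "merchant", 100)]
chain_b = [Payment("alice", "exchange", 100)]

balances_a = apply_chain(start, chain_a)  # what the merchant sees on chain A
balances_b = apply_chain(start, chain_b)  # what survives if the network picks B
```

If the merchant delivers goods after seeing the payment on chain A and the network then settles on chain B, the merchant's balance on the surviving ledger is still zero while the same funds have gone to the exchange.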
Issue 2: Provable correctness
Prof. David Mazières, head of Stanford’s Secure Computing Group, reviewed the Ripple/Stellar consensus system and reached the conclusion that the existing algorithm was unlikely to be safe under all circumstances. Based on these findings, we decided to create a new consensus system with provable correctness. This effort, led by Prof. Mazières, is underway. His white paper and the accompanying code are expected to be released in a few months.
Prof. Mazières’s research indicated some risk that consensus could fail, though we were not certain whether the circumstances required for such a failure were realistic. This week, we discovered the first instance of a consensus failure. On Tuesday night, the nodes on the network began to disagree and caused a fork of the ledger. The majority of the network was on ledger chain A. At some point, the network decided to switch to ledger chain B. This caused the rollback of a few hours of transactions that had only been recorded on chain A. We were able to replay most of these rolled-back transactions on chain B to minimize the impact. However, in cases where an account had already sent a transaction on chain B, the replay wasn’t possible.
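The replay step can be sketched roughly as follows. This is a simplified model, not the procedure we actually ran: it assumes each account's transactions carry a strictly increasing sequence number, so a rolled-back transaction can only be replayed if its sequence number is still the next unused one for that account on chain B. The `Tx` type, the `replay` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    account: str
    sequence: int  # assumed: per-account, strictly increasing by 1
    memo: str


def replay(fork_sequences, surviving, discarded):
    """Split discarded transactions into those that can be replayed and those that cannot.

    fork_sequences: last used sequence number per account at the fork point.
    surviving:      transactions already recorded on the surviving chain (B).
    discarded:      transactions recorded only on the discarded chain (A).
    """
    current = dict(fork_sequences)
    for tx in surviving:
        current[tx.account] = tx.sequence

    replayed, rejected = [], []
    for tx in discarded:
        if tx.sequence == current.get(tx.account, 0) + 1:
            current[tx.account] = tx.sequence
            replayed.append(tx)
        else:
            # The account already consumed this sequence number on chain B.
            rejected.append(tx)
    return replayed, rejected


fork_sequences = {"alice": 5, "bob": 9}
chain_b = [Tx("bob", 10, "sent on B")]
chain_a_only = [
    Tx("alice", 6, "sent on A"),
    Tx("alice", 7, "sent on A"),
    Tx("bob", 10, "sent on A"),
]

replayed, rejected = replay(fork_sequences, chain_b, chain_a_only)
```

In this example both of alice's transactions replay cleanly because she was inactive on chain B, while bob's transaction is rejected because he had already sent a different transaction with the same sequence number on chain B.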
We are still investigating the triggers for this consensus failure, but we believe it was caused by the innate weaknesses of the Ripple/Stellar consensus system outlined above, compounded by the number of accounts in the network. Presently, we have approximately 140,000 active accounts a week and over 3 million total accounts, well in excess of the approximately 120,000 total accounts (active and inactive) that this stack has previously supported.
Our monitoring of the network has made it clear that the underlying Ripple/Stellar consensus system is not performing at this level of scale, which is still small relative to the global financial system. In order for such protocols to perform at real-world levels with the expected degree of safety, this number of accounts should not be a problem.
If you have questions about what this means for your implementation, please feel free to email us at [email protected]