At 1:14pm Pacific time, May 15th, the Stellar network halted for 67 minutes due to an inability to reach consensus. During that time no ledgers were closed and no transactions were processed — basically, Stellar stopped.
However, the ledger state remained safe and consistent across the network. Stellar has roughly 150,000 users every day and over 3 million total accounts. No one lost their money; no one’s balances were confused by a fork. At 2:21, ledgers began closing where they left off, and the network is healthy this morning.
Needless to say, an outage like this is highly undesirable, and it uncovered a few improvements we need to make. Here are the main takeaways, which we expand upon below.
As a fundamental design choice, Stellar prefers consistency and partition resilience over liveness. In other words, when faced with consensus uncertainty, the Stellar Consensus Protocol (SCP) prefers to halt rather than operate in a potentially inconsistent state. This is different from other blockchains, in which “the chain must go on” even at the price of soft forks.
Financial institutions prefer downtime to inconsistent data; that’s why they choose Stellar. It’s much better for a financial network to go offline temporarily than to produce permanent false or disputed results.
Still, with the right tooling, Stellar shouldn’t need to halt. Here’s how we will mitigate future risk:
Even before this halt, we’d been working on improving the reporting capabilities of Stellar-core. Stellar-core 11.1.0RC already contains a command for getting a full transitive quorum set report. Other monitoring commands will be prioritized.
Stellar’s Increasing Decentralization

For the past few months the Stellar community has been hard at work setting up new validators and building diverse quorum sets, so Stellar works without SDF’s direct involvement. You can read more about this effort in SatoshiPay’s recent post.
Many of these new nodes are still working toward the standard of availability that the network expects. In the past few weeks we saw, repeatedly, misconfigured or down validators hampering consensus. This led to flaky liveness status in which an additional failure or two at the wrong time could bring the whole network to a halt. And that’s exactly what happened yesterday: Keybase took down their validator for maintenance at a time when other validators were shaky or down, and Stellar stopped.
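To make the "one or two more failures" point concrete, here is a minimal sketch in Python of how a threshold-based quorum configuration loses liveness. It is a simplified model, not stellar-core's implementation: the validator names `A` through `E`, the flat (non-nested) quorum sets, and the threshold of 4 are all hypothetical and do not describe the real network configuration on May 15.

```python
# Simplified model: each validator maps to (trusted validators, threshold).
# Names and numbers are hypothetical, chosen only for illustration.
QSETS = {
    name: (["A", "B", "C", "D", "E"], 4)
    for name in ["A", "B", "C", "D", "E"]
}


def max_quorum(qsets, up):
    """Return the largest quorum made only of validators in `up`.

    A set of validators is a quorum if every member sees at least
    `threshold` of its trusted validators inside the set. We start from
    all reachable validators and repeatedly drop any whose threshold is
    not met, until nothing changes. An empty result means no quorum can
    form, so no ledger can close.
    """
    alive = set(up)
    changed = True
    while changed:
        changed = False
        for node in sorted(alive):  # sorted() copies, so we can mutate `alive`
            members, threshold = qsets[node]
            if len(alive & set(members)) < threshold:
                alive.discard(node)
                changed = True
    return alive


# One validator down: the remaining four still satisfy each other.
one_down = max_quorum(QSETS, {"A", "B", "C", "D"})

# A second validator goes down for maintenance: nobody meets the
# threshold of 4 any more, and the network halts.
two_down = max_quorum(QSETS, {"A", "B", "C"})
```

In this toy configuration the network tolerates exactly one missing validator, so a single planned maintenance window is enough to stop it if another validator is already down.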
Here’s how we’ll keep this from happening again:
In response to yesterday’s halt, key validators on the network coordinated a configuration change in which quorum sets were reduced to include only highly available validators. Within an hour, the network was alive, processing transactions and closing ledgers.
UPDATE: The outage on May 15 left the Stellar network in a fragile state, with only 4 parties as the core validators. Later, we experienced a brief additional problem as the quorum sets of two parties no longer had sufficient overlap with the other two. The network was taken down briefly while we repaired this.
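The overlap problem described in this update can be illustrated with a brute-force quorum intersection check. The sketch below is in Python and uses a simplified model with flat quorum sets. The party names `P1` to `P4` and both configurations are hypothetical, not the actual settings used by the four validators, and enumerating every subset is only practical for a handful of nodes.

```python
from itertools import combinations


def is_quorum(qsets, nodes):
    """True if every member of `nodes` meets its threshold inside `nodes`."""
    nodes = set(nodes)
    return bool(nodes) and all(
        len(nodes & set(qsets[n][0])) >= qsets[n][1] for n in nodes
    )


def disjoint_quorums(qsets):
    """Return a pair of quorums with no validator in common, or None.

    Two disjoint quorums could close different ledgers independently,
    which is the inconsistency the network is designed to avoid.
    """
    names = sorted(qsets)
    quorums = [
        set(combo)
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
        if is_quorum(qsets, combo)
    ]
    for q1, q2 in combinations(quorums, 2):
        if not (q1 & q2):
            return q1, q2
    return None


# Hypothetical fragile setup: two pairs that only trust each other.
SPLIT = {
    "P1": (["P1", "P2"], 2),
    "P2": (["P1", "P2"], 2),
    "P3": (["P3", "P4"], 2),
    "P4": (["P3", "P4"], 2),
}

# Hypothetical repaired setup: everyone trusts all four, threshold 3.
REPAIRED = {
    name: (["P1", "P2", "P3", "P4"], 3)
    for name in ["P1", "P2", "P3", "P4"]
}

split_result = disjoint_quorums(SPLIT)
repaired_result = disjoint_quorums(REPAIRED)
```

In the split configuration the check finds `{P1, P2}` and `{P3, P4}` as quorums that share no validator. In the repaired one, any two quorums have at least three members out of four, so they always overlap.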