On Saturday, September 20th, several of the Stellar validating nodes started failing. This eventually left the network unable to reach consensus on ledgers, so all transactions came to a halt. Machines and the network came back about 11 hours later.
From looking at the nodes and Zabbix historical stats, it is clear that most of the instances were running low on available RAM. As a result, the Linux OOM (“Out Of Memory”) killer was killing processes on the machines in a bid to survive memory exhaustion.
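To illustrate the kind of low-memory check that could flag this condition before the OOM killer steps in, here is a minimal sketch that parses `/proc/meminfo`-style text. The function names and the 10% threshold are assumptions made for this example; this is not the Zabbix configuration we run.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Return the fraction of total RAM that is still available."""
    total = meminfo["MemTotal"]
    # MemAvailable is only reported by newer kernels; otherwise fall back
    # to a rough estimate from free memory plus buffers and page cache.
    available = meminfo.get("MemAvailable")
    if available is None:
        available = (
            meminfo.get("MemFree", 0)
            + meminfo.get("Buffers", 0)
            + meminfo.get("Cached", 0)
        )
    return available / total


def is_low_memory(text, threshold=0.10):
    """True when available RAM is below the (hypothetical) threshold."""
    return available_fraction(parse_meminfo(text)) < threshold


if __name__ == "__main__":
    # On a Linux host this reads the live values.
    with open("/proc/meminfo") as handle:
        print("low memory:", is_low_memory(handle.read()))
```

A check like this only reports that memory is running out; it does not say which process is responsible.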
Below are the main points outlining the outage, which lasted approximately 16 hours, from 20/09/2014 ~02:00 UTC until 20/09/2014 ~18:00 UTC:
- server_info : first ledger stopped being reported at 02:00 UTC
- server_info : ledger age grew until 18:00 UTC
- During the outage most nodes lost connection to the network (no ledgers, no peers)
- Visible increase in disk reads on all volumes (db, rocksdb, rocksdb-cache).
- Visible increase in outbound network traffic peaks.
- All failed servers crashed due to memory exhaustion.
Judging from the graphs, some servers died and others struggled during the outage. Even the nodes that survived reported errors with peers, ledgers, and ledger age.
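The symptoms listed above (no peers, no ledgers, growing ledger age) can be turned into a simple per-node health classification. The sketch below is illustrative only: the `peers` and `ledger_age_seconds` field names and the 20-second staleness limit are hypothetical and do not reflect the actual `server_info` response format.

```python
def classify_node(sample, max_ledger_age=20):
    """Classify one monitoring sample for a node.

    `sample` is a dict with hypothetical keys:
      - "peers": number of connected peers
      - "ledger_age_seconds": age of the latest ledger, or None if no
        ledger is being reported
    """
    peers = sample.get("peers", 0)
    age = sample.get("ledger_age_seconds")
    if peers == 0 or age is None:
        return "disconnected"  # no peers or no ledgers at all
    if age > max_ledger_age:
        return "stale"  # connected, but the ledger is falling behind
    return "healthy"


def summarize(samples):
    """Count how many nodes fall into each state."""
    counts = {"healthy": 0, "stale": 0, "disconnected": 0}
    for sample in samples:
        counts[classify_node(sample)] += 1
    return counts


if __name__ == "__main__":
    cluster = [
        {"peers": 8, "ledger_age_seconds": 4},
        {"peers": 3, "ledger_age_seconds": 600},
        {"peers": 0, "ledger_age_seconds": None},
    ]
    print(summarize(cluster))
```

Separating "stale" from "disconnected" mirrors the distinction between nodes that struggled and nodes that dropped off the network entirely.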
During this time, there was not adequate communication with the community. We take full responsibility for the slow response, but want to let the community know why we were not able to respond immediately in this particular instance. At the time, the majority of us were at a company off-site working on designing a big refactor/redesign of stellard (ironically, to fix the issues that caused this network outage). The servers started running out of RAM overnight. In the morning, the internet at our off-site location went out (along with two backup internet connections we had provisioned). We moved to a different location and managed to stabilize the network. However, our internet continued to have issues. During that time, it looks like the Stellar cluster continued to run out of RAM as well. The situation stabilized a few hours later.
Remedial Steps Immediately Taken:
- Rebooted all the failed nodes and restarted stellard on some of the other nodes.
- Downsized the stellard node_size to medium on all servers.
- Network came back online after we restarted a few more stellard instances.
A single root cause is unknown, but contributing factors include:
- Legacy code base having scalability issues. Stellar has the largest live user base on this code base and is testing the limits of this technology in real time. This outage has made us keenly aware of the scaling limitations of the current system, which we are presently working on.
- Nodestore is leaking or using too much RAM.
Preventative Measures:
- Completed: Rebuild validators with more RAM
- Completed: Add additional team members to community channels so they can report updates from the Foundation in real time.
- Completed: Reduce the monitoring alert output so we don’t miss legitimate issues.
- Investigate alternative node store backends/config parameters.
- Confirm whether the application is leaking memory or not.
- Create a “status.stellar.org” page which can be easily updated with outage/progress reports.
- Continued prioritized hiring to expand the devops team.
- Continued rewrite of stellard to address the scalability issues.
- Continued focus on expanding the diversity and number of entities running validators (decentralization is an important goal to ensure the robustness of the network).
We again apologize for the outage and have begun work on the preventative measures to keep this from occurring again. If you would like to suggest any other preventative measures, we want to hear them. Please send them over to [email protected]. Thank you.