On Saturday, September 20th, several of the Stellar validating nodes started failing. This eventually prevented the network from reaching consensus on ledgers, so all transactions came to a halt. Machines and the network came back about 11 hours later.
From looking at the nodes and the Zabbix historical stats, it is clear that most of the instances were running low on available RAM; as a result, the Linux OOM ("Out Of Memory") killer was terminating processes on the machines in a bid to survive memory exhaustion.
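For anyone wanting to check their own nodes for this failure mode: the OOM killer logs each kill to the kernel ring buffer, so grepping `dmesg` (or `journalctl -k` on systemd hosts) will show which processes were killed. A minimal sketch follows; it greps a sample log line so it runs anywhere, but on a live node you would pipe the kernel log in instead.

```shell
#!/bin/sh
# Sketch: find OOM-killer victims in kernel log output.
# On a real node, replace the echo with:  dmesg | grep -Eo ...
# The sample line below is illustrative, not taken from the actual incident.
sample_log='Out of memory: Killed process 1234 (stellard) total-vm:8123456kB'
echo "$sample_log" | grep -Eo 'Killed process [0-9]+ \([^)]+\)'
# → Killed process 1234 (stellard)
```

Seeing `stellard` (or another critical pid) in that output confirms the node lost its validator to memory pressure rather than to a crash in the daemon itself.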
Below are the main points outlining the outage, which lasted approximately 16 hours, from 20/09/2014 ~02:00 UTC until 20/09/2014 ~18:00 UTC.
Judging from the graphs, we can tell that some servers died and others struggled during the outage, although even the nodes that survived reported errors with peers/ledgers/ledger age.
During this time, we did not communicate adequately with the community. We take full responsibility for the slow response, but want to let the community know why we were not able to respond immediately in this particular instance: At the time, the majority of us were at a company off-site working on designing a big refactor/redesign of stellard (ironically, to fix the issues that caused this network outage). The servers started running out of RAM overnight. In the morning, the internet at our off-site location went out (along with two backup internet connections we had provisioned). We moved to a different location and managed to stabilize the network. However, our internet continued to have issues, and during that time the Stellar cluster appears to have continued running out of RAM as well. The situation stabilized a few hours later.
A single root cause has not been identified, but contributing factors include:
We again apologize for the outage and have begun work on preventative measures to avoid a recurrence. If you would like to suggest any other preventative measures, we want to hear them. Please send them over to [email protected]—thank you.