This document lists known issues and potential recovery steps.
Horizon will give an error similar to:
Gap detected in stellar-core database. Please recreate Horizon DB.
Horizon and stellar-core run independently of each other.
stellar-core produces metadata that is then imported by Horizon.
Gaps occur when, for some reason, Horizon cannot find data for a ledger that it has not imported yet.
This can happen for a few reasons.
stellar-core uses “cursors” to ensure that garbage collection doesn’t delete data needed by consumers.
If cursors have not been configured, stellar-core’s garbage collection can run before Horizon has had a chance to set a cursor, and delete data that Horizon expects.
The solution is to properly define cursors in stellar-core’s configuration:
KNOWN_CURSORS=["HORIZON"]
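As a quick sanity check, the sketch below looks for the HORIZON entry in a configuration fragment. It is not a real config parser: it assumes every setting sits on one line as KEY=value and that the value happens to be valid JSON (as in the line above). The helper name `has_horizon_cursor` is hypothetical.

```python
import json


def has_horizon_cursor(config_text, cursor="HORIZON"):
    """Return True if KNOWN_CURSORS lists the given cursor.

    Assumption: one KEY=value setting per line, with a JSON-compatible
    value. This is a simplification of the real configuration format.
    """
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "KNOWN_CURSORS":
            return cursor in json.loads(value)
    return False  # KNOWN_CURSORS is not defined at all


if __name__ == "__main__":
    print(has_horizon_cursor('KNOWN_CURSORS=["HORIZON"]'))
```

A missing KNOWN_CURSORS line and a list without HORIZON are both reported as False, since either one leaves Horizon's data unprotected from garbage collection.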
When running stellar-core with partial history (CATCHUP_COMPLETE=false), stellar-core’s policy is to keep up with the network with a contiguous tail of at least CATCHUP_RECENT ledgers.
As a consequence, if stellar-core is taken offline for any reason, its goal when it is powered back on is to catch up to the network’s current ledger N as quickly as possible, by replaying a range that includes all ledgers from N-CATCHUP_RECENT up to N.
If stellar-core is offline for longer than roughly CATCHUP_RECENT * 5 seconds (5 seconds being the average time between ledgers), it may not replay certain ledgers.
Horizon, on the other hand, expects to replay all ledgers past its initial ledger.
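The arithmetic behind that threshold can be sketched as follows. This is only an illustration of the rule of thumb above: the 5-second figure is the average quoted in the text, and the function name is hypothetical.

```python
LEDGER_INTERVAL_SECONDS = 5  # average time between ledgers, as quoted above


def tail_duration_seconds(catchup_recent, ledger_interval=LEDGER_INTERVAL_SECONDS):
    """Approximate wall-clock time covered by a tail of CATCHUP_RECENT ledgers."""
    return catchup_recent * ledger_interval


if __name__ == "__main__":
    seconds = tail_duration_seconds(1024)
    print(f"{seconds} s, about {seconds / 60:.0f} minutes")
    # prints "5120 s, about 85 minutes"
```

An outage longer than this duration is what puts Horizon at risk of seeing a gap.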
Here is an example to illustrate all this.
Assume Horizon started to ingest at ledger 10,000; it therefore expects stellar-core to emit data for all ledgers past 10,000.
stellar-core is configured with CATCHUP_RECENT=1024 (roughly 85 minutes of tail ledgers).
Everything runs fine until one day, at ledger 500,000, stellar-core is taken down for a day. At this point we have:
Later, stellar-core is taken back online. The network happens to be at ledger 520,000 (that’s ~27 hours later), which causes stellar-core to
Horizon was at 500,000 but now sees ledger 518,976 … where is 500,001?
It panics with:
Gap detected in stellar-core database. Please recreate Horizon DB.
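The example above can be reproduced with a small model. It is a simplified sketch that takes the N-CATCHUP_RECENT rule literally and ignores any other detail of how stellar-core picks its actual starting point. Both function names are hypothetical.

```python
def first_replayed_ledger(network_ledger, catchup_recent):
    """First ledger replayed on catchup, taking N - CATCHUP_RECENT literally."""
    return max(1, network_ledger - catchup_recent)


def missing_range(last_ingested, network_ledger, catchup_recent):
    """Return (first_missing, last_missing) as seen by Horizon, or None.

    last_ingested is the last ledger Horizon imported before the outage.
    """
    start = first_replayed_ledger(network_ledger, catchup_recent)
    if start <= last_ingested + 1:
        return None  # the replayed range still connects to what Horizon has
    return (last_ingested + 1, start - 1)


if __name__ == "__main__":
    print(first_replayed_ledger(520_000, 1024))   # 518976
    print(missing_range(500_000, 520_000, 1024))  # (500001, 518975)
    print(missing_range(500_000, 500_500, 1024))  # None: short outage, no gap
```

With the numbers from the example, ledgers 500,001 through 518,975 are never emitted, which is exactly the gap Horizon reports.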
Some bugs can cause Horizon to get confused during ingestion.
In many cases resetting your instance is the simplest way to recover: the data will be reconstructed from history.
This is a good occasion to double-check that your stellar-core and Horizon configurations are consistent with each other.
Pros:
Cons:
Note: if stellar-core is configured with CATCHUP_COMPLETE=true, you can either switch your node to partial history (as it was already ingested by Horizon) or reset everything.
MY_LEDGER_NUMBER) that Horizon ingested
MY_LEDGER_NUMBER, subtract the two to get the gap between your instance and SDF’s
CATCHUP_COMPLETE=false
CATCHUP_RECENT=50000
KNOWN_CURSORS=["HORIZON"]
In this case, you have a stellar-core instance that works properly (it has the ability to close ledgers without error), but you realize that for some reason its history is incomplete, in the sense that the oldest ledger available for Horizon to import is not old enough.
Many of the causes of this issue are the same as those that lead to gaps being detected.
Follow the steps in mismatched core/Horizon configurations
In particular, if the data is missing because of garbage collection, make sure that it cannot happen again in the future!
Important, before attempting any recovery process:
The easiest way to recover is to reset everything.
This will take some time for the running processes, but it doesn’t require manual intervention.
The general idea behind this method is to construct the missing data on a separate stellar-core instance, followed by an “import” into the instance you are trying to recover.
Note that Horizon does not support backfilling: it will have to reingest all data from stellar-core.
Set the HORIZON cursor to 0 (which will stop garbage collection from continuing):
stellar-core --conf X.conf 'setcursor?id=HORIZON&cursor=0'
SELECT ledgerseq FROM ledgerheaders ORDER BY ledgerseq ASC LIMIT 1;
This will return the smallest ledger that your stellar-core instance currently knows about.
We’ll call this value ORIGINAL_LOW_LEDGER from now on.
We’ll call LOW_LEDGER the value that you want in history.
The steps here reconstruct history for the range LOW_LEDGER .. ORIGINAL_LOW_LEDGER.
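To make the two values concrete, here is a minimal sketch that runs the lookup query against a toy in-memory SQLite table and derives the range to reconstruct. The one-column table is an assumed stand-in for the real stellar-core schema, and the helper names and ledger numbers are made up for illustration.

```python
import sqlite3


def lowest_ledger(conn):
    """Smallest ledger sequence in ledgerheaders, or None if the table is empty."""
    row = conn.execute(
        "SELECT ledgerseq FROM ledgerheaders ORDER BY ledgerseq ASC LIMIT 1"
    ).fetchone()
    return None if row is None else row[0]


def ledgers_to_reconstruct(low_ledger, original_low_ledger):
    """Ledgers missing below the current history: LOW_LEDGER up to, but not
    including, ORIGINAL_LOW_LEDGER (which the instance already has)."""
    return range(low_ledger, original_low_ledger)


def demo_connection(ledger_numbers):
    """Toy stand-in for the stellar-core database (assumption: one column only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ledgerheaders (ledgerseq INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO ledgerheaders VALUES (?)", [(n,) for n in ledger_numbers]
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection(range(70_000, 70_010))
    original_low = lowest_ledger(conn)                      # ORIGINAL_LOW_LEDGER
    missing = ledgers_to_reconstruct(64_000, original_low)  # LOW_LEDGER = 64000
    print(original_low, len(missing), missing[0], missing[-1])
```

An empty range means LOW_LEDGER is already covered and there is nothing to reconstruct.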
newdb
stellar-core --conf X.conf --catchup-at LOW_LEDGER --catchup-to ORIGINAL_LOW_LEDGER
If this succeeds, you can proceed to the next step: merging.
Repeat the following steps for each of the following SQL tables:
ledgerheaders (ledgerseq)
txhistory (ledgerseq)
txfeehistory (ledgerseq)
Not imported from history as of this writing:
scphistory
scpquorums
We’ll use ledgerheaders here to illustrate.
ledgerheaders (save into a file ledgerHeaders.sql):
SELECT * FROM ledgerheaders WHERE ledgerseq >= LOW_LEDGER AND ledgerseq < ORIGINAL_LOW_LEDGER;
Check that the hash of ledger ORIGINAL_LOW_LEDGER - 1 is the one stored in ledger ORIGINAL_LOW_LEDGER.
SELECT ledgerhash FROM ledgerheaders WHERE ledgerseq = ORIGINAL_LOW_LEDGER-1;
SELECT prevhash FROM ledgerheaders WHERE ledgerseq = ORIGINAL_LOW_LEDGER;
Verify that this value is the same as the ledgerhash returned in the previous step.
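The two lookups and the comparison can be combined into one check. The sketch below runs against a toy in-memory SQLite table that has only the three columns involved. The helper names and the placeholder hash values are illustrative. The `?` parameter style is the one used by Python's sqlite3 module; other database drivers use a different placeholder.

```python
import sqlite3


def hashes_link(conn, original_low_ledger):
    """True if prevhash of ORIGINAL_LOW_LEDGER equals the ledgerhash of the
    ledger just before it."""
    prev_row = conn.execute(
        "SELECT ledgerhash FROM ledgerheaders WHERE ledgerseq = ?",
        (original_low_ledger - 1,),
    ).fetchone()
    cur_row = conn.execute(
        "SELECT prevhash FROM ledgerheaders WHERE ledgerseq = ?",
        (original_low_ledger,),
    ).fetchone()
    if prev_row is None or cur_row is None:
        return False  # one of the two ledgers is not in the table
    return prev_row[0] == cur_row[0]


def demo_connection(rows):
    """Toy table with only the three columns used here (assumed simplification)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ledgerheaders (ledgerseq INTEGER, ledgerhash TEXT, prevhash TEXT)"
    )
    conn.executemany("INSERT INTO ledgerheaders VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = demo_connection([(99, "aa", "zz"), (100, "bb", "aa")])
    print(hashes_link(conn, 100))
```

A False result means the imported rows do not chain onto the existing history and should not be trusted.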
SELECT lh1.ledgerseq FROM ledgerheaders AS lh1 WHERE lh1.ledgerseq NOT IN ( SELECT lh2.ledgerseq-1 FROM ledgerheaders AS lh2 WHERE lh2.ledgerseq = lh1.ledgerseq + 1);
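The query above returns every ledger that has no immediate successor in the table, that is, the last ledger of each contiguous run. A gap-free table therefore yields a single row (the highest ledger), and any additional row marks where a gap begins. The sketch below demonstrates this on a toy in-memory SQLite table. The one-column schema and the helper name are assumptions, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

GAP_QUERY = """
SELECT lh1.ledgerseq FROM ledgerheaders AS lh1
WHERE lh1.ledgerseq NOT IN (
    SELECT lh2.ledgerseq - 1 FROM ledgerheaders AS lh2
    WHERE lh2.ledgerseq = lh1.ledgerseq + 1)
ORDER BY lh1.ledgerseq
"""


def run_ends(ledger_numbers):
    """Load the ledger numbers into a toy table and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ledgerheaders (ledgerseq INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO ledgerheaders VALUES (?)", [(n,) for n in ledger_numbers]
    )
    rows = conn.execute(GAP_QUERY).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(run_ends([1, 2, 3, 4]))     # [4]: contiguous, only the last ledger
    print(run_ends([1, 2, 3, 5, 6]))  # [3, 6]: ledger 4 is missing after 3
```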
When this is completed, you can start stellar-core and wait for it to catch up to the network.
Finally, you can reset Horizon and have it ingest all data from stellar-core.