Accelerating Analytics on Stellar

Isaiah Turner

At the Stellar Development Foundation, we are committed to being as open and transparent as possible. In May, we released a public analytics dataset that was a huge step toward making network activity globally visible and easily queryable.

Today, we’re excited to announce another step forward — we've released a new version of our analytics dataset that's updated every 5 minutes. We also open-sourced all of our tooling, allowing you to build and deploy your own Stellar data pipelines!
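
Because the dataset lives in BigQuery, querying it takes only a few lines of Python with the standard BigQuery client library. Here's a minimal sketch; the project, dataset, table, and column names below are placeholders, so substitute the ones from the public dataset you actually want to query:

from google.cloud import bigquery  # pip install google-cloud-bigquery

# The project, dataset, table, and column names are placeholders;
# replace them with the actual public dataset you want to query.
client = bigquery.Client()

query = """
    SELECT type, COUNT(*) AS num_operations
    FROM `your-project.your_stellar_dataset.operations`
    GROUP BY type
    ORDER BY num_operations DESC
"""

for row in client.query(query).result():
    print(f"operation type {row.type}: {row.num_operations}")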

Why do we need a new data system?

While the first data pipeline updated daily, a faster update rate enables more interesting use cases, like near-real-time decision making. We also wanted a more robust data pipeline with crucial operational features like automatic retries on failure and detailed logging. Building this near-real-time, highly observable pipeline required several core pieces of software.

Stellar-ETL

First, there's the stellar-etl, a standalone command-line interface written in Go that extracts data from our history archives and from stellar-core. ETL stands for extract, transform, and load: the stellar-etl extracts data from the history archives using the new Horizon ingestion system, transforms it into JSON, and loads the data into a text file.

With the stellar-etl, you can run a simple command like:

stellar-etl export_operations -s 1000 -e 2000

and get all the operations from ledgers 1000 through 2000.
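
The exported file is plain text containing JSON records. Assuming it's newline-delimited (one record per line), a short script is enough to post-process it; the filename and the type_string field name are assumptions too, so adjust them to match what your stellar-etl run actually produces:

import json
from collections import Counter

# Tally operation types from a stellar-etl export. Assumes the export is
# newline-delimited JSON and that each record carries a "type_string" field;
# adjust the filename and field name to match your actual output.
counts = Counter()
with open("exported_operations.txt") as export_file:
    for line in export_file:
        record = json.loads(line)
        counts[record.get("type_string", "unknown")] += 1

for op_type, count in counts.most_common():
    print(f"{op_type}: {count}")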

The new Horizon ingestion system uses history archives to provide detailed information about historical network activity, which is more developer-friendly and faster than reading the stellar-core and Horizon databases. The history archives alone should not be trusted, since a malicious actor could alter them, so the stellar-etl also uses a stellar-core instance to verify the data. If you’re interested in learning more, check out the Meridian talk on Horizon's new architecture by Eric Saunders, head of the Horizon team.

Note that the history archives update at each checkpoint ledger, which occurs every 64 ledgers. Since a ledger closes every 5 seconds on average, a new checkpoint appears roughly every 5 minutes. That means the stellar-etl can update our new dataset every 5 minutes!
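
The arithmetic behind that update cadence is simple enough to check; the 5-second figure is the network's average close time, not a guarantee:

# Back-of-the-envelope math for the dataset's update cadence.
LEDGERS_PER_CHECKPOINT = 64
AVG_LEDGER_CLOSE_SECONDS = 5

interval_seconds = LEDGERS_PER_CHECKPOINT * AVG_LEDGER_CLOSE_SECONDS
print(f"A new checkpoint roughly every {interval_seconds} seconds "
      f"(~{interval_seconds / 60:.1f} minutes)")
# -> A new checkpoint roughly every 320 seconds (~5.3 minutes)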

Building the New Pipeline

To construct a larger data pipeline, we had to orchestrate many different tasks, scale gracefully, and handle failures. We decided to use Apache Airflow, a general-purpose task management system originally open sourced by Airbnb. Since then, it has become one of the most popular tools across the industry for data science workflow management.

In Airflow, the user defines atomic units of work called tasks. A task can involve anything from sending an email to querying a database. The user can link tasks to form Directed Acyclic Graphs (DAGs), which act as complete workflows. The user can then set schedules for the defined DAGs.

Then, Airflow handles everything related to tasks: scheduling them, retrying them on failure, and advancing through the DAG. Unlike our previous system, this is a component-by-component process, which means Airflow can retry individual failed tasks instead of restarting the whole pipeline.
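
To make that concrete, here's a minimal sketch of a DAG with two linked tasks, a schedule, and per-task retries. It assumes Airflow 2.x import paths, and the task contents are placeholders rather than anything from our production setup:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

# Per-task defaults: if a task fails, Airflow retries just that task.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="minimal_example_dag",
    default_args=default_args,
    start_date=datetime(2020, 10, 1),
    schedule_interval="*/5 * * * *",  # run every 5 minutes
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'export data with the stellar-etl'",
    )
    load = BashOperator(
        task_id="load",
        bash_command="echo 'load the exported file into storage'",
    )

    # Linking tasks defines the DAG: extract must succeed before load runs.
    extract >> load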

With Airflow, we constructed DAGs that acted as complete data pipelines. For example, our history archive export DAG exports data within a time range. The DAG’s first task converts the time range into a ledger range. Then, four different tasks call the stellar-etl export commands for ledgers, transactions, operations, and trades. The next step in the pipeline uploads each of the exported files to Google Cloud Storage. Finally, we add the files to their corresponding BigQuery tables. Check out these DAG diagrams to learn more about the DAGs and their component tasks.
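
As an illustration of that structure, here's a simplified sketch of how such a DAG can be wired up. It is not our production definition: the task names and placeholder bash commands are assumptions for the sketch, and the real operators live in the Airflow repository linked at the end of this post.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="history_archive_export_sketch",
    start_date=datetime(2020, 10, 1),
    schedule_interval="*/5 * * * *",  # a new checkpoint roughly every 5 minutes
    catchup=False,
) as dag:
    # 1. Convert the scheduled time range into a ledger range.
    get_ledger_range = BashOperator(
        task_id="get_ledger_range",
        bash_command="echo 'resolve the time range into a ledger range'",
    )

    # 2. Run the four stellar-etl exports in parallel over that ledger range.
    exports = [
        BashOperator(
            task_id=f"export_{data_type}",
            bash_command=f"echo 'stellar-etl export_{data_type} for the ledger range'",
        )
        for data_type in ("ledgers", "transactions", "operations", "trades")
    ]

    # 3. Upload the exported files to Google Cloud Storage.
    upload_to_gcs = BashOperator(
        task_id="upload_to_gcs",
        bash_command="echo 'copy the exported files to a GCS bucket'",
    )

    # 4. Append the uploaded files to the corresponding BigQuery tables.
    load_to_bigquery = BashOperator(
        task_id="load_to_bigquery",
        bash_command="echo 'load the GCS files into BigQuery'",
    )

    # Fan out to the four exports, then fan back in for upload and load.
    get_ledger_range >> exports >> upload_to_gcs >> load_to_bigquery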

Moving into Production

We considered two main paths for deploying Airflow. The first is a custom Kubernetes setup, where the developer defines and configures all the pieces that Airflow needs. Here’s a quick checklist:

  • Deployments for Airflow webserver and scheduler pods
  • Access to DAG definitions through a Kubernetes persistent volume or by syncing with a git repository
  • Backend requirements for the Airflow CeleryExecutor and worker pods
  • Database for Airflow task metadata
  • Storage space for Airflow logs

As you can see, there are a lot of moving parts, but fortunately you can use existing Helm charts as templates for your Airflow setup. One such chart is available here.

The second path is Cloud Composer, Google's managed service for deploying Airflow. It provides a working Kubernetes setup integrated with the rest of the Google Cloud environment. The Airflow webserver, scheduler, and executor can all run in the cloud and pull from a common cloud database. Instructions for a Cloud Composer setup can be found here.

Try it Out!

We’re really excited about making our data pipeline as close to real time as possible, and we're already thinking about ways to get new data and metrics using the new tooling. For instance, we're looking into storing historical orderbooks to analyze past markets, which we can now do thanks to the stellar-etl’s ability to transform large amounts of data into a compact format. We’re also looking forward to seeing what the Stellar community can build with our new data pipeline and tooling. Have ideas? We'd love to hear them. This is a free, open-source resource for any and all to use, and we're excited to see what you can discover.

Check out the stellar-etl or our Airflow setup. Feel free to open issues on GitHub if you have questions or concerns, or if you'd like to request new features!