The Cardano Heartbeat (Crisis Event Management Platform) is a Project Catalyst Fund 6 proposal to develop a platform for managing the Cardano blockchain in disaster or crisis events. View the full proposal on IdeaScale.
This project is a work in progress and is currently being developed by a team from the Armada Alliance organization.
Many small SPOs dream of their first block. Once that block shows up, a weight is lifted and the real fun begins, like waiting to pull those first rewards. But how many SPOs dream about a backup plan? You've tweaked your pool and she's purring like a Lambo on the autobahn. So what could go wrong? And if something does, would you know how to recover? Well, stop staring at your pool metrics and let's explore some basic backup plan options using the 3-2-1 methodology.
A better question to ask is when. Anything could go wrong and if it impacts your pool it's a big deal. You gotta be ready to deal with it as quickly as possible given the circumstances at hand. Your next block is in 1 hour and you just lost your core node, GO!
What types of disruptions can cause emergencies?
Natural Disaster
War or Terrorism
Civil Disruption or unrest
Accidents or human error
Cyber Attacks
What are the most important points of failure to an SPO?
Power
Internet outages
Network
Critical operational Data, secret keys, and files
Human error
Simple: keep 3 copies of all your stuff, with 2 copies on two different media types and 1 copy completely offsite. This 3-2-1 backup plan is a great way to keep your stake pool and its essential data safe.
Following this 3-2-1 plan, we have three distinct copies of our stake pool's operational/production data. One copy is the current data used to operate the stake pool (i.e., keys, certs, metadata, wallets, etc.). The other two copies are backups of that operational data.
An important aspect of keeping your pool's data safe and recoverable is to store all three copies (the operational copy and the two backups) in such a manner that if one or more of them fails or is lost, you always have another copy safe and intact to recover from.
Lastly, it is vital to keep all your data copies updated and in sync with the current operational data. Do not update one copy and leave the other two out of sync. For example, if you update the current operational data and leave the backups out of sync, you will not be able to recover your stake pool in a crisis. All copies should contain the same data from the same exact point in time.
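One way to check that the copies really match is to compare checksums. The following is a minimal sketch, assuming each copy is reachable as a local directory (for example a mounted drive or a synced cloud folder). The function names `dir_fingerprint` and `copies_in_sync` are hypothetical and not part of any Cardano tooling.

```shell
#!/usr/bin/env bash
# Sketch: confirm that several copies of the pool data hold identical files.
# Function names and paths are illustrative assumptions.

# Print one fingerprint for a directory: a SHA-256 over the sorted
# per-file checksums, so the order files are found in does not matter.
dir_fingerprint() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum | sha256sum | cut -d' ' -f1 )
}

# Return 0 if every directory given matches the first one, 1 otherwise.
copies_in_sync() {
    local reference other
    reference="$(dir_fingerprint "$1")" || return 1
    shift
    for other in "$@"; do
        if [ "$(dir_fingerprint "$other")" != "$reference" ]; then
            echo "out of sync: $other" >&2
            return 1
        fi
    done
    return 0
}
```

It could be called as `copies_in_sync /path/to/operational /path/to/usb-copy /path/to/cloud-copy`, with the operational copy first so that the backups are compared against it.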
For the two backup copies we use two different media types: one is a hard drive and the other is cloud-based storage. That way, if one of the copies is lost or fails, we can still recover from the other. It is recommended to keep the cloud-based backup in a different region, not near your local copies.
In general, offsite means remote. It is safe enough if you can keep at least 1 backup stored in another place a long distance away, i.e. not onsite. Hard drives fail eventually, so a good place for the offsite copy would be a cloud drive, NAS, or network share.
Physical storage may be damaged by human error, flood, or earthquake, or stolen, but that rarely happens to network drives, especially cloud storage offered by well-known service providers. Believe it or not, they have stricter ways to ensure data security.
Sure, you can be that guy or gal, but you're introducing a lot of risk. The goal is to minimize downtime and risk. The longer your pool is down, the more risk you have of missing a block and the longer it'll take to sync back up to the chain. Once you're sitting on a solid plan you can use it to your pool's advantage. Advertise it as a means to draw delegates. Share your plan in your circle of influence so others can benefit as well. A good decentralized blockchain needs SPOs who are serious about minimizing downtime.
What files are important to an SPO to recover from a crisis?
Node.vkey (cold)
Node.skey (cold)
Node.opcert.counter (cold)
Node.kes.vkey (hot)
Node.kes.skey (hot)
Node.opcert (hot)
Node.vrf.vkey (cold)
Node.vrf.skey (cold)
Payment.vkey (cold)
Payment.skey (cold)
Stake.vkey (cold)
Stake.skey (cold)
Stake.address
Payment.address (hot)
Stake.cert (hot)
Metadata.json
poolMetadataHash.txt
MetadataUrl
Pool.registration.cert
Deleg.cert (hot)
DB snapshot (backup)
Network Configs
ufw/iptables
sudo ufw status numbered
sudo iptables -S
wireguard config
/etc/wireguard/wg0.conf
/root/wg
Router config/snapshot
Pool Configs
mainnet-config.json
mainnet-alonzo-genesis.json
mainnet-byron-genesis.json
mainnet-shelley-genesis.json
mainnet-topology.json
Binaries
cardano-cli
cardano-node
Tools and Monitoring
gLiveView.sh
env
cardano-service (armada alliance optional)
armadaPing.sh (armada alliance optional)
topologyUpdater.sh
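A list like the one above can double as a restore checklist. The sketch below reports which expected files are missing or empty in a backup folder. It assumes a flat folder layout, and the helper name `missing_files` is hypothetical. The file names are passed in as arguments, so use the names exactly as they exist on your own node.

```shell
#!/usr/bin/env bash
# Sketch: report which expected files are absent from a backup directory.
# missing_files and the flat directory layout are illustrative assumptions.

# Usage: missing_files <backup-dir> <file>...
# Prints every expected file that is missing or empty.
# Returns 0 when nothing is missing, 1 otherwise.
missing_files() {
    local dir="$1" name missing=0
    shift
    for name in "$@"; do
        # -s is true only for a file that exists and is not empty.
        if [ ! -s "$dir/$name" ]; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `missing_files /home/username/core-backup mainnet-config.json mainnet-topology.json cardano-node cardano-cli` prints nothing when all four files are present.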
The three main types of backups are full, incremental, and differential, each with its own advantages and disadvantages. We will briefly go over each one and recommend the one that is most suitable for running your stake pool operation.
A full backup is a complete "point-in-time" copy of your system and the data needed for running your stake pool, written to a local and/or remote storage device. This is fine to do on a daily basis for a single stake pool operator with a limited amount of data to back up. It is recommended that for every stake pool you have at least one full backup of both the OS/image used on your node and your production data (keys, certs, metadata, wallets, etc.). You could just do a full backup to a USB stick, repo, or cloud server every day and be fine; you can find our full USB stick backup script and guide here to learn more. A benefit of this method is that it is the most reliable way to ensure your data is correctly and safely backed up and ready to be used at a moment's notice to recover from a disaster. The main drawback of the full backup is that it requires more resources from your local or cloud servers, which may increase the cost of running the pool depending on your setup.
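As a rough illustration of a full backup of the production data, the sketch below writes a timestamped archive of one directory and stores a checksum beside it. The function name `full_backup`, the archive naming, and the paths are assumptions made for this example. It is not the USB stick script mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: write a full, point-in-time archive of a directory plus a checksum.
# full_backup, the archive name, and both paths are hypothetical.

# Usage: full_backup <source-dir> <destination-dir>
# The destination must not be inside the source directory.
full_backup() {
    local src="$1" dest="$2" stamp archive
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$dest/pool-full-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps the paths inside the archive relative to the source directory.
    tar -czf "$archive" -C "$src" . || return 1
    # Store a checksum next to the archive so a later restore can verify it.
    ( cd "$dest" && sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" ) || return 1
    echo "$archive"
}
```

Running `full_backup /path/to/pool-data /path/to/backup-drive` prints the path of the new archive. Since the archive would contain keys, it should only be written to media you control.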
Unlike a full backup, where you copy the entire system and its data on a scheduled basis, an incremental backup copies only the data that has changed since the previous backup, whether that was a full or an incremental one. This can be a much more efficient way to back up your data if you are a small to medium-sized company that needs efficient, cost-effective, reliable, and scalable backup solutions. While this is a great solution for most businesses with a decent amount of data, it is not recommended for a stake pool operator with little data that changes (other than KES certs), as it may be overkill. Linux users can use incremental backup tools like Timeshift, macOS users can use rsnapshot or Time Machine, and Windows 10 users can use System Restore.
Similar to how you should back up your data, there are three main backup policies or plans that you should consider: the local, hybrid, and cloud backup.
The local backup strategy may work for some pools, but it is risky even for the smallest of pools. In an extreme event like a natural disaster, war, civil unrest, theft/robbery, or even a human error, you may lose your entire stake pool and its relevant data if you are not prepared.
The hybrid backup strategy is a combination of local and cloud backup. It can be one of the most reliable backup strategies and is the most cost effective for almost any stake pool.
Finally, we have the cloud backup strategy, which is a very reliable backup strategy as well but less cost effective in most cases and requires you to give up full ownership of your pool's hardware and sometimes even data.
Daily backup of your core's hot keys and operational files to a local or remote USB stick with rsync.
Tail syslog before inserting your drive. This will print some information that can help you identify the disk.
Attach the external drive and take note of the assigned device node, e.g. /dev/sdb
If the target drive lacks a partition table, syslog may not print the device node assignment. fdisk -l, however, will.
You can also print a list of drives with fdisk.
Example output:
In my case it is /dev/sdb. Yours may be /dev/sdc, /dev/sdd or so on. /dev/sda is usually the system drive. Do not format your system drive by accident.
This will wipe the disk
Type ? to list options
Enter o for new GPT
Enter n to add a new partition and accept defaults to create a partition that spans the entire disk.
Enter w to write changes to disk and exit gdisk.
Your new partition can be found at /dev/sdb1, the first partition on sdb.
Make the USB backup drive always available to our backup job. Since it will be holding sensitive data, we will mount it in a way where only root and the user cardano-node runs as can access it.
Run blkid and pipe it through awk to get the UUID of the filesystem we just created.
Example output:
In my case the UUID is 55e3346a-a7ba-4b60-bd68-fa8f86b8f8ca
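The blkid and awk step can be sketched as a small filter. This is one possible awk expression, not necessarily the one the guide originally used, and the helper name `extract_uuid` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pull the filesystem UUID out of a line of blkid output with awk.
# extract_uuid is a hypothetical helper name.

# Reads blkid output on stdin and prints the value of each UUID= field.
# PARTUUID= fields are ignored because they do not start with "UUID=".
extract_uuid() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^UUID=/) {
                gsub(/^UUID=|"/, "", $i)   # drop the UUID= prefix and the quotes
                print $i
            }
        }
    }'
}
```

With the drive from this guide it would be used as `sudo blkid /dev/sdb1 | extract_uuid`.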
Drop back into your regular user's shell.
Add a mount entry to the bottom of fstab, adding your new partition's UUID and the full system path to your backup folder. For this guide we set the path to a folder we will create in our home directory: /home/username/core-backup
Replace user with the user cardano-node runs as.
nofail allows the server to boot if the drive is not inserted.
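To make the shape of such an entry concrete, here is a sketch that prints an fstab line for an ext4 backup partition. Only `nofail` and the mount path come from this guide. The remaining mount options and the helper name `fstab_entry` are assumptions, so adjust them to your own setup.

```shell
#!/usr/bin/env bash
# Sketch: build an fstab line for the ext4 backup partition.
# fstab_entry and the option set (other than nofail) are assumptions.

# Usage: fstab_entry <filesystem-uuid> <mountpoint>
fstab_entry() {
    local uuid="$1" mountpoint="$2"
    # Fields: device, mountpoint, type, options, dump flag, fsck pass.
    printf 'UUID=%s %s ext4 defaults,nosuid,nodev,noexec,nofail 0 2\n' "$uuid" "$mountpoint"
}
```

The output can be reviewed first and then appended, for example with `fstab_entry "$uuid" /home/username/core-backup | sudo tee -a /etc/fstab`, where `$uuid` holds the UUID found with blkid.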
Create the mountpoint & set default ACL for files and folders with umask.
Mount the drive.
Take ownership of the filesystem.
Reboot the server and confirm the system mounted the drive at boot.
This will wipe the disk
Type ? to list options
Enter o for new GPT
Enter n to add a new partition and accept defaults to create a partition that spans the entire disk.
Enter w to write changes to disk and exit gdisk.
Set the msftdata flag on the exFAT partition (also taken from Thawn's answer). Since we have only one partition, apply the command to partition 1.
Your new partition can be found at /dev/sdb1, the first partition on sdb.
We want this drive to always be available to our backup job. Since it will be holding sensitive data, we will mount it in a way where only root and the user cardano-node runs as can access it.
Run blkid and pipe it through awk to get the UUID of the filesystem we just created.
Example output:
In my case the UUID is 7FFD-F67C
Drop back into your regular user's shell.
Add a mount entry to the bottom of fstab, adding your new partition's UUID and the full system path to your backup folder. For this guide we set the path to a folder we will create in our home directory: /home/username/core-backup
Identify the user ID and group ID of the user cardano-node runs as and substitute them in the fstab entry.
nofail allows the server to boot if the drive is not inserted.
Create the mountpoint & set default ACL for files and folders with umask.
Create a script that will only back up if the drive is mounted.
Create an rsync-exclude.txt file so we can rip through and grab everything we need and skip the rest.
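The mount-guarded script and the exclude file described above could fit together roughly as follows. This is a sketch under stated assumptions: the function name `backup_core` and all paths are hypothetical, and the exclude file is expected to hold one rsync pattern per line.

```shell
#!/usr/bin/env bash
# Sketch: copy the core's files to the backup drive only when it is mounted.
# backup_core and the paths passed to it are illustrative assumptions.

# Usage: backup_core <source-dir> <mounted-backup-dir> <exclude-file>
backup_core() {
    local src="$1" dest="$2" exclude_file="$3"
    # mountpoint -q exits non-zero when dest is not a mounted filesystem.
    # Without this guard rsync would fill the empty folder on the system disk.
    if ! mountpoint -q "$dest"; then
        echo "backup drive not mounted at $dest, skipping" >&2
        return 1
    fi
    # -a preserves permissions and timestamps.
    # --exclude-from skips every pattern listed in the exclude file.
    rsync -a --exclude-from="$exclude_file" "$src/" "$dest/"
}
```

A call such as `backup_core /home/username/pi-pool /home/username/core-backup /home/username/rsync-exclude.txt` would then copy everything except the excluded patterns. The source folder name here is only a placeholder.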
Open crontab and add the rule to the bottom.
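If you prefer not to edit the crontab by hand, the rule can also be appended from the shell. The sketch below adds a rule only when it is not already present. The helper name `add_cron_rule`, the schedule, and the script path in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: append a cron rule to a crontab listing without duplicating it.
# add_cron_rule is a hypothetical helper name.

# Usage: <current crontab on stdin> | add_cron_rule '<rule>'
add_cron_rule() {
    local rule="$1" existing
    existing="$(cat)"          # the current crontab text arrives on stdin
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -F fixed string, -x whole line, -q quiet: add the rule only if absent.
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$rule"; then
        printf '%s\n' "$rule"
    fi
}
```

A daily run at 03:00 could then be installed with `crontab -l 2>/dev/null | add_cron_rule '0 3 * * * /home/username/core-backup.sh' | crontab -`. Running the same command twice leaves a single copy of the rule.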
Create an alias in .bashrc (or .adaenv, if present) to back up the core manually.
Add the following at the bottom, edit the paths and excludes as you see fit, and source the changes.
Now if you want to manually back up the hot keys, just type core-backup, for example after generating a new KES pair and node.cert.
To determine the skills, hardware, and good practices Stake Pool Operators need to be more resilient to unforeseen events that may take their pools off the network, we have broken the checklist into the following sections: Stake Pool Operations Recommended Skills and Resources, Resilience Options, and Redundancy.
Designated off-grid power duration (12-24 hrs)
Data and Software Backup
Backup keys and passwords
Written Down on paper
Electronic backups on USB/external hard drive
Backup Configuration Files
Backup Node Software (node and cli binaries)
Backup DB snapshot
Backup Tools/software
Hardware
Spare node hardware (physical location with access)
Spare node hardware (cloud based)
Spare SSD/ hard drives
Cables
Ethernet Cables
SSD/HDD Adapters and Cables
Back Up Power Supplies for Nodes
Internet
Main ISP
Fiber
DSL
Satellite
Cable/Coaxial
Backup ISP
Cellular/4G-5G wireless
Satellite/Starlink
Secondary cloud based ISP (AWS, Azure, GCP, etc)
Secondary location (should be out of your region) & ISP with your own hardware
Power supply
Failover
UPS
Solar panels + batteries
Generator