How Does Shardeum Detect Lost Validator Nodes?
Find out how Shardeum efficiently detects and handles lost validator nodes due to downtime or unresponsiveness with an insight into associated...
Shardeum is one of the most innovative networks of the modern era, with many novel features and technologies such as dynamic state sharding, autoscaling, and network modes. These features are a result of the cutting-edge Shardus protocol layer powering it. One of them is called “lost node detection.”
In a nutshell, lost node detection on Shardeum is crucial for maintaining the reliability and efficiency of the network. It ensures that node failures are quickly identified and addressed, allowing the network to adjust its node list and maintain optimal performance. This process is vital for sustaining the network’s decentralized architecture and ensuring that it remains secure and scalable.
On typical layer 1 networks, a common penalty for downtime is slashing. Shardeum does things slightly differently: instead of immediately slashing for downtime, we use a feature called lost node detection. After an active node has a prolonged period of downtime or unresponsiveness, it is classified as a lost node. Nodes that remain lost and do not successfully refute the designation will also incur a slashing penalty.
Lost nodes have a chance to prove their responsiveness by refuting the lost designation and thus become actively participating nodes again. The slashing penalty will be integrated in the near future. Initially, we will apply a modest slashing percentage to each non-refuting lost node (i.e. a node that is actually down or non-operational for a sufficiently long period of time). Over time, we may adjust these percentages iteratively to reach the right balance.
Note: This applies to validators. A similar slashing and lost node detection mechanism exists for archivers, which we will discuss in another blog.
Before we proceed to break down and explore the steps to detect lost validator nodes, let’s define some of the key terms and components so it will be easy for you to follow the blog.
Like much of the technology on Shardeum, lost node detection is deceptively complex and can be difficult to grasp. Luckily, we’re here to help! The lost node detection process can be subdivided into 8 steps. Not every step is part of every run, but all of them are needed at various points in time for lost node detection to be secure and effective. Here are the 8 steps, followed by the details of each.
Imagine the following scenario. A node, which we will call the sender node (Node S), finds that another node, the target node (Node T), is down or unresponsive. The sender node then creates an “investigate” transaction to check on the target node’s status. This constitutes the initiation of the investigate transaction process.
The “investigate” transaction is a formal request that ensures a standard procedure is followed when a node appears to be inactive, which is necessary for maintaining the order and consistency of network operations. This prevents any non-procedural or unplanned responses to node failures and provides a verifiable method of initiating a node check.
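As a rough illustration of what such a request could look like, here is a minimal sketch in Python. The field names (`sender_id`, `target_id`, `cycle`), the helper names, and the use of HMAC-SHA256 as a stand-in for a node signature are all assumptions made for this example; they are not taken from the Shardus codebase.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InvestigateTx:
    """Hypothetical shape of an "investigate" request."""
    sender_id: str  # Node S, the node reporting the problem
    target_id: str  # Node T, the node that appears down
    cycle: int      # cycle in which the report is made


def sign(payload: dict, secret: bytes) -> str:
    # Stand-in signature: HMAC-SHA256 over canonical JSON.
    # A real network would use asymmetric signatures instead.
    message = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def create_investigate_tx(sender_id: str, target_id: str, cycle: int,
                          sender_secret: bytes) -> dict:
    body = asdict(InvestigateTx(sender_id, target_id, cycle))
    return {"type": "investigate", "body": body, "sig": sign(body, sender_secret)}


def verify(signed_tx: dict, secret: bytes) -> bool:
    # Recompute the signature over the body and compare in constant time.
    return hmac.compare_digest(signed_tx["sig"], sign(signed_tx["body"], secret))
```

Because the request is signed, any node that receives it can check who raised the report before acting on it, which is what makes the check verifiable rather than ad hoc.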
The second phase of the process involves a new node called the investigator node (Node I). The investigator node is chosen based on its proximity to the hash of the target node + the previous cycle marker (T+C1). Choosing the investigator this way makes the selection deterministic and verifiable. It also makes it hard for nodes to predict or influence who will be the investigator node, which increases security.
Additionally, using the hash function prevents bias or manipulation in choosing the investigator node, as the hash function’s output is essentially random but deterministic when given the same input. Once the investigator node is chosen, a “node down” transaction is created. This “node down” transaction contains the original “investigate” transaction signed by the sender node and is also now signed by the investigator node. Finally, the transaction must be sent during the first quarter of cycle marker 1 (Q1 of C1).
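The selection rule can be sketched as follows. This is only an illustration under stated assumptions: the text does not define "proximity", so the sketch treats it as the absolute numeric distance between the SHA-256 hash of each node ID and the SHA-256 hash of the target ID concatenated with the cycle marker. The function names and the choice to exclude the target itself are likewise hypothetical.

```python
import hashlib


def sha256_int(data: str) -> int:
    """Interpret the SHA-256 digest of a string as a big integer."""
    return int.from_bytes(hashlib.sha256(data.encode()).digest(), "big")


def select_investigator(node_ids, target_id: str, prev_cycle_marker: str) -> str:
    # Selector value derived from the target node and the previous cycle marker (T + C1).
    selector = sha256_int(target_id + prev_cycle_marker)

    # Assumption: the target cannot investigate itself.
    candidates = [n for n in node_ids if n != target_id]
    if not candidates:
        raise ValueError("no candidate investigator available")

    # Assumption: "proximity" means smallest numeric distance between hashes.
    # The node ID is used as a tie-breaker so the result never depends on list order.
    return min(candidates, key=lambda n: (abs(sha256_int(n) - selector), n))
```

Every node that knows the same node list and the same cycle marker computes the same investigator, so the choice can be checked by anyone, while the dependence on the cycle marker makes it hard to know in advance who will be picked.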
The investigator node starts the investigation upon receiving the message. The investigator node then checks if the target node is actually down. If the target node is not down, the investigator takes no further action and the process is complete. Having the investigator node check the status of the target node prevents the network from taking unnecessary action if the target node is not actually down. This approach increases efficiency and reduces network noise; if the target node isn’t actually lost, there’s no need to further disseminate any additional messages.
The investigator node creates a “node down” transaction if it finds the target node is in fact down or unresponsive. This “node down” transaction is then gossiped during the first quarter of cycle marker 2 (Q1 of C2). The creation of a “node down” transaction by the investigator node provides a documented and timestamped record of the investigation’s findings. It serves as a formal declaration of target node’s status, which is essential for network participants to agree on the state of the network and act accordingly.
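The investigator's part of the process can be sketched as a single function. The dictionary layout, the `is_responsive` callback, and the function name are assumptions made for illustration, and signatures are left out to keep the example short.

```python
from typing import Callable, Optional


def investigate(investigate_tx: dict, investigator_id: str,
                is_responsive: Callable[[str], bool]) -> Optional[dict]:
    """Return a "node down" transaction, or None if the target answers."""
    target_id = investigate_tx["target_id"]

    # If the target responds, the investigator takes no further action.
    if is_responsive(target_id):
        return None

    # Otherwise wrap the original report in a "node down" transaction.
    # In the described protocol this would also carry the investigator's
    # signature and be gossiped in Q1 of C2; both are omitted here.
    return {
        "type": "node_down",
        "target_id": target_id,
        "investigator_id": investigator_id,
        "investigate_tx": investigate_tx,
    }
```

For example, calling `investigate(tx, "I", lambda node_id: False)` produces a "node down" record for the target named in `tx`, while a callback that returns `True` ends the process with no message sent.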
During the first quarter of cycle marker 2 (Q1 of C2), nodes receive transactions either from the investigator node or from other previously engaged nodes, and propagate them by gossiping to additional nodes. This ensures that, regardless of the source, critical information about node failures is disseminated rapidly, enhancing the network’s resilience.
However, during the second quarter of cycle marker 2 (Q2 of C2), nodes ignore the transaction if it is received from the investigator node but still gossip it if it is received from other nodes. Transactions received directly from the investigator node are ignored at this point to prevent unnecessary transactions and potential spam from over-reporting. Transactions received from other nodes are still gossiped to maintain network integrity and ensure all nodes receive the update, particularly if they missed the initial broadcast in Q1. This design helps optimize network traffic and ensures that information about node status is shared quickly and widely without overwhelming the network with redundant messages.
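The Q1 and Q2 rules for cycle C2 reduce to a small decision function. The sketch below is illustrative only: the function name is hypothetical, and since the text does not say what happens in Q3 and Q4, the sketch assumes nothing is forwarded in those quarters.

```python
def should_forward_node_down(quarter: int, from_investigator: bool) -> bool:
    """Decide whether to gossip a "node down" transaction received during C2."""
    if quarter == 1:
        # Q1: forward regardless of who sent it.
        return True
    if quarter == 2:
        # Q2: ignore direct sends from the investigator, forward everything else.
        return not from_investigator
    # Assumption: the text does not cover Q3 and Q4, so nothing is forwarded.
    return False
```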
During the first quarter of cycle marker 3 (Q1 of C3), the target node can issue a “node up” transaction if it is not down or unresponsive as claimed. This invalidates the previous “node down” transaction. It also allows the target node to formally declare its return to operational status, which is necessary to correct its previous status as “down” and prevent the network from routing around or excluding the target node unnecessarily from the active set. Furthermore, it gives control back to the target node to declare its operational status, ensuring that nodes have the ability to correct their designation autonomously.
During the third quarter of cycle marker 3 (Q3 of C3), nodes that have received both the “node down” and “node up” transactions through the gossip protocol collaboratively decide whether the target node (Node T) should be marked as lost. This ensures that the decision is based on a consensus among the nodes directly informed by the relevant transactions. It prevents premature judgments about the target node’s status while still allowing a node that is actually down to be slashed. By participating in this evaluation, these informed nodes contribute to a collective and accurate assessment, keeping the network reliable and its node records up to date.
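From the point of view of a single node, the evaluation can be sketched as bookkeeping over the gossiped transactions it has seen. This is a simplified, hypothetical model: it assumes a target counts as lost when a "node down" was seen and no "node up" refuted it, and it does not model how nodes reach agreement with each other, which the text does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class LostNodeTracker:
    """Per-node record of the "node down" and "node up" transactions it has seen."""
    down: set = field(default_factory=set)
    up: set = field(default_factory=set)

    def record(self, tx_type: str, target_id: str) -> None:
        if tx_type == "node_down":
            self.down.add(target_id)
        elif tx_type == "node_up":
            self.up.add(target_id)
        else:
            raise ValueError(f"unexpected transaction type: {tx_type}")

    def lost_at_q3(self) -> set:
        # Assumption: a "node up" invalidates the matching "node down",
        # so only unrefuted targets remain in the lost set.
        return self.down - self.up
```

A node that records "node down" for T1 and T2 and then a "node up" for T2 would, under this model, treat only T1 as lost when Q3 of C3 arrives.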
As in any robust system, edge cases must be accounted for in order to handle even the most unlikely scenarios. For this reason, lost node detection has built-in mechanisms to address edge cases that could cause failures in an otherwise technologically sound system. This is how lost node detection addresses these edge cases:
Lost node detection is a mechanism designed to detect and manage active nodes that become unresponsive or lose connectivity. Through eight key steps, it marks such nodes as “lost” without penalizing them prematurely. Nodes that resume responsiveness, or that were wrongly labelled as lost, can restore their active status by submitting a transaction. By allowing these lost validator nodes to regain active status without imposing harsh penalties, the process maintains network integrity and performance and enhances the network’s resilience and adaptability.