How Does Shardeum Detect Lost Validator Nodes?

Find out how Shardeum efficiently detects and handles lost validator nodes due to downtime or unresponsiveness with an insight into associated...

Introduction

Shardeum is one of the most innovative networks of the modern era, with many novel features and technologies such as dynamic state sharding, autoscaling and network modes. These features and technologies are a result of the cutting-edge Shardus protocol layer powering it. One of these features is called “lost node detection.”

In a nutshell, lost node detection on Shardeum is crucial for maintaining the reliability and efficiency of the network. It ensures that node failures are quickly identified and addressed, allowing the network to adjust its node list and maintain optimal performance. This process is vital for sustaining the network’s decentralized architecture and ensuring that it remains secure and scalable.

Node Downtime Penalties on Shardeum

On typical layer 1 networks, a common penalty for downtime is slashing. Shardeum does things slightly differently: instead of immediately slashing for downtime, we use a unique feature called lost node detection. An active node that is down or unresponsive for a prolonged period is classified as a lost node. A slashing penalty accompanies this status for nodes that remain lost and do not successfully refute the designation.

Lost nodes have a chance to prove their responsiveness by refuting the lost node designation and thus become actively participating nodes again. The slashing penalty will be integrated in the near future. Initially, we will apply a modest slashing percentage to each non-refuting lost node (i.e. a node that is actually down or non-operational for a sufficiently long period of time). Over time, we may adjust these percentages iteratively, calibrating them gradually to reach the right balance.

Note: This applies to validators. A similar slashing and lost node detection mechanism exists for archivers, which we will discuss in another blog.

Terminologies to Keep in Mind

Before we proceed to break down and explore the steps to detect lost validator nodes, let’s define some of the key terms and components so it will be easy for you to follow the blog.

  • Active nodes: Active nodes are validator nodes actively engaging in consensus and validation. All active nodes start as standby nodes before being randomly selected and cycled into the active set.
  • Cycles: A cycle corresponds to an interval of approximately 60 seconds on Shardeum and consists of 4 quarters: Q1, Q2, Q3, Q4.
  • Node ID: A unique numerical identifier assigned to each node participating in the network, which facilitates node identification and communication.
  • I = Investigator Node; The node assigned to check whether the target node (T) is active or inactive.
  • T = Target Node; The node being investigated for activity or inactivity.
  • S = Sender Node; The node that initiates the investigation process, suspecting that the target node (T) might be inactive.
  • C1 = Cycle Marker 1; A reference point indicating the previous cycle marker in the network’s operation when the investigation is initiated.
  • C2 = Cycle Marker 2; A reference point indicating a specific cycle or period in the network’s operation when a “node down” transaction is initiated.
  • C3 = Cycle Marker 3; A reference point indicating a specific cycle or period in the network’s operation when a “node up” transaction is initiated.
  • sign(S) = Signature of Node S; A digital signature from the sender node (Node S), authenticating that it is indeed the node that created the “investigate” transaction.
  • sign(I) = Signature of Node I; A digital signature from the investigator (Node I), authenticating that it is indeed the node that created the “node down” transaction.
  • sign(T) = Signature of Node T; A digital signature from the target node (Node T), authenticating that it is indeed the node that created the “node up” transaction.
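To make the notation concrete, here is a minimal sketch of how the three transactions could be modelled as records. The class and field names are our own illustration and are not taken from the Shardus codebase; signatures are plain placeholder strings, and the exact contents of the “node up” transaction are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvestigateTx:
    """Created by the sender node S when it suspects T is down."""
    target: str        # T: ID of the node suspected of being down
    cycle_marker: str  # C1
    sender: str        # S (hypothetical field, lets readers attribute sign(S))
    signature: str     # sign(S), a placeholder string in this sketch


@dataclass(frozen=True)
class NodeDownTx:
    """Created by the investigator node I if it finds T unresponsive."""
    investigate: InvestigateTx  # the original request, still signed by S
    cycle_marker: str           # C2
    investigator: str           # I
    signature: str              # sign(I), placeholder


@dataclass(frozen=True)
class NodeUpTx:
    """Created by the target node T to refute a "node down" claim."""
    target: str        # T
    cycle_marker: str  # C3
    signature: str     # sign(T), placeholder


investigate = InvestigateTx("node-T", "c1", "node-S", "sig-S")
down = NodeDownTx(investigate, "c2", "node-I", "sig-I")
up = NodeUpTx("node-T", "c3", "sig-T")
```

Nesting the “investigate” record inside the “node down” record mirrors the description in step 2, where the “node down” transaction carries the original request along with both signatures.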

Steps to Detect Lost Validator Nodes

Like much of the technology on Shardeum, lost node detection is deceptively complex and can be difficult to grasp. Luckily, we’re here to help! The process can be subdivided into 8 steps. Not every step runs every time, but each is needed at some point for lost node detection to be secure and effective. Here are the 8 steps, followed by the details of each.

  1. Investigate Transaction Initiation
  2. Selecting the Investigator Node (Node I)
  3. Investigation Process
  4. Node Down Transaction Creation (If Node T is Down)
  5. Gossiping the Node Down Transaction
  6. Node Up Transaction Creation (If Node T is Up)
  7. Resolution of Transactions
  8. Handling Edge Cases

1. Investigate Transaction Initiation

Imagine the following scenario. A node, which we will call the sender node (Node S), finds that another node, the target node (Node T), is down or unresponsive. The sender node then creates an “investigate” transaction to check on the target node’s status. This constitutes the initiation of the investigate transaction process.

The “investigate” transaction is a formal request that ensures a standard procedure is followed when a node appears to be inactive, which is necessary for maintaining the order and consistency of network operations. This prevents any non-procedural or unplanned responses to node failures and provides a verifiable method of initiating a node check.

2. Selecting the Investigator Node (Node I)

The second phase of the process involves a new node called the investigator node (Node I). The investigator node is the node closest to the hash of the target node combined with the previous cycle marker, i.e. hash(T + C1). Choosing the investigator node this way is deterministic and verifiable, since any node can recompute the result. It also makes it hard for nodes to predict or influence who will be the investigator node, which increases security.

Additionally, using a hash function prevents bias or manipulation in choosing the investigator node, because the hash output is effectively random yet deterministic for a given input. Once the investigator node is determined, the sender node sends it the “investigate” transaction, which must be sent during the first quarter of cycle marker 1 (Q1 of C1). If the investigator node later creates a “node down” transaction (step 4), that transaction contains the original “investigate” transaction signed by the sender node and is also signed by the investigator node.
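The selection rule can be sketched in a few lines. This is an illustration under stated assumptions, not the Shardus implementation: we assume SHA-256 as the hash, plain string concatenation for T + C1, numeric distance between hashed IDs as the measure of “proximity”, and that the target itself is never a candidate. The function names are hypothetical.

```python
import hashlib
from typing import Sequence


def hash_to_int(data: str) -> int:
    """Map a string to a large integer via SHA-256 (assumed hash function)."""
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def select_investigator(
    target_id: str, cycle_marker: str, active_node_ids: Sequence[str]
) -> str:
    """Return the node whose hashed ID lies closest to hash(T + C1)."""
    point = hash_to_int(target_id + cycle_marker)
    # Assumption: the target cannot be asked to investigate itself.
    candidates = [n for n in active_node_ids if n != target_id]
    if not candidates:
        raise ValueError("no node available to investigate")
    # Ties are broken by node ID so every node computes the same answer.
    return min(candidates, key=lambda n: (abs(hash_to_int(n) - point), n))


nodes = ["node-a", "node-b", "node-c", "node-t"]
investigator = select_investigator("node-t", "cycle-marker-1", nodes)
```

Because the result depends only on the target, the cycle marker and the node list, any node holding the same inputs can recompute it and verify that the right investigator was contacted.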

3. Investigation Process

The investigator node starts the investigation upon receiving the “investigate” transaction and checks whether the target node is actually down. If the target node is not down, the investigator takes no further action and the process is complete. Having the investigator node check the status of the target node prevents the network from taking unnecessary action. This approach increases efficiency and reduces network noise; if the target node isn’t actually lost, there’s no need to disseminate any additional messages.

4. Node Down Transaction Creation (If Node T is Down)

The investigator node creates a “node down” transaction if it finds the target node is in fact down or unresponsive. This “node down” transaction is then gossiped during the first quarter of cycle marker 2 (Q1 of C2). The creation of a “node down” transaction by the investigator node provides a documented and timestamped record of the investigation’s findings. It serves as a formal declaration of the target node’s status, which is essential for network participants to agree on the state of the network and act accordingly.
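Steps 3 and 4 together amount to a single decision on the investigator’s side, which the sketch below tries to capture. Everything here is illustrative: the dictionary layout, the `handle_investigate` name and the `is_responsive` probe are hypothetical, and the signature is a placeholder string, not real cryptography.

```python
from typing import Callable, Optional


def handle_investigate(
    investigate_tx: dict,
    investigator_id: str,
    next_cycle_marker: str,
    is_responsive: Callable[[str], bool],
) -> Optional[dict]:
    """Return a "node down" transaction, or None if the target answers."""
    if is_responsive(investigate_tx["target"]):
        # Target is up: stay silent so nothing extra is gossiped.
        return None
    return {
        "type": "node_down",
        "investigate": investigate_tx,       # original request, signed by S
        "cycle_marker": next_cycle_marker,   # C2
        "investigator": investigator_id,     # I
        "signature": f"sign({investigator_id})",  # placeholder for sign(I)
    }


request = {"type": "investigate", "target": "node-T", "cycle_marker": "c1",
           "sender": "node-S", "signature": "sign(node-S)"}

# Simulated probes: one where the target answers, one where it does not.
nothing = handle_investigate(request, "node-I", "c2", lambda node: True)
down_tx = handle_investigate(request, "node-I", "c2", lambda node: False)
```

Passing the probe in as a function keeps the sketch independent of how responsiveness is actually checked, which this post does not specify.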

5. Gossiping the Node Down Transaction

During the first quarter of cycle marker 2 (Q1 of C2), nodes receive “node down” transactions either from the investigator node or from other nodes that have already received them, and propagate them by gossiping to additional nodes. This ensures that, regardless of the source, critical information about node failures is disseminated rapidly, enhancing the network’s resilience.

However, during the second quarter of cycle marker 2 (Q2 of C2), nodes ignore a transaction received from the investigator node but still gossip it if it is received from other nodes. Transactions received directly from the investigator node are ignored at this point to prevent unnecessary transactions and potential spam from over-reporting. Transactions received from other nodes are still gossiped to maintain network integrity and ensure all nodes receive the update, particularly those that missed the initial broadcast in Q1. This design helps optimize network traffic: information about node status is shared quickly and widely without overwhelming the network with redundant messages.
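The quarter-dependent rule is easy to express as a small predicate. The sketch below assumes four equal quarters in a 60-second cycle, which is our simplification of “approximately 60 seconds”, and it assumes that “node down” transactions arriving in Q3 or Q4 are not gossiped, since the post only describes Q1 and Q2. Both function names are hypothetical.

```python
def quarter_of(seconds_into_cycle: float, cycle_length: float = 60.0) -> int:
    """Return the quarter (1 to 4) for an offset within a cycle.

    Assumption: the four quarters are of equal length.
    """
    if not 0 <= seconds_into_cycle < cycle_length:
        raise ValueError("offset must fall inside the cycle")
    return int(seconds_into_cycle // (cycle_length / 4)) + 1


def should_gossip_down_tx(quarter: int, from_investigator: bool) -> bool:
    """Decide whether a received "node down" transaction is passed on."""
    if quarter == 1:
        return True  # Q1: accept from the investigator or from peers
    if quarter == 2:
        return not from_investigator  # Q2: only relayed copies are passed on
    return False  # assumption: Q3 and Q4 are not covered by the post


late_from_investigator = should_gossip_down_tx(quarter_of(20.0), True)
late_from_peer = should_gossip_down_tx(quarter_of(20.0), False)
```

With these assumptions, a message arriving 20 seconds into the cycle falls in Q2, so a copy sent directly by the investigator is dropped while a relayed copy is still forwarded.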

6. Node Up Transaction Creation (If Node T is Up)

During the first quarter of cycle marker 3 (Q1 of C3), the target node can issue a “node up” transaction if it is not down or unresponsive as claimed. This invalidates the previous “node down” transaction. It also allows the target node to formally declare its return to operational status, which is necessary to correct its previous status as “down” and prevent the network from routing around or excluding the target node unnecessarily from the active set. Furthermore, it gives control back to the target node to declare its operational status, ensuring that nodes have the ability to correct their designation autonomously.

7. Resolution of Transactions

During the third quarter of cycle marker 3 (Q3 of C3), nodes that have received both the ‘node down’ and ‘node up’ transactions through the gossip protocol collaboratively decide whether the target node (Node T) should be marked as lost. This ensures that the decision is based on a consensus among those nodes directly informed by the relevant transactions, preventing premature judgments about the target node’s status while at the same time taking steps to slash the node that is actually down. By participating in this evaluation, these informed nodes contribute to a collective and accurate assessment, maintaining the network’s reliability and up-to-date node records.
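As a rough illustration of the outcome of this step, the sketch below marks a target as lost only when a “node down” transaction exists and no “node up” transaction refutes it. It models the decision on a single node and leaves out the collaborative part. Cycles are represented as integer counters, and a refutation is required to come from a later cycle than the earliest “node down” claim; both are our assumptions, as are the field and function names.

```python
from typing import Iterable


def resolve_lost(
    target_id: str, down_txs: Iterable[dict], up_txs: Iterable[dict]
) -> bool:
    """Return True if the target should be marked as lost."""
    down_cycles = [tx["cycle"] for tx in down_txs if tx["target"] == target_id]
    if not down_cycles:
        return False  # nobody claimed the node was down
    earliest_down = min(down_cycles)
    # Assumption: a valid refutation is issued after the "node down" claim.
    refuted = any(
        tx["target"] == target_id and tx["cycle"] > earliest_down
        for tx in up_txs
    )
    return not refuted


downs = [{"target": "node-T", "cycle": 2}]
ups = [{"target": "node-T", "cycle": 3}]

still_active = not resolve_lost("node-T", downs, ups)  # refuted in time
marked_lost = resolve_lost("node-T", downs, [])        # no refutation
```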

8. Handling Edge Cases

As in any robust system, edge cases must be accounted for so that even the most unlikely scenarios are handled. For this reason, lost node detection has built-in mechanisms for edge cases that could otherwise cause failures in a technologically sound system. This is how lost node detection addresses them:

  • Alternate Investigator Node: If the investigator node is down, the sender node submits another “investigate” transaction in the next cycle.
  • Rogue Investigator Node: If the investigator node ignores the investigation, the sender node can resend the transaction to a potentially different investigator node after two cycles. This action is a fallback to ensure that the investigation is completed, preserving the integrity and accuracy of node status reporting within the network.
  • Duplicate Investigations: The investigator node should ignore repeated “investigate” messages for the same target node.
  • Investigator Node and Sender Node Same: If the sender node and the investigator node are the same, the sender node can’t initiate the “investigate” transaction.
  • Sender Node and Target Node Same: If the sender node and the target node are the same, the investigator node should ignore the “investigate” transaction.
  • Collusion: Even if sender node and investigator node collude, they can’t remove the target node as it can issue a “node up” transaction.
  • Multiple Node Down Transactions: Use the “node down” transaction from the earliest cycle.
  • Rogue Investigator Node Sending Multiple Node Down Transactions: Prioritize the “node down” transaction with the lowest S value. The “S value” is a sortable attribute used to order transactions; when multiple “node down” transactions are compared, the one with the lowest S value is treated as the authoritative version.
  • Avoiding Excessive Gossip: The investigator node doesn’t gossip if it finds the target node is active, to prevent unnecessary network traffic.
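Several of the rules above are simple checks, so here is a sketch of how an investigator might filter incoming “investigate” transactions and how a node might pick the authoritative “node down” transaction. The field names are hypothetical, the S value is treated as an opaque sortable field because the post does not define it further, and using it as a tie-break after the earliest cycle is our own way of combining the two rules.

```python
from typing import Sequence, Set


def accept_investigate(tx: dict, investigator_id: str, seen_targets: Set[str]) -> bool:
    """Apply the sender/target/duplicate rules on the investigator side."""
    if tx["sender"] == tx["target"]:
        return False  # sender and target are the same node
    if tx["sender"] == investigator_id:
        return False  # sender and investigator are the same node
    if tx["target"] in seen_targets:
        return False  # repeated request for a target already handled
    seen_targets.add(tx["target"])
    return True


def authoritative_down_tx(down_txs: Sequence[dict]) -> dict:
    """Pick the earliest-cycle transaction, then the lowest S value.

    Raises ValueError if no transactions are given.
    """
    return min(down_txs, key=lambda tx: (tx["cycle"], tx["s_value"]))


seen: Set[str] = set()
first = accept_investigate({"sender": "node-S", "target": "node-T"}, "node-I", seen)
repeat = accept_investigate({"sender": "node-S", "target": "node-T"}, "node-I", seen)

chosen = authoritative_down_tx([
    {"cycle": 5, "s_value": 1},
    {"cycle": 4, "s_value": 9},
    {"cycle": 4, "s_value": 3},
])
```

The caller decides how long `seen_targets` lives, for example one set per cycle; the post does not say how long duplicates should be suppressed.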

Conclusion

Lost node detection is a unique mechanism, consisting of eight key steps, that efficiently detects and manages active nodes that become unresponsive or lose connectivity, marking them as “lost” without premature penalization. Nodes that resume responsiveness, or that were wrongly labelled as lost, can restore their active status by submitting a “node up” transaction. By allowing these validator nodes to regain active status without imposing harsh penalties, the process maintains network integrity and performance while enhancing the network’s resilience and adaptability.

