Shardeum Milestone: Restoring the Dynamically Sharded Network
Shardeum's groundbreaking architecture enabled it to easily overcome a major betanet crash making it the first sharded blockchain network to self-restore all its...
On Monday, January 27th 2024, a groundbreaking event unfolded in the world of web 3 that can be likened to the historic feat of a spacecraft flawlessly returning to its launch pad after its first test flight mission. In this extraordinary scenario, Shardeum not only faced a formidable challenge but also emerged triumphant, its network resilience marking the first time a sharded network has self-restored in the realm of distributed ledger technology.
Just as a spacecraft’s journey involves meticulous planning, precision engineering, and the seamless execution of complex maneuvers, Shardeum’s restoration of its Sphinx betanet, which had experienced a critical crash, required an equivalent level of technological mastery and innovation. The ability to preserve all data on a network, particularly one that operates using dynamic state sharding, is nothing short of groundbreaking.
As we embark on this exploration, we not only celebrate Shardeum’s momentous landing but also recognize it as a pivotal moment in the evolution of web 3 technology – a leap that could redefine the boundaries of IT network resilience and data integrity.
Maintaining and restoring a dynamically sharded network, such as Shardeum, encompasses a spectrum of complex technical challenges that set it apart from traditional blockchain networks such as Bitcoin or Ethereum. In a dynamically state-sharded environment with autoscaling, the continual reallocation and balancing of nodes and resources across different shards are crucial for optimizing performance and scalability. This constant flux in the network architecture adds significant complexity to maintaining data consistency, ensuring network stability, and facilitating effective failure recovery.
The criticality of these challenges is accentuated when contrasting Shardeum’s response to node fluctuations with that of Bitcoin. The Bitcoin network retains functionality even with a minimal number of nodes, as each full node possesses the complete state and transaction history. Conversely, each active node on Shardeum does not possess the complete state and transaction history, as the Shardeum network is sharded, and each validator only holds a subset of the entire state. The consequence of this sharding is that all validator nodes are extremely lightweight. This, therefore, creates a plethora of engineering opportunities and challenges. If a node goes down, how do we ensure all the data is preserved? Shardeum addresses this in two major ways.
Firstly, Shardeum uses dynamic state sharding, in which the overall address space is partitioned (or divided) according to the number of active nodes. Each node is accountable for its assigned partitions, along with a certain radius (R) around it and extra partitions (E) adjacent to it, ensuring dynamic adaptability and robust data redundancy within the network’s framework. So even if a node goes down, there is still network continuity and no loss of data.
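To make the radius (R) and extra partitions (E) idea concrete, here is a minimal TypeScript sketch of how a node’s data coverage could be derived from its home partition. The function names, parameters, and numbers are illustrative assumptions, not Shardeum’s actual implementation.

```typescript
// Hypothetical sketch: a node covers its home partition, R neighbors on each
// side of the partition ring, and E further adjacent partitions for redundancy.
interface ShardCoverage {
  homePartition: number;        // partition this node is primarily accountable for
  coveredPartitions: number[];  // all partitions whose data this node stores
}

function computeCoverage(
  homePartition: number,
  totalPartitions: number,      // one partition per active node in this sketch
  radius: number,               // R: neighboring partitions stored on each side
  extraPartitions: number       // E: additional adjacent partitions
): ShardCoverage {
  const covered = new Set<number>();
  for (let offset = -(radius + extraPartitions); offset <= radius + extraPartitions; offset++) {
    // Wrap around the partition ring so coverage works at the edges.
    const p = ((homePartition + offset) % totalPartitions + totalPartitions) % totalPartitions;
    covered.add(p);
  }
  return { homePartition, coveredPartitions: [...covered].sort((a, b) => a - b) };
}

// Example: 16 active nodes -> 16 partitions; node 3 with R = 2 and E = 1
// covers partitions 0 through 6, so its data survives on neighbors if it drops out.
console.log(computeCoverage(3, 16, 2, 1).coveredPartitions);
```

Because neighboring nodes overlap in what they store, any single node dropping out leaves its assigned data still held elsewhere in the network.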
Secondly, Shardeum uses archiver nodes to store the complete state of the entire network. This is achieved by active nodes streaming their partially stored states to the archivers for collection. Due to these two factors and design optimizations, restoring such a network has to be designed in a novel way in order to still facilitate advantageous features such as autoscaling and linear scaling.
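As a rough illustration of that streaming relationship, the sketch below shows an active node posting the portion of state it holds to a set of archivers each cycle. The endpoint, payload shape, and function names are assumptions for illustration only, not the real Shardus API.

```typescript
// Minimal sketch: each active node only holds a subset of the state, but by
// streaming its subset every cycle, the union received by the archivers
// amounts to the complete network state.
interface PartitionStateUpdate {
  nodeId: string;
  cycle: number;
  partitions: number[];   // partitions this node covers
  receipts: unknown[];    // receipts / state deltas produced this cycle
}

async function streamStateToArchivers(
  archiverUrls: string[],
  update: PartitionStateUpdate
): Promise<void> {
  await Promise.all(
    archiverUrls.map((url) =>
      fetch(`${url}/state-update`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(update),
      })
    )
  );
}
```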
Now that we understand the basics of dynamic state sharding and that archiver nodes are somehow involved, let us unravel some additional components in more depth. In order to understand the crash and the restoration of the Shardeum betanet, we must first understand a little about the following: archiver nodes, the lost archiver detection protocol, network modes, and recovery mode in particular.
Understanding the basics of each of these is necessary before we dive headfirst into the bugs, so let’s take a look!
In Shardeum, archiver nodes, also called archivers, stand as a critical category of nodes, tasked with preserving the network’s full state and historical records. Distinguished from active nodes, archivers do not participate in consensus processes; their primary function is to comprehensively archive the entirety of the network’s data, including transactions and receipts. The contribution of archiver nodes is essential for upholding the network’s integrity and ensuring its seamless operation, thereby affirming Shardeum’s status as a robust, complete, and reliable network. Due to archivers being integral to its network, Shardeum must have a protocol to detect any unresponsive archivers (and validators).
Shardeum has a protocol called the lost node detection protocol, which detects when active nodes become non-operational – this is reserved solely for active nodes. However, Shardeum also has a similar protocol for archivers called lost archiver detection. Lost archiver detection is a specialized protocol designed to handle rare scenarios wherein one or more archivers become non-operational and are marked as lost. Because archiver nodes are pivotal in maintaining the integrity and accessibility of historical data within the network, it is essential that, on the rare occasion they are unresponsive or non-performant, these critical events are caught in order to mitigate any downstream effects. Although archivers being lost did not cause this particular crash, the interplay between the lost archiver detection protocol and a specific network mode did. Let us now examine what network modes are on Shardeum.
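To give a flavor of what such a protocol might look like, here is a hypothetical health-check loop in which an archiver that misses several consecutive checks is marked as lost. The endpoint name, timeout, and threshold are assumptions for illustration, not the actual Shardus implementation.

```typescript
// Illustrative sketch of lost archiver detection: if an archiver stops
// responding to health checks for enough consecutive cycles, it is marked lost
// so downstream logic can react.
interface ArchiverRecord {
  url: string;
  missedChecks: number;
  lost: boolean;
}

const MAX_MISSED_CHECKS = 3; // assumed threshold before an archiver is marked lost

async function checkArchiver(archiver: ArchiverRecord): Promise<void> {
  try {
    const res = await fetch(`${archiver.url}/health`, { signal: AbortSignal.timeout(5000) });
    if (!res.ok) throw new Error(`status ${res.status}`);
    archiver.missedChecks = 0;            // responsive: reset the counter
  } catch {
    archiver.missedChecks += 1;           // unresponsive this cycle
    if (archiver.missedChecks >= MAX_MISSED_CHECKS) {
      archiver.lost = true;               // flag the archiver as lost
    }
  }
}
```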
A flagship innovation on Shardeum powered by the underlying Shardus protocol is the network modes framework. These modes transcend basic operational states, embodying an intricate coordination of varied node activities, data synchronization methods, and transaction management systems. Such network configurations play a crucial role in preserving the operational integrity of the network, especially in scenarios characterized by the loss of nodes and data – as Shardeum is a sharded network.
At a simpler level, the best way to conceive of network modes on Shardeum, is as hard-coded contingency plans that enable continuity of operations for the entire network – even in unlikely conditions such as the network crashing or network-wide degradation. This pre-programmed operational resilience and robustness ensures that Shardeum will always live on – no matter what adversity the network faces.
Although understanding the bug doesn’t require digesting every facet of the network modes framework, it does help to be acquainted with the basics. Central to the network modes framework is the incorporation of several distinct network phases: Forming, Processing, Safety, Recovery, Restart, Restore, and Shutdown. These modes are carefully crafted to address a variety of network situations and emergencies. However, the mode that concerns us in this article is recovery mode.
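For orientation, the seven modes can be pictured as a simple TypeScript enum. This is a sketch only; the comments are rough glosses based on the mode names and the descriptions in this article, not the exact Shardus definitions.

```typescript
// The seven network modes named above, expressed as an illustrative enum.
enum NetworkMode {
  Forming = 'forming',       // network is bootstrapping its initial set of active nodes
  Processing = 'processing', // normal operation: application transactions are processed
  Safety = 'safety',         // protective, degraded operation (gloss from the name)
  Recovery = 'recovery',     // active node count has fallen below the critical threshold
  Restart = 'restart',       // network restarts from previously saved data (gloss from the name)
  Restore = 'restore',       // state is reconstructed from data preserved on archivers
  Shutdown = 'shutdown',     // orderly wind-down of the network (gloss from the name)
}
```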
Recovery mode is one of the seven aforementioned network modes. It is initiated when the network’s active node count descends below a predefined critical threshold (currently configured at 75% or lower). This threshold is adjustable as per network requirements. In this mode, the network temporarily halts the processing of application transactions and the synchronizing of application data. This strategy is designed to facilitate the network’s expansion by seamlessly cycling in standby nodes as part of node rotation, thereby restoring the active node count to an optimal level, ideally back above 100% of the desired count.
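Expressed as a minimal sketch, the trigger condition looks roughly like this; the 0.75 value mirrors the configurable threshold mentioned above, and the names are illustrative assumptions.

```typescript
// Sketch of the recovery-mode trigger: if the active node count drops to 75%
// or less of the desired count, the network switches into recovery mode.
const RECOVERY_THRESHOLD = 0.75; // configurable per network requirements

function shouldEnterRecovery(activeNodes: number, desiredNodes: number): boolean {
  return activeNodes <= desiredNodes * RECOVERY_THRESHOLD;
}

// Example: with a desired count of 1,000 nodes, dropping to 750 or fewer
// active nodes would push the network into recovery mode.
console.log(shouldEnterRecovery(750, 1000)); // true
```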
During recovery mode, Shardeum’s network architecture allows for an incremental increase in nodes, capped at 20% growth per cycle (each cycle approximating 60 seconds). This controlled growth rate is critical for maintaining network stability and ensuring seamless integration of new nodes. A rapid increase in node count, such as a 50% surge, could potentially destabilize the network and complicate the integration process.
Each newly added node necessitates network resources for synchronization and integration. By limiting the increase to 20% per cycle, the network ensures that its infrastructure can adequately support the addition of new nodes without strain. This approach not only preserves network stability but also minimizes the risk of data inconsistencies or errors during the synchronization process, thereby maintaining the integrity of the cycle chain data.
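A back-of-the-envelope sketch of that growth cap, using the 20% per cycle and roughly 60-second figures above (all names are illustrative), shows how quickly the network can climb back to its target size without a destabilizing jump.

```typescript
// Sketch of the 20% per-cycle growth cap: each ~60 second cycle, at most 20%
// more nodes than are currently active may be admitted.
const MAX_GROWTH_PER_CYCLE = 0.2; // 20% cap described above
const CYCLE_SECONDS = 60;         // approximate cycle duration

function maxNodesToAdmit(activeNodes: number): number {
  return Math.floor(activeNodes * MAX_GROWTH_PER_CYCLE);
}

// Rough example: recovering from 500 to 1,000 active nodes at 20% per cycle
// takes about 4 cycles, i.e. only a few minutes.
let nodes = 500;
let cycles = 0;
while (nodes < 1000) {
  nodes += Math.max(1, maxNodesToAdmit(nodes));
  cycles++;
}
console.log(`${cycles} cycles (~${cycles * CYCLE_SECONDS} seconds)`);
```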
It is important to note that there were two distinct bugs: the Neon library bug, which caused validators to crash seemingly at random, and a bug in the lost archiver detection protocol, which would not accept an empty validator list. Although it was the lost archiver detection protocol bug that caused the crash of the prevailing betanet version, I would like to take you through the Neon library bug first.
In version Sphinx 1.9.1, we integrated an update to a library that employs Neon bindings for bridging Rust and TypeScript functionalities, as Shardeum is predominantly built in TypeScript. Neon is recognized for its innovative, albeit experimental, approach, often pushing the boundaries of conventional software development practices. This integration was aimed at enhancing the interoperability between these two languages, enabling more efficient and direct communication within our software architecture. However, it introduced a bug that caused nodes to randomly drop out of the network.
Secondly, the root cause of the incident that led to the betanet crash was identified as a critical anomaly originating from the interplay between the two distinct aforementioned subsystems: the lost archiver detection mechanism and the network recovery mode protocol.
This brief crash was precipitated by the simultaneous activation of these two mechanisms, a scenario not encountered or tested previously. The lost archiver process was triggered alongside network recovery mode, and due to a bug in the lost archiver detection protocol not accepting an empty active node list, the network crashed.
So what else actually happened and when? The timeline of events surrounding the network crash and its resolution is as follows:
Now that we know the timeline and how the major events unfolded, let’s take a sneak peek at how the network managed to restore itself so swiftly.
The restoration of the network consisted of many parts, but one of the key ones was Shardeum’s recovery mode. As previously stated, recovery mode is initiated when the network’s active node count descends below a predefined critical threshold, and it allows rapid, controlled, and effective network growth in a secure fashion to restore the network. It is important to stress that without the technological ingenuity of the network modes’ designers and developers, Shardeum would not have been able to recover from the crash so easily, nor display its innovative prowess.
Additionally, key efforts were made by Shardeum’s tech team as they launched into immediate action. The initial step involved a thorough analysis to identify the root cause of the crash, which was traced to an anomaly in the interaction between the network’s lost archiver detection and its recovery mode systems. Understanding the complexity of the issue, the team swiftly implemented a multi-faceted approach to address both the immediate fallout and the underlying vulnerabilities.
Technically, the response was multifaceted: firstly, the team isolated the affected components to prevent further network degradation. Concurrently, they deployed a patch to rectify the bug in the lost archiver system, ensuring it could handle an empty validator list – a critical oversight that had precipitated the network’s failure. To restore the network to its full operational capacity, data preserved on archivers was then activated and utilized to reconstruct the network’s state prior to the crash, ensuring no data loss in the process.
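To illustrate the kind of fix described, here is a hypothetical before-and-after sketch of a quorum check inside lost archiver logic: the buggy shape implicitly assumes a non-empty active validator list, while the patched shape explicitly handles the empty case that recovery mode can produce. All names and the quorum fraction are assumptions for illustration, not the actual patch.

```typescript
// Hypothetical sketch of the class of bug and fix described above.
interface LostArchiverReport {
  archiverId: string;
  reportedBy: string[]; // validators that reported the archiver as unresponsive
}

// Buggy shape: assumes at least one active validator exists. With an empty
// list the ratio becomes Infinity or NaN, and the check misbehaves instead of
// deferring, which is the kind of oversight that can bring the process down.
function confirmLostArchiverBuggy(report: LostArchiverReport, activeValidators: string[]): boolean {
  return report.reportedBy.length / activeValidators.length > 0.5;
}

// Patched shape: an empty active validator list is handled explicitly, so the
// protocol can coexist with recovery mode rather than failing alongside it.
function confirmLostArchiverPatched(report: LostArchiverReport, activeValidators: string[]): boolean {
  if (activeValidators.length === 0) {
    return false; // defer the decision until validators are active again
  }
  return report.reportedBy.length / activeValidators.length > 0.5;
}
```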
Logistically, the team coordinated across different time zones and disciplines, leveraging cloud-based tools for real-time collaboration and monitoring. This coordinated effort not only facilitated the rapid development and deployment of fixes but also ensured that all team members were aligned on the recovery process and next steps.
This incident served as a rigorous test of Shardeum’s crash management protocols and highlighted the importance of agile, innovative responses to unforeseen challenges. It underscores the team’s commitment to maintaining a resilient and secure network, prepared to tackle complex technical hurdles as they arise.
In conclusion, the successful restoration of Shardeum’s sharded network heralds a significant shift in network technology, marking a milestone with far-reaching implications for the industry. Although largely unrecognized currently, innovations like network modes will eventually set new industry standards across web 3.
It has been my long-held belief that Shardeum’s core innovations are highly likely to influence future technological developments, inspiring innovations and new generations of ledger technologies. Having borne witness to Shardeum’s first-of-its-kind network restoration firsthand, I know this will catalyze a reevaluation of industry standards, potentially leading to the adoption of more rigorous protocols and methodologies in network design and architecture.
This event not only showcases the technical prowess and innovation of Shardeum’s team but also signals the dawn of an era where decentralized networks are more robust, adaptable, and capable of handling unforeseen challenges with disaster recovery planning. Ultimately, Shardeum’s technology will herald a new era of decentralization.