RoCEv2 Lossless iFabric Solution
Distributed storage depends on deterministic networks with high bandwidth and low latency, which has traditionally meant Fibre Channel (FC), InfiniBand (IB), or similar technologies. Maipu NSS18500/NSS7830/NSS5950/5930/5830 data center switches are based on lossless Ethernet (RoCEv2) technology and have been deployed in the data center networks of banking, energy, and government customers, showing that lossless Ethernet (RoCEv2) can replace traditional InfiniBand (IB) in the storage networks of some industry customers.
The evolution and challenges of solid-state and distributed storage
The widespread adoption of solid-state drives (SSDs), and especially the emergence of NVMe, an interface specification optimized for NAND flash and solid-state storage, has further unleashed the performance potential of SSDs. The success of NVMe has driven the development of hybrid-flash and all-flash high-speed distributed storage, allowing enterprises to cope with exponential data growth in the 5G and AI eras and to manage and store data efficiently while improving latency, reliability, and performance. In intelligent computing in particular, traditional storage can no longer meet the high-throughput, low-latency needs of large models with tens or hundreds of billions of parameters. Using an NVMe distributed storage matrix to provide high-performance metadata storage and data transfer to computing clusters has become the inevitable choice.
SSD interface 4K IOPS
As shown in the figure above, SSD storage performance has improved rapidly. The IOPS (input/output operations per second) of SAS SSDs is 2.39 to 3.39 times higher than that of SATA SSDs. NVMe improves read and write performance significantly further: the IOPS of NVMe SSDs is 1.99 to 2.24 times higher than that of SAS SSDs and 7.6 times higher than that of SATA SSDs. As PCIe interface standards evolve, NVMe throughput will continue to improve. The interface bandwidth of PCIe 4.0 is double that of PCIe 3.0, and in the PCIe 5.0 era the interface speed of NVMe SSDs will rise to 26 times that of the SATA SSD interface. NVMe has almost closed the large performance gap between storage and computing that built up during a decade of slow storage development, but it also places higher demands on high-speed storage networks.
Challenge 1 - High Throughput
Because the PCIe interface with NVMe replaces the SATA interface with AHCI, NVMe SSD throughput improves with each PCIe generation:
Bandwidth Differences Between PCIe and SATA
Based on the current mainstream PCIe 4.0, a single lane delivers up to 1969 MB/s, so the theoretical bandwidth of PCIe 4.0 x4, x8, and x16 is 7.8 GB/s, 15.7 GB/s, and 31.5 GB/s respectively. If that data has to be fetched over the network, the theoretical network bandwidth requirements are 62.4 Gbps, 125.6 Gbps, and 252 Gbps respectively. PCIe 5.0 will push the requirements higher: the theoretical network bandwidth requirements of PCIe 5.0 x8 and x16 are 196.8 Gbps and 393.6 Gbps respectively, and can exceed 1 Tbps in extreme cases.
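The PCIe 4.0 arithmetic above can be reproduced with a short sketch. It assumes decimal units (1 GB = 1000 MB) and a plain factor of 8 bits per byte, with no protocol overhead. The function name pcie_bandwidth is illustrative only. Because the sketch does not round the GB/s value before converting, its x4 and x8 results come out slightly above the rounded figures quoted in the text.

```python
# Per-lane PCIe 4.0 throughput as quoted in the text (MB/s).
PCIE4_LANE_MB_PER_S = 1969

def pcie_bandwidth(lanes, lane_mb_per_s=PCIE4_LANE_MB_PER_S):
    """Return (link bandwidth in GB/s, matching network bandwidth in Gbps)."""
    gb_per_s = lanes * lane_mb_per_s / 1000  # assumption: 1 GB = 1000 MB
    gbps = gb_per_s * 8                      # 8 bits per byte, no overhead modeled
    return gb_per_s, gbps

for lanes in (4, 8, 16):
    gb_per_s, gbps = pcie_bandwidth(lanes)
    print(f"PCIe 4.0 x{lanes}: {gb_per_s:.3f} GB/s needs about {gbps:.1f} Gbps")
```

Even a single x8 device already exceeds one 100G port under these assumptions, which is why the storage network has to scale with the PCIe generation.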
Challenge 2 - Low Latency
The largest latency improvement from HDD to SSD comes from the medium itself: SSD NAND flash responds far faster than a mechanical hard disk.
From traditional SATA HDDs to NVMe SSDs, latency has dropped rapidly from milliseconds to microseconds (μs). A high-speed storage network must offer latency lower than that of the storage system itself to keep up with NVMe SSDs.
Given these two challenges, today's high-speed storage networks must be built on a high-throughput, low-latency lossless network. This is why high-speed storage networks have relied on IB solutions in recent years.
Maipu Lossless Ethernet (RoCEv2) Solution Comparison Test
Together with a domestic distributed storage vendor, Maipu ran a side-by-side performance comparison between its NSS5950 series 100G lossless Ethernet switch and IB switch products in this scenario. The goal was to determine whether Maipu's lossless Ethernet switching products can meet the stringent requirements of the NVMe switching matrix in large-capacity, high-performance distributed storage scenarios.
Test Environment:
The CPUs in the storage node cluster are allocated to the Chunk Server (CS) and the Frontend (FE) in a 5:2 ratio. After storage PGO optimization, sequential reads and random reads and writes of different sizes are performed. An IO is counted as successful only when all three replicas are fully recovered in terms of IOPS and latency.
IOPS Test Data Results
Test Result 1:
In a strongly consistent three-replica system that combines full-process TRIM, QoS, 16KB atomic writes, and other techniques, the Frontend divides large writes into smaller requests and sends them to different storage nodes and different CPU cores. This keeps the size of network synchronization packets within a specific range (8K~128K).
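The request splitting described in this result can be sketched as follows. This is a minimal illustration, not the storage vendor's implementation: the round-robin placement, the function name split_write, and the node names are all hypothetical, and replication to three copies is not modeled. Only the 8K~128K window comes from the text.

```python
MIN_CHUNK = 8 * 1024    # lower bound of the packet-size range quoted in the text
MAX_CHUNK = 128 * 1024  # upper bound of the packet-size range quoted in the text

def split_write(offset, length, nodes, cores_per_node, chunk=MAX_CHUNK):
    """Split one large write into sub-requests of at most `chunk` bytes.

    Sub-requests are spread round-robin over storage nodes, then over CPU
    cores (a hypothetical placement policy). The last sub-request may be
    shorter than MIN_CHUNK when `length` is not a multiple of `chunk`.
    """
    if not MIN_CHUNK <= chunk <= MAX_CHUNK:
        raise ValueError("chunk size must stay within the 8K-128K window")
    requests = []
    index = 0
    while length > 0:
        size = min(chunk, length)
        requests.append({
            "offset": offset,
            "size": size,
            "node": nodes[index % len(nodes)],
            "core": (index // len(nodes)) % cores_per_node,
        })
        offset += size
        length -= size
        index += 1
    return requests

# Example: a 1 MiB write across three hypothetical nodes with four cores each.
reqs = split_write(0, 1024 * 1024, ["node-a", "node-b", "node-c"], cores_per_node=4)
for r in reqs:
    print(r)
```

Bounding each sub-request keeps individual RDMA transfers short, so one large write cannot occupy a link or a core for long.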
Test Result 2:
In the supporting network solution, the traditional IB solution delivers good throughput and latency across a wider range of block sizes. In the 4K~512K block size range, however, the Maipu NSS5950 series 100G lossless Ethernet (RoCEv2) switch also achieves good lossless transmission performance, and in some ranges it outperforms the IB solution (IOPS increased by 2%~20%, latency reduced by 2%~32%).
Test Result 3:
Building on the enhanced striping described above, combined with a highly deterministic RoCE-based NVMe over Fabrics network, latency can be reduced further and overall throughput improved. While greatly improving network cost-effectiveness, the μs-level low-latency Maipu lossless Ethernet switch raises the performance of the distributed storage cluster to more than 3 million random read/write IOPS, with average system latency <300 μs and P99 latency <800 μs.
The above experimental results show that:
In the specific scenarios and equipment environments above, the performance of Maipu NSS5950 lossless Ethernet switching products is no lower than that of IB products.
Maipu's highly deterministic Lossless iFabric storage network
The distributed storage network uses Maipu's Lossless iFabric intelligent lossless storage network solution. Based on NVMe over RoCEv2, it provides a lossless, end-to-end, high-performance storage network for all-IP data center storage scenarios. It ensures ultra-high IOPS and ultra-low latency for the storage system, removes the storage network as a performance bottleneck, and fully releases the performance advantages of all-flash arrays, giving the NVMe storage network high performance and high determinism.
Maipu RoCEv2 Data Center Switch Family
Extreme IOPS, ultra-high throughput
Supports NVMe over RoCE. Compared with traditional FC storage, IOPS is greatly improved and latency is further reduced by more than 20%, fully unleashing the performance of NVMe storage.
Zero packet loss/low latency, ultimate performance
Priority-based Flow Control (PFC) and Explicit Congestion Notification (ECN) control algorithms avoid the packet loss and retransmission that traditional Ethernet can suffer under network congestion. This reduces network latency and jitter, and so delivers higher performance.
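How the two mechanisms divide the work can be sketched with a simplified queue model. It is an illustration under stated assumptions, not Maipu's algorithm or configuration: the linear RED-style marking curve, the threshold values, and the names ecn_mark_probability and PfcPort are all hypothetical.

```python
def ecn_mark_probability(queue_kb, kmin_kb=100, kmax_kb=400, pmax=0.2):
    """Probability of ECN-marking a packet at a given egress queue depth.

    Assumed RED-style curve: no marking below kmin, a linear ramp up to
    pmax between kmin and kmax, and every packet marked at or above kmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)

class PfcPort:
    """Per-priority ingress buffer with hypothetical XOFF/XON thresholds."""

    def __init__(self, xoff_kb=800, xon_kb=600):
        self.xoff_kb = xoff_kb
        self.xon_kb = xon_kb
        self.paused = False

    def update(self, ingress_kb):
        """Return 'PAUSE', 'RESUME', or None for one buffer occupancy sample."""
        if not self.paused and ingress_kb >= self.xoff_kb:
            self.paused = True
            return "PAUSE"   # ask the upstream sender to stop this priority
        if self.paused and ingress_kb <= self.xon_kb:
            self.paused = False
            return "RESUME"  # buffer has drained, let traffic flow again
        return None

port = PfcPort()
for depth in (50, 250, 500, 850, 700, 550):
    print(depth, ecn_mark_probability(depth), port.update(depth))
```

In this sketch the ECN thresholds sit below the PFC XOFF threshold, so senders are asked to slow down first and PFC pause frames act only as a last resort before the buffer would overflow.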
Intelligent iNOF, efficient and reliable
Based on iNOF (Intelligent Lossless NVMe Over Fabric), it automatically discovers and quickly detects large numbers of storage devices and notifies hosts within milliseconds, greatly simplifying operation and maintenance and improving its efficiency. It supports iNOF multipath and iNOF active-active technology, quickly detects link and port failures, discovers network failures within seconds and switches away from the faulty path, improving the reliability of the storage infrastructure.
Independent ecological environment, supply security
Maipu NSS18500/NSS7830/NSS5950/5930/5830 series lossless Ethernet switches implement a full-stack independent NVMe solution and are compatible with mainstream domestic and foreign servers, operating systems, and NICs, providing users with a safe, stable, and reliable data center infrastructure network.