Modern HPC installations that are designed for massive amounts of computing need to evaluate and understand their networking requirements as well. The more distributed an application is, the more likely it is to require very fast, tightly integrated networking between nodes to achieve the anticipated time to completion.
Environments with node counts in the 100 to 10,000 range may be sufficiently served by existing networking products. As the node count rises above 10,000, scaling issues may surface, especially as message-passing requirements increase as well. This becomes important as systems start to be designed for performance in the exaflop range. Every aspect of designing a large system that can perform at this level needs to be considered.
The Intel Omni-Path Architecture is an example of a networking system that has been designed for the Exascale era. It has many features that will enable this massive scaling of compute resources. Features and functionality are designed in at both the host and fabric levels, and designing all of the components together is what enables scaling to very large systems. Integrating the CPU and fabric also increases reliability, which will be critical as the number of nodes expands well beyond any system in operation today. In addition, the tools and software have been designed to be installed and managed across the very large number of compute nodes that will be necessary to achieve this next level of performance.
Features that will be critical in this next age of computing include adaptive routing, so that if one part of the network is congested, traffic can be re-routed around it. As these large systems become operational, the routing software will be required to determine and configure the optimal routes from one node to another. Traffic flow optimization is also important for maintaining high performance: high-priority traffic can interrupt lower-priority traffic, and the system continues to run smoothly and as expected.
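The idea behind adaptive routing can be illustrated with a small sketch. The code below is not the Omni-Path routing algorithm. It is a generic shortest-path search (Dijkstra's algorithm) in which each link's cost is its base cost plus a congestion penalty, so a congested link makes the router prefer an alternative path. The function name `least_cost_route`, the four-node topology, and the penalty values are all hypothetical and chosen only for illustration.

```python
import heapq


def least_cost_route(links, congestion, src, dst):
    """Return (cost, path) for the cheapest route from src to dst.

    links:      dict mapping node -> {neighbor: base link cost}
    congestion: dict mapping (node, neighbor) -> extra cost for that link
                (links not listed are treated as uncongested)
    """
    # Each queue entry is (cost so far, current node, path taken so far).
    queue = [(0.0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, base in links.get(node, {}).items():
            if neighbor not in visited:
                penalty = congestion.get((node, neighbor), 0.0)
                heapq.heappush(
                    queue, (cost + base + penalty, neighbor, path + [neighbor])
                )
    # No route exists between src and dst.
    return float("inf"), []


# Hypothetical four-node fabric with two equal-cost paths from A to D.
links = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"D": 1.0},
    "C": {"D": 1.0},
    "D": {},
}

# With no congestion, the route goes through B (ties break alphabetically).
idle = least_cost_route(links, {}, "A", "D")

# When the A -> B link is congested, the route shifts to go through C.
congested = least_cost_route(links, {("A", "B"): 5.0}, "A", "D")

print(idle)       # (2.0, ['A', 'B', 'D'])
print(congested)  # (2.0, ['A', 'C', 'D'])
```

A production fabric would make this decision in switch hardware or in a fabric manager using live telemetry rather than a static penalty table. The sketch only shows why feeding congestion into the link cost changes the chosen route.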
The next generation of systems will have to deal with challenges that are only beginning to be addressed and that span a wide range of technologies. It is important to investigate the node designs, the processor technology, and the optimization software, as well as the networking choices and infrastructure.