In today’s highly competitive world, High Performance Computing (HPC) is a game changer. Though not as splashy as many other computing trends, the HPC market has continued to show steady growth and success over the last several decades. Market forecaster IDC expects the overall HPC market to hit $31 billion by 2019 while riding an 8.3% CAGR. The HPC market cuts across many sectors, including academia, government, and industry.
HPC use in industry has seen steady growth, driven by competitive advantage and a strong return on investment. Indeed, recent IDC figures show that HPC across all sectors returned $515 for every dollar invested. Clearly, HPC has gone from a specialized back-room art to a strategic technology that all sectors must employ to remain competitive.
Some specific areas that have shown great benefit from HPC are:
- Higher Education
- Life Sciences
- Manufacturing
- Government Labs
- Oil and Gas
- Weather Modeling and Prediction
In the past, HPC programs often required both domain experts and computer scientists to achieve successful operation. In today’s HPC environment, project success and a good return on investment (ROI) can be achieved with many off-the-shelf solutions. Understanding the issues, costs, and advantages of the various technologies is paramount to capturing this success.
This is the second article in a series from the insideHPC Guide to Successful Technical Computing.
Understanding HPC Hardware
In many respects, modern HPC hardware is based on commodity processors and components. Present-day economics have all but eliminated specialized processors from the market, and almost all HPC systems are based on commodity processors from Intel, AMD, and IBM. According to the November 2015 Top500 list of the world’s fastest computers, Intel processors were the most widely used, appearing in 80% of all systems.
Modern processors are composed of multiple cores, which are themselves full processing units. Thus, each processor can run many tasks simultaneously while still providing full performance for each task. In many HPC environments, accelerators such as GPUs or the Intel Xeon Phi are also employed to speed up HPC applications.
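To make the multi-core idea concrete, here is a minimal, hedged C/OpenMP sketch (not taken from the guide) in which each thread runs on its own core and handles an independent task. The file name, output text, and thread count are illustrative only, and it assumes a compiler with OpenMP support (e.g., gcc -fopenmp).

```c
/* Minimal C/OpenMP sketch: one thread per core, each handling its own task.
 * Illustrative only -- not from the guide.
 * Build (assumes GCC with OpenMP): gcc -fopenmp core_demo.c -o core_demo
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid    = omp_get_thread_num();   /* this thread's index          */
        int nthrds = omp_get_num_threads();  /* threads in the parallel team */
        printf("Thread %d of %d is working on its own task\n", tid, nthrds);
    }
    return 0;
}
```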
The total amount of usable memory is also important for many HPC applications. In most cases, HPC users will suggest “the bigger, the better” when asked about the size of system memory. Usable memory size is often an enabling factor for many large HPC applications. As an example, the maximum amount of usable memory for high-end commodity Intel Xeon HPC processors is currently 1.5 terabytes (TB).
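As a quick way to see how much memory a node actually exposes, the short C sketch below queries the system page count and page size. It is not from the guide and assumes a Linux/glibc system where the _SC_PHYS_PAGES sysconf name is available.

```c
/* Illustrative sketch (not from the guide): report total physical memory on
 * the node, a simple sanity check before sizing an in-memory HPC workload.
 * Assumes Linux/glibc, where _SC_PHYS_PAGES is provided. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pages     = sysconf(_SC_PHYS_PAGES); /* physical pages on the node */
    long page_size = sysconf(_SC_PAGE_SIZE);  /* bytes per page             */
    double gib = (double)pages * (double)page_size / (1024.0 * 1024.0 * 1024.0);
    printf("Total physical memory: %.1f GiB\n", gib);
    return 0;
}
```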
The final aspect of most HPC systems is an efficient disk storage system. While many new storage subsystems use solid-state drives (SSDs), traditional hard disk drive (HDD) based systems still offer the best capacity per dollar and can provide excellent performance. As HPC applications scale, they often require large amounts of storage for input data, intermediate results, and output data.
Scale-up vs Scale-out Defined
To meet growing workload demands, IT organizations can expand the capability of their x86 server infrastructure either by scaling up (adding fewer, more capable servers) or by scaling out (adding many, relatively less capable servers):
- Scale-up. Scale-up is achieved by putting the workload on a bigger, more powerful server (e.g., migrating from a two-socket server to a four- or eight-socket x86 server in a rack-based or blade form factor). This is a common way to scale databases and a number of other workloads. It has the advantage of not requiring significant changes to the workload; IT managers can simply install it on a bigger box and keep running the workload the way they always have.
- Scale-out. Scale-out refers to expanding to multiple servers rather than to a single, bigger server. A prime example is the use of availability and clustering software (ACS) and its server node management, which enables IT managers to move workloads from one server to another or to combine them into a single computing resource. Scale-out usually offers some initial hardware cost advantages (e.g., four two-socket servers may cost less than the single 16-socket server they replace in the data center), and in many cases the redundancy offered by a scale-out solution is also useful from an availability perspective. However, IDC research has shown that scale-out solutions can drive operating expenses (opex) up to undesirable levels, and large data volumes and growing processing requirements are taxing scale-out systems. A simple sketch of the scale-out model follows this list.
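To make the scale-out model concrete, here is a minimal, hedged C/MPI sketch: the same program runs as many cooperating processes spread across server nodes, each taking a share of the work (the earlier OpenMP sketch corresponds to the scale-up case of using more cores in one larger box). It is not from the guide; it assumes an MPI implementation such as MPICH or Open MPI, and the launch command and output are illustrative only.

```c
/* Minimal C/MPI sketch of scale-out: one process (rank) per share of the
 * workload, spread across multiple server nodes. Illustrative only.
 * Launch example (assumes Open MPI or MPICH): mpirun -np 8 ./scaleout_demo
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id within the job */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total processes across the nodes */

    printf("Rank %d of %d: processing its portion of the workload\n", rank, size);

    MPI_Finalize();
    return 0;
}
```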
As organizations entrust an increasing number of demanding, mission-critical workloads to scale-up servers, they must make sure they are not introducing single points of failure at the server level for these critical applications. This means that customers should choose servers that incorporate advanced reliability, availability, and serviceability (RAS) features. RAS servers improve resilience by allowing processing to continue in the case of hardware failure, ensuring that agreed-upon uptime service levels are met, and providing improved diagnostic and error-detection tools so that servers can be maintained more quickly and easily.
The next article in this series will discuss how to understand the needs of your HPC applications.
If you prefer, you can download the complete insideHPC Guide to Successful Technical Computing, courtesy of SGI and Intel – Click Here.