Welcome to the insideHPC Guide to Production Supercomputing and Systems Management. This five-part article series explains how properly managed HPC systems deliver sustained return on investment for supercomputing programs.
With High Performance Computing (HPC) supercomputer systems that comprise tens, hundreds, or even thousands of computing cores, users are able to increase application performance and accelerate their workflows to realize dramatic productivity improvements.
The performance potential often comes at the cost of complexity. By their very nature, supercomputers comprise a great number of components, both hardware and software, that must be installed, configured, tuned, and monitored to maintain maximum efficiency. In a recent report, IDC lists downtime and latency as two of the most important problems faced by data center managers.
Downtime happens. A study of 584 U.S.-based datacenter professionals found that 91 percent of datacenters experienced an unplanned datacenter outage in the past 24 months.
While HPC has its roots in academia and government where extreme performance was the primary goal, high performance computing has evolved to serve the needs of businesses with sophisticated monitoring, pre-emptive memory error detection, and workload management capabilities. This evolution has enabled “production supercomputing,” where resilience can be sustained without sacrificing performance and job throughput.
Production supercomputing can be roughly defined as the convergence of HPC and high performance analytics. IT departments expect their HPC system to run smoothly with all major elements – hardware, software, and networking – totally integrated.
Unplanned data center outages are expensive, and the cost of downtime is rising, according to a new study from the Ponemon Institute. The average cost of a data center outage has steadily increased from $505,502 in 2010 to $740,357 today (or a 38 percent net change).
In today’s datacenters, the overall cost of IT operations has to be considered as well. Thanks to sophisticated monitoring and power capping capabilities, system management software can ensure that your HPC system runs as efficiently as possible.
Over the next few weeks, this article series will explore:
- How to Control Your Supercomputing Programs
- Supercomputer Systems Monitoring and Management
- Supercomputer Power Management
- Review of the SGI Management Suite