In the pantheon of HPC grand challenges, weather forecasting and long-term climate simulation rank right up there with the most complex and computationally demanding problems in astrophysics, aeronautics, fusion power, exotic materials, and earthquake prediction, to name just a few. This special report looks at how HPC takes on the challenge of global weather forecasting and climate research.
The Open Compute Project partners with leading CPU vendors such as Intel, AMD and ARM-based vendors to create reference designs that may be used by board and system vendors. These designs are bare-bones systems, with expansion options designed in for other types of I/O and storage. The reference design from Intel (REF) is 6.5 inches wide and 20 inches deep. These dimensions allow for three servers to be placed side by side in a newly designed Open Compute rack, increasing density.
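The density claim comes down to simple arithmetic, sketched below. The 6.5-inch server width comes from the text; the 21-inch usable bay width is an assumption used only for illustration, and the function name is hypothetical.

```python
SERVER_WIDTH_IN = 6.5        # reference design width, from the text
ASSUMED_BAY_WIDTH_IN = 21.0  # assumption: usable bay width, not stated in the text


def servers_side_by_side(bay_width_in: float, server_width_in: float) -> int:
    """Return how many servers of the given width fit across one bay."""
    if bay_width_in <= 0 or server_width_in <= 0:
        raise ValueError("widths must be positive")
    return int(bay_width_in // server_width_in)


# Combined width of three servers placed side by side.
three_wide_in = 3 * SERVER_WIDTH_IN

# Servers that fit across the assumed bay width.
fit = servers_side_by_side(ASSUMED_BAY_WIDTH_IN, SERVER_WIDTH_IN)
print(f"three servers occupy {three_wide_in} in; {fit} fit across the bay")
```

Under these assumptions, three servers occupy 19.5 inches in total, which is why a bay slightly wider than a conventional 19-inch rack is needed to hold three across.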
This week we look at various attributes including how easy it is to scale Lustre file systems. The inherent scalability of Lustre aggregates storage capacity across many servers. I/O bandwidth also scales as more storage servers are added, and can be dynamically adjusted as needs change and demands for more storage capacity and bandwidth grow.
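As a rough illustration of that aggregation, the sketch below sums capacity and bandwidth across storage servers. It assumes idealized linear scaling; the class name, function name, and per-server figures are hypothetical and are not drawn from any Lustre tool or benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageServer:
    capacity_tb: float    # usable capacity contributed by this server (hypothetical)
    bandwidth_gbs: float  # sustained I/O bandwidth of this server (hypothetical)


def aggregate(servers):
    """Sum capacity and bandwidth across servers, assuming ideal linear scaling."""
    total_capacity_tb = sum(s.capacity_tb for s in servers)
    total_bandwidth_gbs = sum(s.bandwidth_gbs for s in servers)
    return total_capacity_tb, total_bandwidth_gbs


# Four identical storage servers, then four more added as demand grows.
small = [StorageServer(100.0, 5.0)] * 4
grown = small + [StorageServer(100.0, 5.0)] * 4

print(aggregate(small))
print(aggregate(grown))
```

In practice the network fabric, metadata performance, and file striping layout all affect how closely a real file system tracks this ideal.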
Make sure you use Cloud services that are designed for HPC applications, including high-bandwidth, low-latency networking, exclusive node use, and high-performance compute and storage capabilities for your application set. Develop a flexible, fast Cloud provisioning scheme that mirrors your local systems as much as possible and is integrated with the existing workload manager. Ideally, your existing cluster can be seamlessly extended into the Cloud and managed and monitored in the same way as local clusters. Read more from the insideHPC Guide to Managing HPC Clusters.
Heterogeneous hardware is now present in virtually all clusters. Make sure you can monitor all hardware on all installed clusters in a consistent fashion. With extra work and expertise, some open source tools can be customized for this task. There are few versatile and robust tools with a single comprehensive GUI or CLI that can consistently manage all popular HPC hardware and software. Any monitoring solution should not interfere with HPC workloads.
Smaller clusters often overload a single server with multiple services, such as file serving, resource scheduling, and monitoring/management. While this approach may work for systems with fewer than 100 nodes, these services can overload the cluster network or the single server as the cluster grows. The insideHPC Guide shows a plan for scalable HPC cluster growth.
HPC systems rely on large amounts of complex software, much of which is freely available. There is an assumption that because the software is “freely available,” there are no associated costs. This is a dangerous assumption. There are real configuration, administration, and maintenance costs associated with any type of software (open or closed).