“Managing the work on each node can be referred to as domain parallelism. During the run of the application, the work assigned to each node can generally be isolated from other nodes. The node can work on its own and needs little communication with other nodes to perform the work. The tool the developer needs for this is MPI, although the application can also take advantage of frameworks such as Hadoop and Spark (for big data analytics). Managing the work for each core or thread requires control one level down. This type of work will typically invoke a large number of independent tasks that must then share data between the tasks.”
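As a rough illustration of the domain-parallel pattern described in the excerpt, the sketch below partitions a data set into one subdomain per node, lets each node work only on its own slice, and combines the results in a single reduction step. It is a minimal sketch, not an MPI program: Python threads stand in for nodes, and the names `split_domain`, `local_work`, and `domain_parallel_sum_of_squares` are hypothetical, chosen only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_domain(n_items, n_nodes):
    """Return (start, stop) index ranges that partition n_items across n_nodes."""
    base, extra = divmod(n_items, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional item each.
        stop = start + base + (1 if node < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def local_work(data, bounds):
    """Work done by one node: it touches only its own slice of the domain."""
    start, stop = bounds
    return sum(x * x for x in data[start:stop])


def domain_parallel_sum_of_squares(data, n_nodes):
    """Run local_work on every subdomain, then combine the partial results."""
    ranges = split_domain(len(data), n_nodes)
    # Assumption: threads stand in for nodes; a real code would use MPI ranks.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(lambda b: local_work(data, b), ranges))
    # The only "communication" step: one reduction over the partial results.
    return sum(partials)
```

In a real cluster code, each subdomain would live in a separate process on a separate node, and the final `sum` would typically be a collective reduction. The point of the sketch is only that the per-node work needs no data from its neighbours until that last step.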
Remote visualization tools allow employees to dramatically improve productivity by accessing business-critical data and programs regardless of their location. Remote visualization technologies allow users to launch software applications on the server side and display the results locally, letting them leverage the bandwidth and compute power of the cluster while circumventing the latency and security risks of downloading large amounts of data onto their local client.
Modern processors contain a large number of cores, so getting maximum performance requires structuring an application to use as many of those cores as possible. Explicitly developing a program to do this can take a significant amount of effort. It is important to understand the science and algorithms behind the application, and then use whatever programming techniques are available. “Intel Threading Building Blocks (TBB) can help tremendously in the effort to achieve very high performance for the application.”
Applications such as machine learning and deep learning require incredible compute power, and these are becoming more crucial to daily life every day. These applications help provide artificial intelligence for self-driving cars, climate prediction, drugs that treat today’s worst diseases, plus other solutions to more of our world’s most important challenges. There is a multitude of ways to increase compute power but one of the easiest is to use the most powerful GPUs.
“The Intel Omni-Path Architecture is an example of a networking system that has been designed for the Exascale era. There are many features that will enable this massive scaling of compute resources. Features and functionality are designed in at both the host and the fabric levels. This enables very large scaling when all of the components are designed together. Increased reliability is a result of integrating the CPU and fabric, which will be critical as the number of nodes expands well beyond any system in operation today. In addition, tools and software have been designed to be installed and managed across the very large number of compute nodes that will be necessary to achieve this next level of performance.”
Here’s a recap of SC16 announcements from Intel that are designed to provide even more powerful capabilities to address HPC challenges like energy efficiency, system complexity, and simplified workload customization. In supercomputing, one size certainly does not fit all. Intel’s new and updated technologies take a step forward in addressing these issues, allowing users to focus more on their HPC applications, not the technology behind them.
Libraries that are tuned to the underlying hardware architecture can increase performance tremendously. Higher-level libraries such as the Intel Data Analytics Acceleration Library (Intel DAAL) can assist the developer with highly tuned algorithms for data analysis as well as machine learning. Intel DAAL functions can be called within other, more comprehensive frameworks that deal with the various types of data and storage, increasing the performance and lowering the development time of a wide range of applications.
Have you ever wondered why your HPC installation is not performing as you had envisioned? You ran small simulations. You spec’d out the CPU speed, the network speed and the disk drive speed. You optimized your application and are taking advantage of new architectures. But now, as you scale the installation, you realize that the storage system is not performing as expected. Why? You bought the latest disk drives and expected even better than linear performance gains over the last storage system you purchased. Read how you can increase the efficiency of your storage system.
“When designing an application that contains many threads and fewer cores than threads, it is important to understand the optimal number of threads to assign to each core. This value should be parameterized, so that tests can easily be run to determine the optimum value for a given machine. One thread per core on the Intel Xeon Phi processor will give the highest performance per thread. When the number of threads per core is set at two or four, the individual thread performance may be lower, but the aggregate performance will be greater.”
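The advice to parameterize the threads-per-core value can be sketched as a small timing harness that runs the same fixed amount of work at several settings and reports the fastest. This is a minimal sketch under stated assumptions: `busy_work` is a hypothetical stand-in for the real workload, all function names are invented for this example, and CPU-bound Python threads are serialized by the interpreter lock, so the timings show the shape of the experiment rather than real Xeon Phi behaviour.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def busy_work(n):
    """Hypothetical stand-in for one thread's share of the real workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_run(n_cores, threads_per_core, total_work):
    """Time a fixed amount of work split across n_cores * threads_per_core threads."""
    n_threads = n_cores * threads_per_core
    chunk = total_work // n_threads
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(busy_work, [chunk] * n_threads))
    return time.perf_counter() - start


def sweep_threads_per_core(candidates=(1, 2, 4), total_work=200_000, n_cores=None):
    """Return {threads_per_core: elapsed_seconds} for each candidate setting."""
    # Assumption: os.cpu_count() is an acceptable default for the core count.
    n_cores = n_cores or os.cpu_count() or 1
    return {tpc: time_run(n_cores, tpc, total_work) for tpc in candidates}


def best_setting(timings):
    """Pick the threads-per-core value with the shortest elapsed time."""
    return min(timings, key=timings.get)
```

Because the total work is held constant across settings, the elapsed time reflects aggregate throughput, which is the quantity the excerpt says improves at two or four threads per core even as per-thread performance drops. A native implementation would make the same parameter a runtime setting rather than a compiled-in constant.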
As data center sprawl is now understood to be expensive and may not deliver performance increases for all types of applications, new technologies are coming to the rescue. A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing, hence “field-programmable”. While the use of GPUs and HPC accelerators is generally understood today, there are a number of misconceptions about FPGAs that need to be cleared up.