Intel has been careful to label the Xeon Phi as a coprocessor, something that always pairs with a Xeon CPU. But how does their performance compare on real applications? Over at the Xcelerit Blog, Paul Sutton benchmarks both devices using an optimized parallel version of the Monte-Carlo LIBOR swaption portfolio pricer.
The pricer is executed once on the host CPUs (the Sandy Bridge processors) and again on the Xeon Phi coprocessor in offload mode. The execution time of the full application is measured, including data transfers, random number generation, and reduction, with all of these steps running on the target processor.
As the results show, from about 100K paths onwards the Intel Xeon Phi becomes faster than the Sandy Bridge processors, reaching nearly 3x at 1M paths. With lower numbers of paths, the Sandy Bridge outperforms the Phi. This can be explained by the added data transfers and the comparatively low level of parallelism available at small path counts (for both vectorization and multi-threading). The setup time for the random number generator also becomes more dominant on the Xeon Phi when relatively little computation is performed.
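For readers who have not used offload mode, the pattern looks roughly like the sketch below (a hypothetical skeleton using the Intel compiler's offload pragmas, not the actual Xcelerit benchmark code): the host prepares the inputs, and a single pragma ships the data to the coprocessor, runs the parallel kernel there, and copies the result back, so the measured time naturally includes the transfers.

#include <stdio.h>
#include <stdlib.h>

#define NPATHS 1000000

/* Compile the kernel for both the host and the coprocessor. */
__attribute__((target(mic)))
double price_portfolio(const double *rates, int npaths)
{
    double sum = 0.0;
    /* Threaded across the coprocessor's cores and vectorized within each core. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < npaths; i++)
        sum += rates[i] * 0.5;              /* stand-in for the real payoff math */
    return sum / npaths;
}

int main(void)
{
    double *rates = malloc(NPATHS * sizeof *rates);
    double price  = 0.0;
    for (int i = 0; i < NPATHS; i++)
        rates[i] = 0.02;                    /* dummy input data */

    /* Offload the computation; the in()/out() clauses trigger the host-to-coprocessor
       transfers that the benchmark timings deliberately include. */
    #pragma offload target(mic) in(rates : length(NPATHS)) out(price)
    price = price_portfolio(rates, NPATHS);

    printf("portfolio price: %f\n", price);
    free(rates);
    return 0;
}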
Over at the Phi Musings blog, Dr. Stuart Midgley from Downunder Geosolutions has been documenting the process of getting Intel Xeon Phi coprocessors working in a cluster.
If you want to run native binaries on the Phi, there are a number of serious issues:
launching your binary
bandwidth to your application
authentication onto the phi
to list a few. Intel provides a way to launch native applications, called micnativeloadex, which is almost what you want. It copies the phi application to the phi, along with the necessary shared libraries, launches it, and maps stdout and stderr back to the host. It does NOT map stdin from the host to the application… which was a show stopper for us. We use unix pipes a LOT and must be able to feed data in via stdin.
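To see why the missing stdin mapping is such a problem, consider a trivial stdin-driven filter of the kind one might pipe data through (a hypothetical example, not Midgley's actual code). Launched on the coprocessor via micnativeloadex, a program like this would simply block forever, because no data from the host ever reaches its standard input.

#include <stdio.h>

/* Sum the numbers arriving on stdin and print the total. */
int main(void)
{
    double x, sum = 0.0;
    while (scanf("%lf", &x) == 1)
        sum += x;
    printf("%f\n", sum);
    return 0;
}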
While Midgley is early on in his journey to coprocessor harmony, I think it will be fun to follow along as he goes. Read the Full Story.
The Genome Analysis Centre (TGAC), one of seven institutes that receive funding from the UK’s Biotechnology and Biological Sciences Research Council (BBSRC), has deployed two Convey HC-1ex hybrid-core systems for advanced genomics research.
TGAC, based in the UK, is an aggressive adopter of advanced sequencing and IT technology. The two Convey HC-1ex systems are the latest addition to TGAC’s powerful computing infrastructure. By installing hybrid-core Convey HC-1ex systems, TGAC expanded its cluster and ccNUMA-based HPC environment to include leading-edge heterogeneous computing capabilities.
“We need to analyse data quickly and precisely, which takes time on clusters,” explained Mario Caccamo, deputy director of TGAC. “We offloaded some of our sequence alignment demand to the Convey hybrid-core systems, because they can handle the alignment algorithms much more efficiently. Using the Convey systems, we are seeing up to 15 times acceleration on our computationally intense BWA runs.”
TGAC was part of an international team that recently demonstrated that next-generation sequencing can be used effectively to fine-map genes in polyploid wheat. TGAC will leverage Convey’s architecture to accelerate computationally challenging jobs, such as resequencing alignment for wheat and other polyploid species.
“The initial performance jump is a major improvement,” continued Caccamo. “We expect to achieve even better performance as we gain experience using the Convey platform.”
In this guest feature, Intel’s John Hengeveld reviews the past year and looks ahead to the industry challenges HPC faces in 2013.
Happy New Year Everybody! For me, 2012 was very exciting and very stressful. On the one hand I had family engagements, graduations, the launch of Intel® Xeon® E5 processors, the launch of the Intel® Xeon Phi™ brand and first products, and strong competitive moves in the industry. On the other hand I dealt with my illness, my brother-in-law’s accidental death, and the aforementioned launches and new products.
I started 2012 by predicting that it would be the year of “Practical Petascale” and expected 20 petascale-class machines – I under-called by 3 – and that they would be working on real applications (they are). I predicted we would start to see the technology gnomes cranking on the dawn of the exascale era. We saw Intel, nVidia and IBM all make a statement about what the next step toward exascale would look like. Intel made some key acquisitions and delivered the Intel® Xeon Phi™ products. I am excited that Intel announced these coprocessors reached general availability on 1/28. So now, pretty much anybody can get one from his or her favorite OEM.
Intel Sr. VP Diane Bryant announced our product line at the SC12 supercomputing conference.
I mentioned my four challenges to exascale – Programmability, Reliability, Efficiency, and System Scalability (PRESS) – and we made very visible headway on all but Reliability. The OpenMP committee moved forward on a solid standard for attached co-processing. According to the TOP500 list, the industry has substantially improved its performance per watt. System scaling solutions are starting to coalesce. On the Reliability front, there have been a few items of interest, but I haven’t seen as much as I think we need.
2013 is shaping up to be a corker in technical computing, with more new products from Intel and others and major new system deployments globally. There will likely be 50+ petascale systems – maybe more.
The biggest challenges to come this year:
The industry has been going at a breakneck pace for the past couple of years. I expect this to continue through 2013, but I am worried that the software industry is falling behind in capabilities and services.
I expect that this year will see much greater convergence and intersection between the role of the workstation in visualization and design and the role of HPC in simulation and modeling. This alone should expand the technical computing markets, but we still need to converge on means for cloud access and on standards for how clusters and workstations relate to one another.
I think that industrial investment will pick up substantially. Competition requires computation. And Big Data Analytics will grow beyond the initial Hadoop models into something much more powerful in the long term. Defining that standard will be a big challenge as well.
We had better see more traction on the system reliability front.
Quite a year last year – an amazing year this year. I love this industry. I really do.
This week SGI announced that the company has developed new software tools that enable customers and software developers to get the most value from Intel Xeon Phi coprocessors.
The SGI UPC (Unified Parallel C) compiler, the first UPC compiler for Intel Xeon Phi, supports MPSS, Intel’s Manycore Platform Software Stack for the coprocessor. It enables PGAS (Partitioned Global Address Space) programming on SGI servers equipped with Intel Xeon Phi, and it supports applications in both native and offload modes. SGI MPInside, an advanced profiling and performance analysis tool that helps developers find bottlenecks in MPI code, now also runs on Intel Xeon Phi. MPInside gives developers key capabilities for improving MPI application performance, enabling “what-if” studies to project how code will perform on future architectures.
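For readers unfamiliar with PGAS programming, a generic UPC sketch (not SGI-specific code) gives the flavor: arrays declared shared are distributed across all UPC threads, and upc_forall assigns each loop iteration to the thread that owns the data it touches.

#include <upc.h>
#include <stdio.h>

#define N 1024

/* Arrays distributed across all UPC threads (default cyclic layout). */
shared double a[N], b[N], c[N];

int main(void)
{
    int i;

    /* Each thread updates only the elements it has affinity to, so the
       loop runs in parallel with no explicit message passing. */
    upc_forall (i = 0; i < N; i++; &c[i])
        c[i] = a[i] + b[i];

    upc_barrier;

    if (MYTHREAD == 0)
        printf("vector add done on %d UPC threads\n", THREADS);
    return 0;
}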
“HPC customers require technology not only to deliver the best processing and energy efficiency, but also to speed advanced codes and algorithms to deployment,” said Raj Hazra, Intel VP and GM of the Technical Computing Group. “SGI’s UPC compiler leverages the familiar programming model of Intel Xeon Phi coprocessors. This allows customers to take immediate advantage of Intel’s new many-core technology while reusing existing code, and to achieve the expected increase in performance.”
The Department of Energy’s Environmental Molecular Sciences Laboratory has ordered up a 3.4-petaflop supercomputer from Atipa Technologies, the HPC division of Microtech Computers. The new system will replace the Chinook supercomputer which aids energy, environment and basic science missions important to DOE.
The 42-rack machine will boast a total of 195,840 cores, consisting of roughly 23,000 conventional Intel Xeon processor cores tied to 184,000 gigabytes of memory. The 1,440 compute nodes will also have an undisclosed number of Xeon Phi coprocessor cards alongside the Xeons, allowing each node to run up to 120 additional calculations in parallel. A shared parallel filesystem will offer 2.7 petabytes of usable storage across an FDR InfiniBand network. In total, there will be 128 GB of memory per node. What sets the new supercomputer apart, Atipa said, is the amount of memory devoted to each CPU, which allows the models that scientists run to operate more efficiently. For comparison, the recently completed “Stampede” supercomputer at the University of Texas also relies on just over 184,000 gigabytes of memory, with its 204,900 cores split between 8-core Intel Xeon E5-2680 processors and Xeon Phi coprocessors.
Today Matrox announced a low-cost addition to its Supersight family of industrial imaging computers that leverage the power of multi-core CPU, GPU, and FPGA technologies. Available in a 4U chassis, the new Matrox Supersight Solo lets OEMs and systems integrators maximize compute density with up to thirteen PCIe 2.0 x16 slots and dual PCIe 2.0 x16 host interfaces.
“This new addition to the Matrox Supersight™ family lets developers design cost-effective imaging systems using a lifecycle-managed platform that minimizes the need for revalidation and provides consistent long-term availability,” said Michael Chee, product manager at Matrox Imaging. “We have also taken the occasion of this new product introduction to pass along recent production cost savings on the original, multi-node Supersight, which reduce the price by over 25%.”
The new Matrox Supersight Solo systems will be available in Q2 2013. Read the Full Story.
Over at ComputerWorld UK, Richard Fichera from Forrester Research writes that Intel will accelerate the adoption of Xeon Phi explicit parallel coprocessors with lower barriers to application migration.
Eventually, possibly a couple of successive CPU generations down the road, we may see the MIC architecture wedded to the Xeon memory space via an extension of the QuickPath architecture, much the same way that the AMD Fusion architecture couples the GPU components in its integrated APUs. Along the way, Intel will introduce more scalable MIC products, and its immense leverage with OEM partners will ensure the rapid development of a robust MIC ecosystem in terms of tools, supported ISV solutions and trained developers.
Scaling CFD and UQ codes on Sequoia. Ivan Bermejo-Moreno, Sanjeeb Bose, Joe Nichols, Curtis Hamman, Francisco Palacios and Julien Bodart, Stanford University Predictive Science Academic Alliance Program (PSAAP) and Center for Turbulence Research
Programming Models and their Designs for Exascale Systems. Dhabaleswar K. Panda, Ohio State University
Energy Efficiency and its Impact on Requirements for Future Programming Environments. John Shalf, Lawrence Berkeley National Laboratory
The RAMCloud project. Ankita Kejriwal, Stanford
Charm++: HPC with migratable objects. Laxmikant Kale, University of Illinois at Urbana-Champaign
The future of network-based storage. Brent Gorda, Intel
The event is free to attend and includes lunch on both days. Register now.
In this video with the unfortunate thumbnail, Taylor Kidd from Intel presents an introduction to the hardware architecture of the Intel Xeon Phi coprocessor.
This module covers the intent of the workshop, the type of viewer it is aimed at, the hardware architecture of the Intel Xeon Phi coprocessor, the software stack, and the programming models. It briefly looks at the roadmap for the Intel Knights products, discusses the software development platform, documentation, and use of Intel Premier Support, and sets expectations on the capabilities and usage models that are appropriate for the Intel Xeon Phi coprocessor. Lastly, it looks at a brief example of the advantages of the 512-bit vector engine.
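As a rough illustration of what that vector engine buys (a generic sketch, not the example from the video): each 512-bit register holds 16 single-precision floats, so a simple unit-stride loop like the one below can retire 16 multiply-adds per vector instruction once the compiler auto-vectorizes it.

#include <stddef.h>

/* SAXPY: y = a*x + y.  With 512-bit vectors, the compiler can process
   16 floats per iteration of the vectorized loop. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}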
In this video from the Intel Xeon Phi announcement at SC12, Dr. Dan Duffy at NASA Goddard describes the installation of his IBM iDataPlex M4 servers. Using the IBM Intelligent Cluster process, his team was able to complete the installation, as well as a Linpack run that landed them at number 52 on the TOP500 supercomputer list, in just 48 hours.
The test problem is a basic N-body simulation, which is the foundation of a number of applications in computational astrophysics and biophysics. Using common code in the C language for both the host processor and the coprocessor, they benchmark the N-body simulation. The simulation runs 2.3x to 5.4x faster on a single Intel Xeon Phi coprocessor than on two Intel Xeon E5 series processors, with the exact speedup depending on the accuracy settings for transcendental arithmetic. They also study the assembly code produced by the compiler from the C code. This makes it possible to pinpoint some strategies for designing C/C++ programs that result in efficient, automatically vectorized applications for Intel Xeon family devices.
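The shape of such a kernel is roughly as follows (a simplified, generic sketch, not the article's actual code): a structure-of-arrays layout and a unit-stride inner loop are what allow the compiler to auto-vectorize it for both the Xeon and the Xeon Phi, and the accuracy chosen for the reciprocal square root is the kind of transcendental setting that moves the speedup between 2.3x and 5.4x.

#include <math.h>

/* Compute gravitational-style accelerations for n bodies (all masses = 1). */
void accelerations(int n, const float *x, const float *y, const float *z,
                   float *ax, float *ay, float *az)
{
    const float softening = 1e-9f;
    for (int i = 0; i < n; i++) {
        float axi = 0.0f, ayi = 0.0f, azi = 0.0f;
        for (int j = 0; j < n; j++) {       /* unit-stride, vectorizable loop */
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy + dz * dz + softening;
            float inv_r  = 1.0f / sqrtf(r2);
            float inv_r3 = inv_r * inv_r * inv_r;
            axi += dx * inv_r3;
            ayi += dy * inv_r3;
            azi += dz * inv_r3;
        }
        ax[i] = axi;
        ay[i] = ayi;
        az[i] = azi;
    }
}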
In this podcast, the Radio Free HPC team is still talking about the recently concluded SC12 conference in Salt Lake City. The conversation starts with a short review of Thanksgiving dinner (including disgusting eating noises added in at no additional charge) before moving on to more weighty topics such as Intel’s formal introduction of their Xeon Phi coprocessor, including some performance and price information.
Rich and Henry think that Intel has a strong hand with Phi, but Dan isn’t so sure…
Over at Dr. Dobb’s, author Rob Farber writes that both CUDA and Phi coprocessors provide high degrees of parallelism that can deliver excellent application performance. But what if your code is already written in CUDA?
To run on Intel Xeon Phi coprocessors, CUDA kernels need to be modified, and at the moment this needs to be done by hand. While it is technically possible to run CUDA on Phi coprocessors, products such as CUDA-x86 do not currently generate code for these devices. An OpenCL compiler for the Intel Xeon Phi coprocessor is coming, which means that CUDA programmers can consider Wu Feng’s CU2CL CUDA-to-OpenCL source translator to port their code. In the future, an LLVM translation project might be able to create executable code for the Phi.
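To give a flavor of what such a port involves (a generic SAXPY example, not code from the article), the CUDA kernel shown in the comment below maps onto an OpenCL C kernel with the thread-index intrinsics replaced by their OpenCL equivalents; an OpenCL compiler for the coprocessor could then build the result for the Phi.

/* CUDA original (for comparison):
 *   __global__ void saxpy(int n, float a, const float *x, float *y) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) y[i] = a * x[i] + y[i];
 *   }
 *
 * OpenCL C equivalent, as a source-to-source translator like CU2CL would emit: */
__kernel void saxpy(int n, float a, __global const float *x, __global float *y)
{
    int i = get_global_id(0);
    if (i < n)
        y[i] = a * x[i] + y[i];
}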
This is a deep-dive feature story that is well-worth a look. Read the Full Story.
The latest version of Moab was designed to recognize and work with the new Intel Xeon Phi coprocessors, based on Intel Many Integrated Core (MIC) technology. This ability to automatically detect Intel Xeon Phi coprocessors, and to determine their location and availability, improves processor utilization by scheduling jobs more intelligently and removes the need for extensive reprogramming to integrate the coprocessors into existing systems. It also allows for policy-based scheduling, optimizing the choice of accelerators and coprocessors. As Intel Xeon Phi coprocessors are introduced into existing systems, this keeps costs and management effort to a minimum while maximizing utilization to ensure the most efficient job processing, using metrics including the number of cores and hardware threads, physical memory available (total and free), maximum frequency, architecture, and load.