My friend Joab Jackson has written a good summary of LLNL’s Hyperion effort at Government Computer News:
One of the biggest challenges for managers of high-production computer systems is testing new code or hardware before it is rolled out across thousands of nodes. Something may work fine on a single server, but it could collapse as it is extended across many more nodes. Compounding the problem is that supercomputer cycles are so in demand these days that getting time on the big iron for testing purposes isn’t very likely.
One possible solution to this challenge is a new, modestly sized supercomputer being put into place by the Energy Department’s Lawrence Livermore National Laboratory (LLNL). The system, dubbed Hyperion, will have 1,152 nodes with 9,216 processor cores, which should give the machine the ability to execute about 100 trillion floating-point operations per second (TFlops).
I guess whether 100 TFLOPS is “modest” is all down to perspective. Joab’s point of reference was Roadrunner.
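For a rough sense of scale, the quoted figures can be broken down per node and per core. This is only a back-of-the-envelope sketch: it takes the article's "about 100 TFlops" as exactly 100 and spreads it evenly across the machine, and the names used (`NODES`, `CORES`, `PEAK_TFLOPS`, `per_unit_gflops`) are made up for illustration.

```python
# Figures quoted in the article; the 100 TFlops value is approximate ("about").
NODES = 1_152
CORES = 9_216
PEAK_TFLOPS = 100.0


def per_unit_gflops(peak_tflops: float, units: int) -> float:
    """Spread an aggregate peak (in TFlops) evenly over `units`, in GFlops."""
    return peak_tflops * 1_000 / units


cores_per_node = CORES // NODES
gflops_per_core = per_unit_gflops(PEAK_TFLOPS, CORES)
gflops_per_node = per_unit_gflops(PEAK_TFLOPS, NODES)

print(f"{cores_per_node} cores per node")
print(f"{gflops_per_core:.2f} GFlops per core")
print(f"{gflops_per_node:.1f} GFlops per node")
```

So the machine is built from eight-core nodes, each contributing somewhere under 90 GFlops of the aggregate figure.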
“The engineering guys typically don’t have budgets for computers, and we want them to have access to this stuff so they can test [their work] at scale,” he continued. Seager noted that too many times a piece of hardware or software won’t scale when it’s introduced into a production environment, and the operational folks will have to troubleshoot the system while it is running production jobs. “Building this testbed will provide these resources upfront,” he said.
In exchange for their participation (and donation of equipment), the vendors own a portion of the machine, which they get to use for their own testing. In addition, LLNL will be using it for its own tests:
One early round of testing will determine whether to use Infiniband or Ethernet transport for storage area networks. Hyperion will have two storage area networks, one based on Infiniband and one based on 10 Gigabit Ethernet.
The lab will also use the system to test and refine the Lustre file system, an open source version of Infiniband, a cluster version of Red Hat Enterprise Linux, as well as various other cluster software used by Energy Department labs.
One of the early software testers using the system is Stanford University, which is developing a hypersonic aircraft modeling code.
They were able to do a test run on Hyperion to ensure it will be ready when they get a time slot on one of the production machines that operate on behalf of the Energy Department’s Accelerated Strategic Computing Initiative.