Titan Supercomputer Users Enjoy Increased Stability

Over at OLCF, Jeff Gary writes that the Oak Ridge Leadership Computing Facility’s Titan supercomputer has overcome a challenging launch and is now showing impressive stability.

“The machine is running very well,” said Don Maxwell, task lead in the OLCF’s HPC Operations Group. “Node failures are on par with what we’d expect. Things are going very well. We’ve only experienced one unscheduled outage in over five months and no unscheduled outages in 2014.”

After Titan was delivered, two rounds of rolling repairs were undertaken. The second was completed on December 17. For that phase, about 20 percent of the machine was taken off-line at a time. Since repairs were completed, the machine has been very stable and heavily utilized.

The fact that these challenges presented themselves is not unexpected, said OLCF Project Director Buddy Bland. “As we’ve seen many times with very large, first-of-a-kind systems, you’re likely to find abnormalities and manufacturing defects that might never be found anywhere else, just because there are so many different parts from so many different places.”

Increased stability has laid the foundation for higher user productivity. Since January 1, 2014, OLCF users have completed 110,587 jobs on Titan and have used 1,611,330,832 core hours. The 2014 INCITE projects are off to a fast start and have collectively used a higher percentage of their allocation than ever before at this point in the allocation cycle.

Read the Full Story.