Cray poised to grow market through the magic of ISV software

New version of CLE runs “any standard x86-based Linux application”

Today at the IDC HPC User Forum, Barry Bolding, Cray’s VP of scalable systems, introduced what I reckon to be the most significant strategic announcement the company has made since it consolidated around the high-end business shortly after Pete Ungaro stepped up from sales into the role of CEO and President several years ago.

On the surface, it doesn’t sound like much: today marked the release date for version three of the operating system Cray uses on its high-end XT systems, the Cray Linux Environment (CLE).

But the central purpose of this release is to bring to XTs the ability to run any standard x86-based application, removing a major objection to purchase often cited by the well-heeled customers of both Cray’s high-end XT and midrange XT*m systems. In fact, the move has already influenced at least one major acquisition: when I spoke with Bolding last week ahead of this announcement, he cited CLE 3.0 and its ability to run ISV applications as a major factor in Cray’s recent $45M sweep of the DoD HPC Modernization Program.

Cray started its move toward ISV compatibility and broader market acceptance with the introduction of the CX1 and the recently added CX1000 products. The release of CLE 3.0 moves to close that circle from the performance end of the business, creating the beginnings of a path that will allow customers to move smoothly from a workstation in their office to supercomputers with hundreds of thousands of cores, without ever leaving the Cray family.

Computing, à la mode

The secret behind CLE’s newfound love for ISV applications is the introduction of a new group of features in the operating system called Cluster Compatibility Mode, or CCM. CCM allows nodes in an XT line supercomputer to run a fully standard x86 Linux (SUSE SLES 11 in this case) — applications simply install and run. Matlab on your XT6? You can do that.

CCM is contrasted with Extreme Scalability Mode, which until today was the only way to run an XT. In ESM a single application can span hundreds of thousands of cores, taking advantage of the CLE’s lightweight kernel for scalability, and the custom SeaStar interconnect and tuned communications libraries for speed. In CCM, however, applications are limited to 2,048 cores, and only have access to MPI on a TCP/IP stack for communication. Bolding describes CLE 3.0 as a “feature release,” meaning they were focused on getting ISV software onto Cray supercomputers. The next release, planned for next year, will be a “performance release,” with both larger numbers of cores made available to CCM applications and support for OFED (and thus InfiniBand).

Another key feature of CLE 3.0 is that datacenter administrators do not have to partition their machines ahead of time into blocks of nodes dedicated to the two modes: nodes can swap from one mode to the other via a user-settable parameter in the job submission script. According to Bolding, under CLE 3.0 nodes run in ESM by default. When a user submits a job using software that requires CCM, the job dispatch system (called ALPS, in case you are curious) instructs the allocated nodes to set up the standard Linux environment, the job runs, and then those nodes are switched back to ESM before being returned to the compute pool. Bolding says that the time to set up CCM on a pool of nodes is “only a few seconds.”
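For readers who think better in code, the lifecycle Bolding describes can be sketched as a toy model. To be clear about what is assumed: this is not ALPS or any Cray interface. The `Dispatcher`, `Node`, and `Job` classes and the `mode` field are hypothetical names invented for illustration, and the real switch involves standing up a Linux environment rather than flipping a flag.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ESM = "Extreme Scalability Mode"
    CCM = "Cluster Compatibility Mode"


@dataclass
class Node:
    node_id: int
    mode: Mode = Mode.ESM   # per the article, nodes run in ESM by default
    busy: bool = False


@dataclass
class Job:
    name: str
    nodes_needed: int
    mode: Mode = Mode.ESM   # hypothetical stand-in for the per-job parameter


class Dispatcher:
    """Toy stand-in for the job dispatch system (not the real ALPS)."""

    def __init__(self, node_count: int):
        self.pool = [Node(i) for i in range(node_count)]

    def run(self, job: Job) -> list:
        free = [n for n in self.pool if not n.busy]
        if len(free) < job.nodes_needed:
            raise RuntimeError("not enough free nodes for " + job.name)
        allocated = free[:job.nodes_needed]
        # Switch only the allocated nodes into the mode the job asked for.
        for node in allocated:
            node.busy = True
            node.mode = job.mode
        try:
            # The job would execute here; we just record the modes it saw.
            return [node.mode for node in allocated]
        finally:
            # Nodes go back to ESM before rejoining the compute pool.
            for node in allocated:
                node.mode = Mode.ESM
                node.busy = False


dispatcher = Dispatcher(node_count=4)
seen = dispatcher.run(Job("isv_solver", nodes_needed=2, mode=Mode.CCM))
print([m.name for m in seen])                     # modes during the job
print(sorted(n.mode.name for n in dispatcher.pool))  # modes afterwards
```

The point of the sketch is the `finally` block: no node stays in CCM once its job ends, which is why administrators do not need a standing CCM partition.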

There is another nice feature of CLE 3.0 that you have to be in HPC center management to care about, which I mention because I used to be in center management: you can finally run a networked license manager for your software without having to do a backbend and perform a ritual sacrifice. About time.

Something for everyone

Although the most significant strategic feature of CLE 3.0 is CCM and its ability to run standard Linux applications, Bolding says there is quite a bit baked into this release for the high end as well. Much of this is there to support Baker, Cray’s next-generation high-end HPC platform, and the company is delaying discussion of most of those features until that platform launches later this year.

We do know that this release scales support to machines with “more than 500k cores,” up from the 250,000-core machines supported in the last version of CLE. Lustre 1.8 is also supported, and the release adds numerous reliability enhancements, such as warm swap of blades and link resiliency for Baker systems.

CLE 3.0 also includes a nifty performance feature called “core specialization.” Core specialization allows 1 of the 24 cores in a Magny-Cours node to be designated as the “OS” core. When activated, all OS tasks are pinned to that one core. Bolding explains that this doesn’t benefit all applications, which is why Cray has taken the step of allowing users to specify at runtime whether this feature is turned on (as it did with CCM). “We have seen application performance range from 20% faster to 5% slower,” says Bolding. “It just depends upon the particulars of each application.”
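A back-of-the-envelope model suggests why the results could swing both ways. The sketch below rests on assumptions of mine, not on anything Cray has published: that the application gives up the OS core entirely, that it otherwise scales perfectly across the cores in a node, and that the benefit of moving OS tasks away can be captured as a single per-core speedup. The function names are hypothetical.

```python
def relative_throughput(total_cores: int, per_core_gain: float) -> float:
    """Throughput with one core reserved for the OS, relative to using
    every core for the application.

    per_core_gain is the assumed fractional speedup each remaining core
    sees once OS tasks are moved off it (0.05 means 5% faster).
    """
    app_cores = total_cores - 1          # one core is handed to the OS
    return app_cores * (1.0 + per_core_gain) / total_cores


def break_even_gain(total_cores: int) -> float:
    """Per-core gain needed to make up for the core given to the OS."""
    return total_cores / (total_cores - 1) - 1.0


cores = 24  # cores per node, from the article
print(f"break-even per-core gain: {break_even_gain(cores):.2%}")
print(f"no gain at all:           {relative_throughput(cores, 0.0):.3f}x")
print(f"10% per-core gain:        {relative_throughput(cores, 0.10):.3f}x")
```

Under these assumptions an application needs a little over 4% per-core improvement just to break even on a 24-core node, so codes that are insensitive to OS interference would be expected to come out slightly behind, while noise-sensitive codes could come out well ahead.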

Some now, more later

The new Cray Linux Environment is being rolled out across Cray’s XT lines in phases. The XT6 and XT6m will be the first to get it this quarter. The XT5 and XT5m will see CLE 3.0 later this year, and it will be available on XT4s in early 2011.

This is clearly an important step for the company, but what about going further and having a single operating environment on all of its products, from the Intel-based CX line up through the XTs? Bolding says this is something they are discussing, but it isn’t on the roadmap today.

We do already know that Cray’s HPCS Cascade system is Intel-based, and that system will run a SUSE-based operating system (codenamed “Nile” if you are keeping track). It seems unlikely that Cray will want to maintain a separate operating system for Cascade and the AMD-based XT line, so at some point (CLE 5?) CLE will probably run on both AMD and Intel processors.

A good move at a good time

Today’s Cray is in the strongest financial position it has seen in quite some time, and it has an excellent stable of (at least until Baker later this year) well-tested hardware that has been enthusiastically received by a core of high end customers. The company has continued to flirt with profitability, but has not yet convincingly crossed that threshold. What it needs to push it over the line is a way to grow the market for its flagship, high(er) margin supercomputers. CLE 3.0 is a significant step in that direction.

While it seems likely that Cray won’t enjoy the full benefit of the ISV compatibility story with new customers until CLE 4.0 brings higher performance to CCM, today’s announcement has already won it significant new business, with the potential for even more growth over the next year.