Interview: Numascale Aggregates Big Memory with Commodity Servers

Does your application need Big Memory to run efficiently? Numascale has been making waves of late with their rather clever hardware-based server aggregation platform. To learn more, I caught up with their VP of Business Development, Einar Rustad.

insideHPC: How does NumaConnect enable large shared-memory systems to be built from commodity servers?

Einar Rustad: We have developed an ASIC with all the logic, including distributed switching for 1-, 2- and 3-D torus topologies. The chip is mounted on a PCB that plugs into an HTX slot. For boxes that lack an HTX connector, we have a solution that picks up the HyperTransport signals from a CPU socket.

insideHPC: How is cache-coherency maintained amongst so many nodes?

Einar Rustad: This is done at 64-byte cache-line granularity (the same as the processor caches) through a directory-based coherency protocol.
We use one DRAM for storing tag information and another for storing remote cache lines. We call this the “Remote Cache”, but it is local to each node and mounted on our PCB. One such PCB is used for each motherboard. It acts as an L4 cache, and its size is configurable from 2 to 8 GBytes per node.
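
To make the idea concrete, here is a minimal C sketch of how a directory-based protocol can track sharers per 64-byte cache line. The data structures and the handle_write routine are illustrative assumptions for this article, not Numascale's actual ASIC logic.

```c
/*
 * Conceptual sketch only -- not Numascale's ASIC logic.  A directory-based
 * protocol keeps one entry per 64-byte cache line recording which nodes
 * hold a copy, so a write only has to invalidate those nodes.
 */
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE_BYTES 64
#define MAX_NODES        72          /* e.g. the 72-node system mentioned below */

typedef enum { LINE_INVALID, LINE_SHARED, LINE_MODIFIED } line_state_t;

typedef struct {
    line_state_t state;                          /* coherence state of the line */
    uint64_t     sharers[(MAX_NODES + 63) / 64]; /* bitmap of caching nodes     */
    uint16_t     owner;                          /* node holding a dirty copy   */
} dir_entry_t;

/* One directory entry per 64-byte line: drop the low 6 bits of the address. */
static uint64_t line_index(uint64_t paddr) { return paddr / CACHE_LINE_BYTES; }

/* On a write, invalidate every other sharer, then mark the line modified. */
static void handle_write(dir_entry_t *e, uint16_t writer)
{
    for (uint16_t n = 0; n < MAX_NODES; n++) {
        int is_sharer = (e->sharers[n / 64] >> (n % 64)) & 1;
        if (is_sharer && n != writer) {
            /* real hardware would send an invalidate message to node n here */
            e->sharers[n / 64] &= ~(1ULL << (n % 64));
        }
    }
    e->sharers[writer / 64] |= 1ULL << (writer % 64);
    e->owner = writer;
    e->state = LINE_MODIFIED;
}

int main(void)
{
    dir_entry_t e = { .state = LINE_SHARED };
    e.sharers[0] = (1ULL << 3) | (1ULL << 17);   /* nodes 3 and 17 share the line */
    handle_write(&e, 5);                         /* node 5 writes it              */
    printf("line %llu: state=%d owner=%u\n",
           (unsigned long long)line_index(0x1234C0), (int)e.state, (unsigned)e.owner);
    return 0;
}
```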

insideHPC: What is your largest deployment to date and how well does it perform?

Einar Rustad: The largest that has been tested by a customer is a 32-node system with 384 cores and 1 TByte of main memory. The test was very successful: the system showed linear scalability, whereas a cluster solution showed negative scaling beyond a single node. The application was reverse time migration (RTM) seismic data processing.

The largest currently being installed is a 72-node system based on the IBM x3755, with a total of 1728 cores and 4.6 TBytes of main memory.

insideHPC: How does the Numascale hardware performance compare to software-based server aggregation solutions?

Einar Rustad: We do not have any relevant performance data from software-based solutions, since all the benchmarks we have seen from those run applications whose data sets fit in the memory of a single node. Since such solutions necessarily have to work at the minimum 4K-page granularity rather than 64-byte cache lines, we find it hard to believe that they will perform well on codes with a reasonably random access pattern over a memory space that exceeds the memory of a single node. In fact, customers with such systems have asked us to run a simple program that touches pages scattered across the whole memory space, and they were really impressed when it completed in a fraction of a second, whereas the same test basically brings a software-based system to a halt.
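
A page-touching test of the kind Rustad describes can be as simple as the hypothetical C sketch below, which walks a large allocation at 4K-page granularity in a scattered order and times the walk. The buffer size and stride are placeholder values for illustration, not the customer's actual benchmark.

```c
/*
 * Hypothetical reconstruction of the page-touching test described above:
 * walk a large allocation at 4K-page granularity in a scattered order and
 * time the walk.  The 1 GiB buffer is a stand-in so the sketch runs on an
 * ordinary machine; the customer test spanned far more memory.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE 4096UL
#define BUF_BYTES (1UL << 30)            /* 1 GiB for the sketch */

int main(void)
{
    size_t npages = BUF_BYTES / PAGE_SIZE;
    char *buf = malloc(BUF_BYTES);
    if (!buf) { perror("malloc"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* An odd prime stride gives a pseudo-random permutation of the page
     * indices, so prefetchers and page-level caching get little help. */
    size_t stride = 7919, p = 0;
    for (size_t i = 0; i < npages; i++, p = (p + stride) % npages)
        ((volatile char *)buf)[p * PAGE_SIZE] = (char)i;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("touched %zu pages in %.3f s\n", npages, secs);

    free(buf);
    return 0;
}
```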

insideHPC: How does the ability to run extremely large problems in-memory change the way researchers do science?

Einar Rustad: We believe they can be much more productive, since they can keep entire data sets in memory without having to decompose them. Graph processing in particular will be much more efficient with 1-2 microsecond access to any record within a memory space of up to 256 TBytes. The shared-memory programming model is also much easier than explicit message passing, with less code (approximately 50% less) and correspondingly fewer bugs. This will increase programmers’ productivity and also expand the community of programmers who can write software for such systems, since the programming model is exactly the same as on their desktop or even laptop. In fact, one interesting comment from the guy who tested the seismic code was: “The system is so easy to use; it works just like my laptop, except that it is much bigger and more powerful!”
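
For a sense of what that shared-memory style looks like in practice, here is a minimal OpenMP sketch; it is an illustration chosen for this article, not code from Numascale or the seismic customer. A reduction over one large in-memory array is a plain parallel loop, with no decomposition, scattering, or explicit sends and receives.

```c
/*
 * Illustrative only: with one shared address space, a reduction over the
 * whole data set is a plain parallel loop -- no domain decomposition, halo
 * exchange, or explicit sends and receives.  Compile with: cc -fopenmp sum.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    size_t n = 1UL << 26;                /* scaled-down data set for the sketch */
    double *data = malloc(n * sizeof *data);
    if (!data) { perror("malloc"); return 1; }

    for (size_t i = 0; i < n; i++)
        data[i] = (double)i;

    double sum = 0.0;
    /* Every thread works directly on the single shared array. */
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += data[i];

    printf("sum = %.0f using %d threads\n", sum, omp_get_max_threads());
    free(data);
    return 0;
}
```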

insideHPC: How does the price of a Numascale cluster built from commodity servers compare with that of a comparable large-memory SGI UV system?

Einar Rustad: From what we have seen the difference is about a factor of ten.

You can check out Numascale at SC12 booth #3218.