At SC14 in New Orleans, Altair announced key features for PBS Professional 13.0, scheduled to launch in Q1 2015. The new version will take scalability to the next level, with massive jumps in supported system size, job dispatch speed and throughput; users will also benefit from key resilience, flexibility and scheduling improvements.
“We’ve architected PBS13 with exascale in mind, supporting over 1 million jobs per day and increasing throughput by 10x while hardening our resiliency and scheduling features,” says Bill Nitzberg, CTO of PBS Works. “It’s the biggest release I’ve been involved with since PBS Professional was created.”
Altair Knows HPC
Proven for over 20 years at thousands of global sites, Altair’s PBS Professional is a market-leading workload management and job scheduling system for high-performance computing (HPC) environments. PBS Professional manages workload for the world’s largest supercomputers.
With PBS Professional 13.0, the PBS Works Suite solidifies its position as the industry’s most comprehensive suite of commercial-grade software for HPC workload management. The suite includes software for web-based job submission and monitoring, remote visualization and analytics/reporting, along with the PBS Pro centerpiece for powerful scheduling.
At Supercomputing 2014 the PBS Works suite was named “Best HPC Software Product or Technology” by readers of HPCwire, an honor that Altair CEO James Scapa said “signals our success in creating a comprehensive offering that users recognize is critical to effective high performance computing.”
Altair is unique in the HPC space. In addition to delivering PBS Works for workload management, Altair also develops a market-leading set of high-performance engineering applications (HyperWorks) – and employs over 700 engineers (ProductDesign) who work daily with customers using PBS Works and HyperWorks to solve real challenges. No other organization has a better understanding of the needs of HPC users, and what it takes to implement HPC solutions with success, efficiency and ROI.
Altair’s PBS13 is architected for exascale: the company is testing PBS13 to 100,000+ nodes and promises a dispatch rate of 100 jobs per second, a 15x improvement over the current version. PBS13 will offer fast, reliable startup of huge MPI jobs (jobs with tens of thousands of MPI ranks), as well as fast throughput on short jobs. In addition, PBS13’s comprehensive health check framework monitors the behavior of a user’s health check scripts to improve resilience and productivity.
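As a rough sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The function names are illustrative only, and the result assumes the peak dispatch rate is sustained for a full day, so it is an upper bound rather than a throughput claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def jobs_per_day(dispatch_rate_per_sec: float) -> float:
    """Upper bound on daily dispatches if the peak rate were sustained all day."""
    return dispatch_rate_per_sec * SECONDS_PER_DAY


def implied_baseline(rate_per_sec: float, improvement_factor: float) -> float:
    """Dispatch rate of the previous version implied by an 'N x' improvement."""
    return rate_per_sec / improvement_factor


def average_rate(jobs_in_a_day: float) -> float:
    """Average jobs per second needed to complete a given daily job count."""
    return jobs_in_a_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"100 jobs/s sustained: {jobs_per_day(100):,.0f} jobs/day")
    print(f"Implied current rate: {implied_baseline(100, 15):.2f} jobs/s")
    print(f"1M jobs/day average:  {average_rate(1_000_000):.2f} jobs/s")
```

On these assumptions, 100 jobs per second leaves considerable headroom over the average rate needed for one million jobs per day.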
Don’t miss the PBS Professional 13.0 webinar with PBS Works CTO Bill Nitzberg on December 3.
PBS13 will also support Linux control groups (cgroups), which eliminate resource contention so that jobs run faster and don’t interfere with each other or the OS. Cgroup support is implemented via a flexible plugin approach and will be Limited Availability in 13.0. PBS13 also offers expanded plugin events to enable even greater extensibility and customization. Finally, PBS13 offers more scheduling policy controls and fine-grained targeting for preemption to better address complex business needs.
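To illustrate how cgroups keep jobs from interfering with one another, the sketch below computes the settings that would confine one job to its own cores and memory cap. It is not Altair’s plugin: the `pbs/<job id>` hierarchy, the function name and the example job ID are assumptions made for illustration. Only the control file names (`cpuset.cpus`, `cpuset.mems`, `memory.limit_in_bytes`) come from the Linux cgroup v1 interface. The code only builds a mapping and writes nothing to the system.

```python
from typing import Dict, Iterable

CGROUP_ROOT = "/sys/fs/cgroup"  # common cgroup v1 mount point (assumed)
PARENT = "pbs"                  # hypothetical parent group for all jobs


def cgroup_settings(job_id: str, cpus: Iterable[int], mem_bytes: int,
                    numa_nodes: str = "0") -> Dict[str, str]:
    """Return {control file path: value} confining a job to given cores and memory."""
    cpu_list = ",".join(str(c) for c in sorted(set(cpus)))
    cpuset_dir = f"{CGROUP_ROOT}/cpuset/{PARENT}/{job_id}"
    memory_dir = f"{CGROUP_ROOT}/memory/{PARENT}/{job_id}"
    return {
        f"{cpuset_dir}/cpuset.cpus": cpu_list,    # cores the job may run on
        f"{cpuset_dir}/cpuset.mems": numa_nodes,  # memory nodes the job may use
        f"{memory_dir}/memory.limit_in_bytes": str(mem_bytes),  # hard memory cap
    }


if __name__ == "__main__":
    # Example job ID is made up; two cores and 4 GiB of memory.
    for path, value in cgroup_settings("123.server", [2, 3], 4 * 1024**3).items():
        print(f"{path} = {value}")
```

Because each job receives a disjoint core list and its own memory limit, a runaway job exhausts only its own allocation instead of starving its neighbors or the OS.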
Under the Covers
Altair is refactoring the underlying protocols and data structures of PBS Professional, reducing data sharing, making key operations non-blocking, multi-threading communications, adding horizontal parallelism, and employing structured performance profiling of real (>1 petaflop) workloads.
Today’s PBS Professional already has no single points of failure: a user can kill any single component (power off, unplug, even kill -9) without losing any running or queued jobs, and jobs running on nodes that are powered off are automatically rerun elsewhere. The PBS Plugin framework enables sophisticated health checking via “hooks” that can not only check node health but also take nodes offline, reboot problem nodes, restart the scheduling cycle, and notify the administrator.
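The decision logic behind such a health check can be sketched in plain Python. PBS hooks are themselves Python scripts, but a real hook runs inside PBS and uses its `pbs` module; this sketch deliberately avoids that module so it can run anywhere. Every name here (`Action`, `CheckResult`, `check_free_space`, `decide`) is hypothetical, and the 5% free-space threshold is an arbitrary example value.

```python
import shutil
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    NONE = "none"        # node is healthy, leave it in service
    OFFLINE = "offline"  # stop scheduling jobs onto the node
    REBOOT = "reboot"    # failure is expected to clear after a restart


@dataclass
class CheckResult:
    name: str
    ok: bool
    recoverable_by_reboot: bool = False


def check_free_space(path: str = "/", min_free_fraction: float = 0.05) -> CheckResult:
    """Example check: fail when the filesystem holding `path` is nearly full."""
    usage = shutil.disk_usage(path)
    free = usage.free / usage.total if usage.total else 0.0
    return CheckResult("free_space", ok=free >= min_free_fraction)


def decide(results: List[CheckResult]) -> Action:
    """Map check results to one action; a real hook would also notify an admin."""
    failed = [r for r in results if not r.ok]
    if not failed:
        return Action.NONE
    if all(r.recoverable_by_reboot for r in failed):
        return Action.REBOOT
    return Action.OFFLINE


if __name__ == "__main__":
    print(decide([check_free_space()]))
```

Keeping the checks separate from the decision makes each check easy to test on its own and keeps the policy (offline versus reboot) in one place.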
Altair’s resilience roadmap includes extending this level of reliability to all PBS Works products. In addition, Altair is adding hook events to enable detection of new types of failures. These features are designed with the user in mind, informed by Altair’s experience with the thousands of cluster admins and application end users who employ its products.