checkpoint restart Archives - High-Performance Computing News Analysis

Checkpointing the Un-checkpointable: MANA and the Split-Process Approach

September 8, 2019 by Doug Black

Gene Cooperman from Northeastern University gave this talk at the MVAPICH User Group. “This talk presents an efficient, new software architecture: split processes. The ‘MANA for MPI’ software demonstrates this split-process architecture. The MPI application code resides in ‘upper-half memory’, and the MPI/network libraries reside in ‘lower-half memory’.”

Filed Under: Compute, CPUs, GPUs, FPGAs, Events, HPC Hardware, HPC Software, Industry Segments, Main Feature, News, Parallel Programming, Research / Education, Resources, Storage, Videos Tagged With: checkpoint restart, MANA for MPI, MPI, MVAPICH, MVAPICH User Group Meeting, nvidia, Weekly Newsletter Articles

Easy and Efficient Multilevel Checkpointing for Extreme Scale Systems

July 24, 2018 by Doug Black

Leonardo Bautista from the Barcelona Supercomputing Center gave this talk at PASC18. “Extreme scale supercomputers offer thousands of computing nodes to their users to satisfy their computing needs. As the need for massively parallel computing increases in industry, computing centers are being forced to increase in size and to transition to new computing technologies. In this talk, we will discuss how to guarantee high reliability to high performance applications running in extreme scale supercomputers. In particular, we cover the tools necessary to implement scalable multilevel checkpointing for tightly coupled applications.”

Filed Under: Compute, Events, HPC Hardware, HPC Software, Industry Segments, Main Feature, News, Research / Education, Resources, Videos Tagged With: Barcelona Supercomputing Centre, BSC, checkpoint restart, PASC18, Weekly Newsletter Articles

High Availability HPC: Microservice Architectures for Supercomputing

February 27, 2018 by Doug Black

Ryan Quick from Providentia Worldwide gave this talk at the Stanford HPC Conference. “Microservices power cloud-native applications to scale thousands of times larger than single deployments. We introduce the notion of microservices for traditional HPC workloads. We will describe microservices generally, highlighting some of the more popular and large-scale applications. Then we examine similarities between large-scale cloud configurations and HPC environments. Finally we propose a microservice application for solving a traditional HPC problem, illustrating improved time-to-market and workload resiliency.”

Filed Under: Compute, Datacenter, Editor's Choice, Events, Exascale, Government, HPC Hardware, HPC Software, Industry Segments, Main Feature, News, Resources, Videos Tagged With: checkpoint restart, HPC AI Advisory Council, microservices, Providentia Worldwide, resilience, Stanford HPC Conference, Weekly Newsletter Articles

RCE Podcast Looks at Scalable Checkpoint/Restart (SCR)

December 30, 2013 by Doug Black

In this podcast, Brock Palen and Jeff Squyres speak with Kathryn Mohror and Adam Moody about Scalable Checkpoint/Restart (SCR).

Filed Under: Featured, HPC Software, News, Podcast, Resources, Tools Tagged With: checkpoint restart
