More crazy MPI ideas: Fault detection and recovery

I had a good conversation with an ISV yesterday who makes a popular MPI-based simulation application. One of the things I like to do in these kinds of conversations is ask the ISV engineers two questions:

What new features do you want from the MPI implementations that you use?
What new features or changes do you want from the MPI API itself?

You know - talk to the actual users of MPI and see what they want, both from an implementation perspective and from a standards perspective. Shocking!

One of the big items that came out of our discussion was a desire for better fault tolerance and/or resilience in MPI applications.

To be fair: fault tolerance is a big topic, and full of both difficult and contentious issues. But the big point that they wanted was actually surprisingly simple in concept:

When an MPI process fails (for whatever reason), guarantee that all other MPI processes that are stuck in blocking MPI API calls involving the dead process return with some kind of reasonable error code.

They didn't care too much about continuing MPI after that - they just wanted to know that an error occurred so that they could save some state to stable storage, perhaps print a helpful error message for the end user, or otherwise clean up after the run. This is a considerably smaller goal than other fault tolerance efforts (e.g., to be able to continue an MPI job after a failure).
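From the application's side, the pattern they are asking for might look like the following minimal sketch. It is a plain-Python simulation, not real MPI: `PeerFailedError`, `exchange`, and `run` are hypothetical names invented for illustration, and `exchange` simply pretends that the peer dies at step 3. In real MPI code the error would more likely surface as a return code from the blocking call (for example, after installing `MPI_ERRORS_RETURN` as the communicator's error handler), assuming the implementation actually returns at all, which is exactly the guarantee being requested.

```python
import json
import os
import tempfile


class PeerFailedError(Exception):
    """Hypothetical error raised when a blocking call's peer has died."""


def exchange(step):
    """Stand-in for a blocking MPI call; pretends the peer dies at step 3."""
    if step == 3:
        raise PeerFailedError("rank 1 is unreachable")
    return step * 2


def run(checkpoint_path, nsteps=5):
    """Run the simulation loop; on peer failure, save state and report."""
    state = {"step": 0, "total": 0}
    try:
        for step in range(nsteps):
            state["total"] += exchange(step)
            state["step"] = step + 1
    except PeerFailedError as err:
        # No attempt to continue: just save state and tell the user.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
        return f"aborting after step {state['step']}: {err}"
    return "completed"


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
message = run(path)
```

The point is that the application does nothing clever after the failure; it only needs the blocking call to come back with an error instead of hanging forever.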

So let's talk about fault detection.

It's usually easy enough for an MPI implementation to figure out when the remote peer in a blocking send/receive operation has failed, especially when the MPI is using some form of reliable network communication, because the networking layer will tell the MPI implementation when it can no longer reach a peer.
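As a rough illustration of the easy case, here is a small sketch that uses a local stream socket pair as a stand-in for a reliable transport. The `blocking_recv` helper and the `PEER_OK` / `PEER_FAILED` codes are assumptions made up for this example; they are not part of any MPI implementation. When the remote end goes away, the blocking receive returns immediately with an end-of-stream indication, which the helper maps to an error code.

```python
import socket

PEER_OK = 0
PEER_FAILED = 1  # hypothetical error code for this sketch


def blocking_recv(sock, nbytes):
    """Block until data arrives; map end-of-stream or a reset to an error."""
    try:
        data = sock.recv(nbytes)
    except ConnectionError:
        return PEER_FAILED, b""
    if data == b"":
        # A reliable stream reports an orderly close as an empty read.
        return PEER_FAILED, b""
    return PEER_OK, data


local, remote = socket.socketpair()

remote.sendall(b"hello")
first = blocking_recv(local, 5)   # peer is alive and has sent data

remote.close()                    # simulate the peer process dying
second = blocking_recv(local, 5)  # returns instead of blocking forever

local.close()
```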

...but not always. Consider:

Perhaps the network has totally failed between the two peers, such that not even negative acknowledgements (NAKs) can flow between them (i.e., one process can't tell the other that it has failed). Put differently: in the steady state of an MPI job, silence between peers rarely means process failure.
Perhaps the MPI implementation is using unreliable data transports (e.g., UDP or other unreliable datagrams). Losses are then both common and expected - meaning that NAKs can get corrupted or lost.
The remote peer may not be in the MPI library, or otherwise may not be actively sending traffic to the local peer (e.g., the remote peer may not have posted the matching send or receive yet). Again: silence may not mean failure.
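The ambiguity in these cases can be seen in a toy timeout-based detector. This is only a sketch under simplifying assumptions: `HeartbeatDetector` is a hypothetical class, the clock is passed in by hand so the example is deterministic, and "hearing from" a rank stands for any traffic at all. A rank that is alive but silent (say, busy computing before it posts its matching receive) ends up suspected just like a dead one.

```python
class HeartbeatDetector:
    """Toy detector: suspects any rank that has been silent too long."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # rank -> time of the last message from it

    def heard_from(self, rank, now):
        self.last_seen[rank] = now

    def suspected(self, now):
        """Return the ranks whose silence exceeds the timeout."""
        return sorted(
            rank
            for rank, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = HeartbeatDetector(timeout=5.0)
detector.heard_from(0, now=0.0)
detector.heard_from(1, now=0.0)
detector.heard_from(1, now=4.0)  # rank 1 keeps talking

# Rank 0 is alive but has had nothing to say since t=0.
suspects = detector.suspected(now=6.0)
```

Here rank 0 is flagged even though nothing has actually failed. Lengthening the timeout reduces false positives but delays detection of real failures.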

Many of these kinds of issues can be resolved with an "out of band" control network - e.g., the run-time system can monitor the individual processes in an MPI job, and can signal the surviving peers when one of them dies unexpectedly. But there are scalability issues with this kind of approach, too. Let's not forget prior blog entries where I have discussed scalability challenges in MPI/HPC runtime systems.
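To make the scalability concern a little more concrete, here is a deliberately naive sketch of such a monitor. `RuntimeMonitor` and the `"PEER_DIED"` message are hypothetical, and a real run-time system would more likely use a tree or some other hierarchical fan-out; this version sends one notification per survivor from a single central point, so the cost of each failure grows linearly with the size of the job.

```python
from collections import deque


class RuntimeMonitor:
    """Naive central monitor that notifies every survivor of each death."""

    def __init__(self, nranks):
        self.alive = set(range(nranks))
        self.inbox = {rank: deque() for rank in range(nranks)}
        self.messages_sent = 0

    def report_death(self, dead):
        """Record that a rank died and tell every surviving rank about it."""
        if dead not in self.alive:
            return  # already known; do not notify twice
        self.alive.discard(dead)
        for rank in self.alive:
            self.inbox[rank].append(("PEER_DIED", dead))
            self.messages_sent += 1


monitor = RuntimeMonitor(nranks=8)
monitor.report_death(3)
```

With 8 ranks, one death costs 7 messages; with hundreds of thousands of ranks, the same flat scheme becomes a bottleneck at the monitor.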

The situation gets even more complex if there are non-blocking communications ongoing involving many peers, some of whom may have failed.
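One way to picture the non-blocking case is a wait-for-everything call that reports a status per request, loosely modeled on how `MPI_Waitall` can return `MPI_ERR_IN_STATUS` and leave the details in the individual statuses. The sketch below is a simulation only: `Request`, `wait_all`, and the numeric codes are invented for illustration, and the set of failed peers is simply handed in rather than detected.

```python
from dataclasses import dataclass

SUCCESS = 0
ERR_PEER_FAILED = 1  # hypothetical per-request error code
ERR_IN_STATUS = 2    # hypothetical "check the individual statuses" code


@dataclass
class Request:
    peer: int  # rank on the other end of this operation
    tag: int


def wait_all(requests, failed_peers):
    """Complete every request, failing only those that involve a dead peer."""
    statuses = [
        ERR_PEER_FAILED if req.peer in failed_peers else SUCCESS
        for req in requests
    ]
    overall = ERR_IN_STATUS if ERR_PEER_FAILED in statuses else SUCCESS
    return overall, statuses


requests = [Request(peer=p, tag=0) for p in (1, 2, 3, 4)]
overall, statuses = wait_all(requests, failed_peers={2, 4})
```

The operations involving healthy peers still complete normally; only the requests tied to ranks 2 and 4 come back with an error.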

And it gets further complexified (I just made up that word; deal with it) when your processes fail partway through collective, dynamic process, or one-sided operations. Hardware support (potentially from the network) may be required to handle such failure detection efficiently. Or, put differently: we do not want to penalize the performance of the far-more-common case of success by adding a lot of invasive and potentially performance-costing infrastructure to check for failure during MPI operations.

I should note that a flavor of this kind of failure detection is currently included in the MPI Forum Fault Tolerance Working Group's (FTWG) proposal for MPI-4 (in addition to other FT-related provisions). This is quite promising.

But there's still much discussion that must occur; other users want more than "simple" failure detection, for example - they want some kind of recovery (different models of which are under hot debate).

What kinds of failure detection and/or recovery would you find useful in your application?
