Failure Recovery in Distributed Systems

Failure recovery strategies are driven by the requirements of the system and by the behavior of the faults it must tolerate. The literature identifies several common failure cases in distributed systems and suggests a solution for each. The main cases and their suggested solutions are outlined below.

If the server is down and the client cannot locate it, the failure can be reported to the client by raising an exception. This is not always a feasible solution, because exception handling in most programming languages involves complex logic that the client code must then carry.
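The exception-based approach can be illustrated with a minimal sketch. It assumes the client knows a list of candidate addresses and is handed a `connect` callable; the names `locate_server`, `ServerUnavailable`, and `fake_connect`, along with the example host names, are hypothetical and used only for illustration.

```python
class ServerUnavailable(Exception):
    """Raised when no candidate address accepts a connection."""


def locate_server(addresses, connect):
    """Try each address in order and return the first live connection.

    `connect` is any callable that returns a connection object or raises
    OSError. In real code it could be socket.create_connection.
    """
    failures = []
    for address in addresses:
        try:
            return connect(address)
        except OSError as exc:
            # Remember the failure and move on to the next candidate.
            failures.append((address, exc))
    raise ServerUnavailable(
        f"no server reachable, tried {len(failures)} address(es)"
    )


def fake_connect(address):
    # Hypothetical stand-in for a network call: only one host is "up".
    if address != ("backup.example", 9000):
        raise ConnectionRefusedError(f"{address} refused the connection")
    return f"connection to {address[0]}"
```

A caller would wrap `locate_server` in its own `try`/`except ServerUnavailable` block, which shows the cost described above: every call site now needs failure-handling logic of its own.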

If the client's request is lost on its way to the server, the client can start a timer when it sends the request and reissue the request if no reply arrives before the timer expires. This cannot be considered an optimal implementation: waiting on timeouts can degrade performance, and reissuing a request is only safe when the operation is idempotent, since a non-idempotent operation may end up being executed more than once. In many cases the server crashes after it has received the request, and the client then keeps waiting for a reply that never arrives.
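One common way to make retries safe for non-idempotent operations is to tag each request with an ID and have the server remember the replies it has already produced. The sketch below simulates this in a single process, with no real network or timer. All class and function names are hypothetical, and `LostMessage` stands in for an expired timeout.

```python
import uuid


class LostMessage(Exception):
    """Stands in for a timeout: no reply arrived in time."""


class AccountServer:
    """Toy server that remembers its replies by request ID."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request_id -> reply already sent

    def deposit(self, request_id, amount):
        if request_id in self.replies:
            # Duplicate request: return the old reply, do not deposit again.
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


class LossyChannel:
    """Loses the first few requests, then the first few replies."""

    def __init__(self, server, drop_requests=0, drop_replies=0):
        self.server = server
        self.drop_requests = drop_requests
        self.drop_replies = drop_replies

    def send_deposit(self, request_id, amount):
        if self.drop_requests > 0:
            self.drop_requests -= 1
            raise LostMessage("request never reached the server")
        reply = self.server.deposit(request_id, amount)
        if self.drop_replies > 0:
            self.drop_replies -= 1
            raise LostMessage("reply never reached the client")
        return reply


def deposit_with_retry(channel, amount, max_attempts=3):
    request_id = str(uuid.uuid4())  # the same ID is reused on every retry
    for _ in range(max_attempts):
        try:
            return channel.send_deposit(request_id, amount)
        except LostMessage:
            continue  # treat as a timeout and reissue the request
    raise LostMessage(f"gave up after {max_attempts} attempts")
```

In this sketch the client cannot tell a lost request from a lost reply, so it retries in both cases, and the server-side reply cache is what keeps the deposit from being applied twice.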

The best available solution for this case is to rebuild the server and also rebuild the client, so that the client's requests can then succeed. It is also possible for the server's reply to be lost before it reaches the intended client. The server is not aware of the loss and keeps pinging the client for the expected acknowledgment. The usual solution for this failure is for the client to set a maximum waiting time; if that time is exceeded, the client assumes the server is lost or busy.
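The client-side time limit can be sketched as follows, assuming replies are delivered to the client through an in-memory `queue.Queue`. The names `await_reply` and `ServerUnresponsive` are hypothetical; a real client would more likely set a timeout on its socket or RPC library.

```python
import queue


class ServerUnresponsive(Exception):
    """Raised when no reply arrives within the client's time limit."""


def await_reply(reply_queue, max_wait_seconds):
    """Block until a reply arrives or the maximum waiting time is exceeded."""
    try:
        return reply_queue.get(timeout=max_wait_seconds)
    except queue.Empty:
        # The client cannot tell whether the server is down, busy, or
        # whether the reply was lost, so it reports a single condition.
        raise ServerUnresponsive(
            f"no reply within {max_wait_seconds} seconds"
        ) from None
```

The caller decides what to do with `ServerUnresponsive`, for example reissuing the request or reporting the failure to the user.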

Thus, different failures and corresponding solutions have been identified for distributed systems. As discussed, these solutions are not optimal: they work well for a small range of failures but tend to break down when a major failure occurs at the server of a typical distributed system.

This paper is written and submitted by sai
