Quality of Service of Crash Recovery Failure Detectors Project Proposal

Project background 

Distributed systems are now widely used, and dependability is one of the most important factors against which their performance is measured. As their use has grown worldwide, a range of fault-tolerance mechanisms has been introduced to provide the required dependability and to improve the quality of the service-oriented applications that run on them. The dependability of distributed systems has been the subject of many research papers and remains an active research area. Despite this effort, service-oriented applications still face open problems, such as detecting failures accurately and identifying which components of an application have failed.

Among the techniques used to deal with failures, crash recovery is particularly important, and several crash recovery models have been proposed. Detecting failures is a routine task in a distributed system; the harder goal is to provide the desired quality of service (QoS) for failure detection, and for crash recovery in particular. A simple crash recovery procedure should identify the effect of the crash on the target, assess the level of that effect, and determine whether the target can be recovered or can heal itself. A sound crash recovery technique should use these observations to choose the appropriate recovery measures.

Whenever a crash occurs in a distributed system, the crash recovery program should stop new members from joining the system, and should allow joins again once the recovery process has completed. This is a key step in implementing reliable crash recovery and in achieving good quality of service. Crash recovery should also rely on self-repairing and self-healing techniques, and the degree of self-healing achieved can be used to measure the accuracy and reliability of the crash recovery implementation.
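The rule that new members may not join while recovery is in progress can be illustrated with a minimal sketch. Python is used here only for illustration (the project itself targets .NET), and the class name `Membership` and its methods are hypothetical names, not part of any existing library. Removing the crashed member from the group during recovery is also an assumption of this sketch.

```python
class Membership:
    """Tracks group members and refuses joins while recovery is running.

    Hypothetical illustration; names and behaviour are assumptions.
    """

    def __init__(self):
        self.members = set()
        self.recovering = False

    def begin_recovery(self, crashed_member):
        # A crash was detected: drop the crashed member and block new joins.
        self.members.discard(crashed_member)
        self.recovering = True

    def end_recovery(self, recovered_member):
        # Recovery finished: readmit the recovered member and allow joins again.
        self.members.add(recovered_member)
        self.recovering = False

    def join(self, member):
        # New members are admitted only when no recovery is in progress.
        if self.recovering:
            return False
        self.members.add(member)
        return True
```

A caller would invoke `begin_recovery` when the crash detection module reports a crash and `end_recovery` when the recovery module finishes; any `join` attempted in between returns `False` and can be retried later.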

In global-scale distributed systems, different types of crashes can occur, and detecting and recovering from them while meeting quality of service metrics is a demanding task. In this project, the quality of service requirements for a crash recovery program are evaluated by estimating the mean time to failure (MTTF) and the mean time to recovery (MTTR). The aim and objectives of the project are given below.

Aim: To evaluate the quality of service requirements for crash recovery models in distributed systems

Objectives 

  • To critically review distributed systems and their working methodologies
  • To review and investigate various crash detection and recovery techniques and evaluate their performance and limitations
  • To design quality of service metrics, such as mean time to failure (MTTF) and mean time to recovery (MTTR), for crash recovery techniques in distributed systems
  • To implement the proposed design using the .NET framework
  • To evaluate the results against the MTTF and MTTR parameters and assess the quality of service requirements for crash recovery in distributed systems
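As a rough illustration of how the two metrics named in the objectives could be computed from a log of outages, the sketch below takes MTTF as the average time the system runs before each crash and MTTR as the average time each recovery takes. Python is used only for illustration, the function name `qos_metrics` and its input format are hypothetical, and the availability figure MTTF / (MTTF + MTTR) is a standard derived measure that this proposal does not itself list.

```python
def qos_metrics(start, outages):
    """Return (MTTF, MTTR, availability) for one monitored process.

    start   -- time at which the process first came up (seconds)
    outages -- list of (crash_time, recover_time) pairs in time order
    Hypothetical helper; the input format is an assumption.
    """
    uptimes = []
    downtimes = []
    up_since = start
    for crash, recover in outages:
        uptimes.append(crash - up_since)    # time to failure for this run
        downtimes.append(recover - crash)   # time to recovery for this outage
        up_since = recover
    mttf = sum(uptimes) / len(uptimes)
    mttr = sum(downtimes) / len(downtimes)
    availability = mttf / (mttf + mttr)
    return mttf, mttr, availability


# Two outages: up for 100 s then down 10 s, up for 200 s then down 20 s.
mttf, mttr, availability = qos_metrics(0, [(100, 110), (310, 330)])
print(mttf, mttr)  # 150.0 15.0
```

The function assumes at least one outage has been recorded; with an empty list there is nothing to average and it raises `ZeroDivisionError`.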

How the objectives will be achieved 

  1. A literature review is prepared from academic articles, journals, publications, the ACM Digital Library and a few websites
  2. Distributed systems, crash detection, failure detection and crash recovery techniques are analyzed using the references mentioned above
  3. The quality of service requirements for crash recovery techniques in distributed systems are analyzed and reviewed
  4. The application is designed using the .NET framework, covering both the front end and the back end
  5. All the required screens are designed and the required database is created using SQL Server
  6. The proposed system is developed in .NET
  7. The results are evaluated and the performance of the crash recovery technique is studied against the quality of service requirements
  8. A sample distributed system running multiple applications is considered; the system is then crashed programmatically, the corresponding crash recovery module is invoked, and the required quality of service parameters, such as MTTF and MTTR, are evaluated in the implementation
  9. The results are evaluated against the aim and objectives of the project, and the work is concluded with reference to the future work to be done
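The experiment described in step 8 (crash the sample system programmatically, invoke the recovery module, then measure MTTF and MTTR) might be prototyped along the following lines. This is only a sketch under stated assumptions: Python stands in for the planned .NET implementation, a virtual clock replaces real time so that the run is repeatable, and every name (`InjectedCrash`, `FakeClock`, `worker`, `recover`, `run_experiment`) is hypothetical.

```python
class InjectedCrash(Exception):
    """Raised on purpose to simulate a crash at the program level."""


class FakeClock:
    """Virtual clock so the experiment is deterministic (an assumption)."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def worker(clock, run_for):
    # The sample application runs for a while, then the crash is injected.
    clock.advance(run_for)
    raise InjectedCrash()


def recover(clock, repair_time):
    # Stand-in for the crash recovery module; it only consumes time here.
    clock.advance(repair_time)


def run_experiment(plan):
    """plan: list of (run_for, repair_time) pairs; returns (MTTF, MTTR)."""
    clock = FakeClock()
    uptimes, repairs = [], []
    for run_for, repair_time in plan:
        started = clock.now
        try:
            worker(clock, run_for)
        except InjectedCrash:
            crashed = clock.now
            recover(clock, repair_time)
            uptimes.append(crashed - started)
            repairs.append(clock.now - crashed)
    return sum(uptimes) / len(uptimes), sum(repairs) / len(repairs)


print(run_experiment([(50, 5), (70, 15)]))  # (60.0, 10.0)
```

In a real run the virtual clock would be replaced by wall-clock timestamps, and the crash and recovery times would come from the crash detection and crash recovery modules rather than from a fixed plan.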

Reasons for selecting this project 

Distributed systems are now widely used and support a broad range of applications. Crash recovery and its quality of service requirements are essential for estimating the performance and dependability of such systems. I am interested in developing a reliable crash recovery technique that meets the required quality of service metrics, and I have chosen this topic for that reason.

Resource requirements 

Hardware Requirements: 

Processor : Pentium IV, 2.6 GHz

RAM : 512 MB

Software Requirements:

Operating System : Windows XP / Windows Server 2003

Coding Languages : C#.NET, VB.NET

Database : SQL Server 2005

Front End : Microsoft Visual Studio .NET 2008

Project Plan 

Task | Description | Start Date | End Date | Duration
Introduction | Basic introduction to the project, with the problem definition, aim and objectives. | | | 1 week
Literature review | Review of distributed systems and different crash recovery techniques; evaluation of various crash detection and recovery systems and their limitations; review of the quality of service requirements for implementing a crash recovery technique. | | | 4 weeks
Design | Front end design of the application; database design; module design. | | | 3 weeks
Coding and Implementation | Coding all the modules; establishing the database connections; implementing the modules; explanation of the important code functions. | | | 4 weeks
Evaluation of results | Results obtained from running the individual scenarios and from comparing them, explained with reference to the aim and objectives of the project. | | | 2 weeks
Conclusion and Future work | Conclusions drawn from the work done and the evaluation of results, with suggestions for improving the project in future. | | | 1 week

Deliverables

The project deliverables are as follows.

The initial report contains the following deliverables

  • Introduction to the project and problem definition
  • Literature review on distributed systems, crash detection and crash recovery techniques
  • Review on various quality of service metrics to measure the performance of the crash recovery systems 

The interim report contains the following deliverables

  • Detailed explanation of the front end design
  • Detailed explanation of the database design and the tables used
  • Explanation of the chosen QoS metrics, such as MTTF and MTTR, for the crash recovery technique

The final project report contains the following

  • Information from both the initial and interim report
  • Implementation procedure
  • Results and explanation
  • Conclusion and future work to be done
  • References used

The final product contains the following

  • Front end of the application with a rich user interface
  • Business logic with a sample distributed system
  • Crash detection module
  • Crash recovery module
  • QoS metrics module
  • Database and the corresponding tables
Paper Submitted & Written by Sathish Nagarajan
