Personal tools
You are here: Home Tutorials WORLDCOMP'15 Tutorial: Prof. Wenbing Zhao
Call For Participation
General Attendees
Click here for details

Call for Papers
Submission of

Click here for details

Important Dates
July 27-30, 2015
20 joint conferences

Featured Workshops
Doctoral Colloquium
Demos Sessions

Click here for details

« September 2015 »
Su Mo Tu We Th Fr Sa

WORLDCOMP'15 Tutorial: Prof. Wenbing Zhao

Last modified 2015-07-12 11:17

Building Dependable Distributed Systems
Prof. Wenbing Zhao
Associate Professor, Department Electrical and Computer Engineering
Cleveland State University, USA

Date & Time: July 27, 2015 (05:40pm - estimated duration: about 2+ hours)
Location: Sterling ABC Room


    Distributed computing systems are playing an ever increasingly important role in all aspects of our society, governments, businesses, and individuals alike. Such systems are behind many services on which we depend on a daily basis, such as financial (e.g., online banking and stock trading), e-commerce (e.g., online shopping), civil infrastructure (e.g., electric power grid and traffic control), entertainment (e.g, online gaming and multimedia streaming), and personal data storage (e.g., various cloud services such as Dropbox, Google Drive, and SkyDrive). The dependability of these systems no longer matters only to businesses, but matters to every one of us too.

    The cost of system failures is enormous. If a data center is brought down by a system failure, the average cost for downtime may range from $42,000 to about $300,000 per hour. The cost can be estimated by summing up the wasted expenses and the loss of revenue. While the labor cost of downtime may be estimated relatively easily (i.e., roughly, wasted expenses per hour = number of employees X average salary per hour, it is much harder to estimate the loss of revenue, especially due to the damages on the reputation of the business and the loyalty of its potential customers.

    Of course, ensuring high availability of distributed systems is not cheap. The cost of data center is estimated to range from $450 per square foot for 99.671% availability (i.e., 28.8 hours of downtime per year), to $1,100 per square foot for 99.995% availability (i.e., 0.4 hours of downtime per year). That is perhaps one reason why about 59% of Fortune 500 companies suffer from 1.6 hours or more of downtime per week. To reduce the cost of building and maintaining highly dependable systems, an effective way is to train more experts that know how to design, implement, and maintain dependable distributed systems. We hope that this tutorial helps achieve this goal.

    This tutorial will cover the most essential techniques for designing dependable distributed systems. Each technique will be explained and dissected thoroughly so that readers who are not familiar with dependable distributed computing can actually grasp the technique after participating the tutorial. This tutorial will contain the following 8 sections:

    Section 1 introduces the basic concepts and terminologies of dependable distributed computing, as well as the primary means to achieve dependability.

    Section 2 describes the checkpointing and logging mechanisms, which are widely used in practice to achieve some form of fault tolerance (they enable the recoverability of the application but do not prevent service disruption). The biggest advantages of this approach are that it is relatively simple to implement and understand, and it incurs minimum runtime overhead while demanding very modern extra resources (only stable storage). Furthermore, checkpointing and logging also serve as the foundation for more sophisticated dependability techniques. The disadvantage of this approach, if used alone, is that it cannot prevent service disruption from happening. Hence, it is not suitable to be used alone for applications that demand high reliability.

    Section 3 covers research works on recovery-oriented computing, including fault detection and diagnosis, microreboot, and system-level undo and redo. Recovery-oriented computing aims to facilitate faster recovery after a system failure and thereby improving the availability of the system. Similar to checkpointing and logging, the mechanisms for recovery-oriented computing do not prevent service disruption, hence, it is a promising approach for many e-commerce application, but not suitable for applications that require high reliability.

    Section 4 outlines the replication technique for data and service fault tolerance. This is the fundamental technique to ensure high reliability. Through active replication (i.e., the use of multiple redundant copies of the application processes), the system would be able to mask the failure of a replica and continue to process clients' requests. With replication comes the complexity of consistency issue. Ideally, the replicas should always maintain consistency with each other. However, doing so might not incur too much runtime overhead to be acceptable for some applications, or may cause extended period of system unavailability. Hence, strict consistency may have to be compromised either for better performance or for better availability.

    Section 5 explains the group communication systems, which can be used to implement active replication. A group communication system typically offers a totally ordered reliable multicast service for messages, a membership server, and a view synchrony service. These set of services help the replicas to maintain consistency even in the presence of failures, which would reduce the development cost of building dependable systems with active replication. In section, we describe in detail several well-known research works on group communication system construction with different approaches.

    Section 6 discusses the consensus problem and describes several Paxos algorithms, including the Classic Paxos, Dynamic Paxos, Cheap Paxos, and Fast Paxos. Distributed consensus is perhaps the most fundamental problem in distributed computing. While it is easy for a group of processes to agree on the same value if all processes can communicate with each other promptly and if none of them fails. However, distributed consensus is an incredibly hard problem when processes might fail and there might be extended delay to send or receive a message. The classical Paxos algorithm solves the consensus problem (under the non-malicious fault model) in a very elegant and efficient manner by separating the safety concern and the liveness concern. Additional Paxos algorithm are developed to minimize the resources required (for Cheap Paxos), and to reduce the latency for achieving consensus by using a higher redundancy level.

    Section 7 introduces the problem Byzantine fault tolerance. A Byzantine fault is synonymous with a malicious fault. Because a malicious faulty component may choose to behave like any of the non-malicious faults, the Byzantine fault model encompasses any arbitrary fault. The distributed consensus problem under the Byzantine fault model was first introduced several decades ago by Lamport, Shostak, and Pease. A much more efficient algorithm for achieving fault tolerance under the Byzantine fault model (referred to as Byzantine fault tolerance) was proposed by Castro and Liskov in 1999. Since then, the research on Byzantine fault tolerance exploded. With the pervasiveness of cyber attacks and espionages, tolerating malicious faults becomes an urgent concern now instead of being a farfetched problem several decades ago. In this section, we explain in detail several seminal works on this topic.

    Section 8 documents a few research works on the design of customized Byzantine fault tolerance solutions by exploiting the application semantics. For a general-purpose Byzantine fault tolerance algorithm, all requests are totally ordered and executed sequentially in the total order. This imposes severe restrictions on the types of applications that can be supported by the algorithm. By exploiting application semantics, the general-purpose algorithm can be customized to enable the partitioning of requests, the identifying of independent requests, read-only requests, and commutative requests, all of which facilitate concurrent execution of multiple requests. Furthermore, by enabling concurrent execution of selected requests based on the application semantics, potential deadlocks could be prevented.


    This tutorial will enable you to understand the dependability concept, and learn how to build practical dependable distributed systems. In particular, this tutorial will help you:
      • Enhance your applications with basic fault tolerance by incorporating checkinging and logging mechanisms
      • Recover quickly should your applications fail by implementing recovery oriented computing techniques
      • Significantly increase the continuous availability of your applications by adopting space redundancy
      • Ensure the consistency among the replicas by using group communication systems or by implementing the Paxos family consensus algorithms
      • Harden your mission critical applications for intrusion tolerance by incorporating Byzantine fault tolerance techniques
      • Optimize your fault tolerant applications by exploiting application semantics


    This tutorial is intended for faculty, graduate students, engineers, and scientists who want to learn how to design and implement dependable distributed computing systems. In particular, this tutorial may be interested to participants of PDPTA'15 - The 21st International Conference on Parallel and Distributed Processing Techniques and Applications.


    Dr. Zhao is currently an Associate Professor at the Department of Electrical and Computer Engineering, Cleveland State University. He eared his Ph.D. at University of California, Santa Barbara, under the supervision of Drs. Moser and Melliar-Smith, in 2002. Dr. Zhao has authored a research monograph titled: “Building Dependable Distributed Systems” and published over 50 papers in the area of fault tolerant and dependable systems (three of them won best paper awards). A selected list of publications directly relevant to the proposed tutorial are provided below (The entire publication list can be seen at:

      • Zhao, W. (2014). Building Dependable Distributed Systems. Scrivener Publishing and John Wiley & Sons.
      • Zhao, W. (2015). Fast Paxos Made Easy: Theory and Implementation. International Journal of Distributed Systems and Technologies (IJDST), 6(1), 15-33.
      • Chai, H., Zhang, H., Zhao, W., Melliar-Smith, P. M., & Moser, L. E. (2013). Toward Trustworthy Coordination of Web Services Business Activities. IEEE Transactions on Services Computing, 6(2), 276-288.
      • Zhang, H., Chai, H., Zhao, W., Melliar-Smith, P. M., & Moser, L. E. (2011). Trustworthy Coordination of Web Services Atomic Transactions. IEEE Transactions on Parallel and Distributed Systems, 23(8), 1551-1565.
      • Zhang, H., & Zhao, W. (2012). Concurrent Byzantine Fault Tolerance for Software-Transactional-Memory Based Applications. International Journal of Future Computer and Communication, 1(1), 47-50. Won the best paper award at the International Conference on Distributed Computing Engineering, July 2-3, 2012, Hong Kong.
      • Zhao, W. (2007, October). BFT-WS: A Byzantine fault tolerance framework for web services. In EDOC Conference Workshop, 2007. EDOC'07. Eleventh International IEEE (pp. 89-96). IEEE. Won the best paper award.
      • Zhao, W. (2014). Application-aware Byzantine fault tolerance, in Proceedings of the 12th IEEE International Conference on Dependable Autonomic and Secure Computing, August 24-27, Dalian, China
      • Chai, H. & Zhao, W. (2014). Byzantine fault tolerant event stream processing for autonomic computing, in Proceedings of the 12th IEEE International Conference on Dependable Autonomic and Secure Computing, August 24-27, Dalian, China
      • Chai, H. & Zhao, W. (2014). Towards trustworthy complex event processing, in Proceedings of the 5th IEEE International Conference on Software Engineering and Service Science, June 27-29, Beijing, China
      • Chai, H. & Zhao, W. (2014). Byzantine fault tolerance for services with commutative operations, in Proceedings of the 11th IEEE International Conference on Services Computing, June 27-July 2, Anchorage, Alaska.
      • Zhao, W., & Babi, M. (2013, April). Byzantine fault tolerant collaborative editing. In Information and Communications Technologies (IETICT 2013), IET International Conference on (pp. 233-240). IET.
      • Chai, H., & Zhao, W. (2013). Byzantine fault tolerance for session–oriented multi–tiered applications. International Journal of Web Science, 2(1), 113-125.
      • Chai, H., & Zhao, W. (2012). Interaction Patterns for Byzantine Fault Tolerance, in Proceedings of WSE 2012, Springer CCIS Vol. 342, pp. 180-188.
      • Zhang, H., Zhao, W., Moser, L. E., & Melliar-Smith, P. M. (2011). Design and implementation of a Byzantine fault tolerance framework for non-deterministic applications. IET software, 5(3), 342-356.
      • Zhao, W., & Zhang, H. (2009). Proactive service migration for long-running Byzantine fault-tolerant systems. IET software, 3(2), 154-164.
      • Zhao, W. (2008, December). Integrity-Preserving Replica Coordination for Byzantine Fault Tolerant Systems. In Parallel and Distributed Systems, 2008. ICPADS'08. 14th IEEE International Conference on (pp. 447-454). IEEE.
      • Zhao, W., & Villaseca, F. E. (2008, July). Byzantine fault tolerance for electric power grid monitoring and control. In Embedded Software and Systems, 2008. ICESS'08. International Conference on (pp. 129-135). IEEE.
      • Zhao, W., & Zhang, H. (2008, July). Byzantine fault tolerant coordination for web services business activities. In Services Computing, 2008. SCC'08. IEEE International Conference on (Vol. 1, pp. 407-414). IEEE.
      • Zhao, W. (2007, September). A byzantine fault tolerant distributed commit protocol. In Dependable, Autonomic and Secure Computing, 2007. DASC 2007. Third IEEE International Symposium on (pp. 37-46). IEEE.
Conference Proceedings
Get WORLDCOMP'13 & '14 Proceedings
Click Here

Past Events
Click Here

Click Here

Click Here

Click Here

Click Here

Click Here

WORLDCOMP'06, '07, & '08
Click Here

Photo Galleries

Join Our Mailing List
Sign up to receive email announcements and updates about conferences and future events


Administered by UCMSS
Universal Conference Management Systems & Support
San Diego, California, USA
Contact: Kaveh Arbtan

If you can read this text, it means you are not experiencing the Plone design at its best. Plone makes heavy use of CSS, which means it is accessible to any internet browser, but the design needs a standards-compliant browser to look like we intended it. Just so you know ;)