Resilient Computer System Design

This book presents a paradigm for designing a new generation of resilient and evolving computer systems, covering key concepts, elements of the supporting theory, methods for the analysis and synthesis of ICT with new properties of evolving functioning, and implementation schemes and their prototyping. The book explains why new ICT applications require a complete redesign of computer systems to address the challenges of extreme reliability, high performance, and power efficiency. The authors present a comprehensive treatment of the design of the next generation of computers, with particular attention to safety-critical, autonomous, real-time, military, banking, and wearable health care systems.

Software Design for Resilient Computer Systems

This book addresses the question of how system software should be designed to account for faults, and which fault tolerance features it should provide for the highest reliability. This second edition of Software Design for Resilient Computer Systems is thoroughly updated with the newest advice on software resilience. With additional chapters on computer system performance and system resilience, as well as online resources, the new edition is ideal for researchers and industry professionals. The authors first show how the system software interacts with the hardware to tolerate faults. They analyze and further develop the theory of fault tolerance to understand the different ways to increase the reliability of a system, with special attention to the role of system software in this process. They further develop the generalized algorithm of fault tolerance (GAFT) with its three main processes: hardware checking, preparation for recovery, and the recovery procedure. For each of the three processes, they analyze the requirements and properties theoretically and give possible implementation scenarios and the system software support required. Based on the theoretical results, the authors derive an Oberon-based programming language with direct support for the three processes of GAFT. In the last part of the book, they introduce a simulator and use it as a proof-of-concept implementation of a novel fault-tolerant processor architecture (ERRIC) and its newly developed runtime system, in terms of both features and performance. Due to the wide-reaching nature of the content, this book applies to a host of industries and research areas, including military, aviation, intensive health care, industrial control, and space exploration.

Principles of Computer System Design

An Introduction

Principles of Computer System Design is the first textbook to take a principles-based approach to computer system design. It identifies, examines, and illustrates fundamental concepts in computer system design that are common across operating systems, networks, database systems, distributed systems, programming languages, software engineering, security, fault tolerance, and architecture. Through carefully analyzed case studies from each of these disciplines, it demonstrates how to apply these concepts to tackle practical system design problems. To support the focus on design, the text identifies and explains abstractions that have proven successful in practice, such as remote procedure call, client/service organization, file systems, data integrity, consistency, and authenticated messages. Most computer systems are built using a handful of such abstractions. The text describes how these abstractions are implemented, demonstrates how they are used in different systems, and prepares the reader to apply them in future designs. The book is recommended for junior and senior undergraduate students in Operating Systems, Distributed Systems, Distributed Operating Systems, and/or Computer Systems Design courses, and for professional computer systems designers. Features: Concepts of computer system design guided by fundamental principles. Cross-cutting approach that identifies abstractions common to networking, operating systems, transaction systems, distributed systems, architecture, and software engineering. Case studies that make the abstractions real: naming (DNS and the URL); file systems (the UNIX file system); clients and services (NFS); virtualization (virtual machines); scheduling (disk arms); security (TLS). Numerous pseudocode fragments that provide concrete examples of abstract concepts. Extensive support: the authors and MIT OpenCourseWare provide online, free of charge, open educational resources, including additional chapters, course syllabi, board layouts and slides, lecture videos, and an archive of lecture schedules, class assignments, and design projects.

Active System Control

Design of System Resilience

This book introduces an approach to active system control design and development to improve the properties of our technological systems. It extends concepts of control and data accumulation by explaining how the system model should be organized to improve the properties of the system under consideration. The authors define these properties as reliability, performance and energy efficiency, and self-adaptation. They describe how they bridge the gap between data accumulation and analysis in terms of interpolation with the real physical models when data are used to interpret the system conditions. The authors introduce a principle of active system control and safety, an approach that explains what a model of a system should have in order to make computer systems more efficient. This is a crucial new concern in application domains such as safety-critical, embedded, and low-power autonomous systems like transport, healthcare, and other dynamic systems with moving substances and elements. On a theoretical level, this book further extends the concept of fault tolerance, introducing a system level of design for improving overall efficiency. On a practical level, it illustrates how an active system approach might help our systems become self-evolving.

Reliability of Computer Systems and Networks

Fault Tolerance, Analysis, and Design

With computers becoming embedded as controllers in everything from network servers to the routing of subway schedules to NASA missions, there is a critical need to ensure that systems continue to function even when a component fails. In this book, bestselling author Martin Shooman draws on his expertise in reliability engineering and software engineering to provide a complete and authoritative look at fault tolerant computing. He clearly explains all fundamentals, including how to use redundant elements in system design to ensure the reliability of computer systems and networks. Market: Systems and Networking Engineers, Computer Programmers, IT Professionals.
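
As a brief illustration of what "redundant elements" can mean in practice, the following is a minimal Python sketch of triple modular redundancy (TMR), a standard textbook technique; it is not taken from the book itself. The function names `majority_vote` and `tmr_reliability` are hypothetical, and the reliability formula assumes three identical modules that fail independently plus a perfect voter.

```python
def majority_vote(a, b, c):
    """Return the value on which at least two of the three module outputs agree."""
    if a == b or a == c:
        return a
    # Either b == c, or all three disagree; in the latter case there is no
    # majority and this sketch simply returns b.
    return b


def tmr_reliability(r):
    """Reliability of a TMR system built from modules of reliability r.

    Assumes independent module failures and a perfect voter, so the system
    works when all three modules work (r**3) or exactly two do
    (3 * r**2 * (1 - r)), which simplifies to 3r^2 - 2r^3.
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    # One faulty module (the middle one) is masked by the other two.
    print(majority_vote(1, 0, 1))
    # Three 0.9-reliable modules give a more reliable system than one alone.
    print(tmr_reliability(0.9))
```

Under these assumptions the redundancy only helps when each module is better than a coin flip: at r = 0.5 the formula gives exactly 0.5, and below that the voted system is worse than a single module.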

Software Engineering for Resilient Systems

8th International Workshop, SERENE 2016, Gothenburg, Sweden, September 5-6, 2016, Proceedings

This book constitutes the refereed proceedings of the 8th International Workshop on Software Engineering for Resilient Systems, SERENE 2016, held in Gothenburg, Sweden, in September 2016. The 10 papers presented were carefully reviewed and selected from 15 submissions. They cover the following areas: development of resilient systems; incremental development processes for resilient systems; requirements engineering and re-engineering for resilience; frameworks, patterns and software architectures for resilience; engineering of self-healing autonomic systems; design of trustworthy and intrusion-safe systems; resilience at run-time (mechanisms, reasoning and adaptation); resilience and dependability (resilience vs. robustness, dependable vs. adaptive systems); verification, validation and evaluation of resilience; modeling and model based analysis of resilience properties; formal and semi-formal techniques for verification and validation; experimental evaluations of resilient systems; quantitative approaches to ensuring resilience; resilience prediction; case studies and applications; empirical studies in the domain of resilient systems; methodologies adopted in industrial contexts; cloud computing and resilient service provisioning; resilience for data-driven systems (e.g., big data-based adaption and resilience); resilient cyber-physical systems and infrastructures; global aspects of resilience engineering: education, training and cooperation.

Resilient Architecture Design for Voltage Variation

Shrinking feature size and diminishing supply voltage are making circuits sensitive to supply voltage fluctuations within the microprocessor, caused by normal workload activity changes. If left unattended, voltage fluctuations can lead to timing violations or even transistor lifetime issues that degrade processor robustness. Mechanisms that learn to tolerate, avoid, and eliminate voltage fluctuations based on program and microarchitectural events can help steer the processor clear of danger, thus enabling tighter voltage margins that improve performance or lower power consumption. We describe the problem of voltage variation and the factors that influence this variation during processor design and operation. We also describe a variety of runtime hardware and software mitigation techniques that tolerate, avoid, and/or eliminate voltage violations. We hope processor architects will find the information useful since tolerance, avoidance, and elimination are generalizable constructs that can serve as a basis for addressing other reliability challenges as well. Table of Contents: Introduction / Modeling Voltage Variation / Understanding the Characteristics of Voltage Variation / Traditional Solutions and Emerging Solution Forecast / Allowing and Tolerating Voltage Emergencies / Predicting and Avoiding Voltage Emergencies / Eliminating Recurring Voltage Emergencies / Future Directions on Resiliency

Introduction to Noise-Resilient Computing

Noise abatement is the key problem of small-scale circuit design. New computational paradigms are needed -- as these circuits shrink, they become very vulnerable to noise and soft errors. In this lecture, we present a probabilistic computation framework for improving the resiliency of logic gates and circuits under random conditions induced by voltage or current fluctuation. Among many probabilistic techniques for modeling such devices, only a few models satisfy the requirements of efficient hardware implementation -- specifically, Boltzmann machines and Markov Random Field (MRF) models. These models have similar built-in noise-immunity characteristics based on feedback mechanisms. In probabilistic models, the values 0 and 1 of logic functions are replaced by degrees of belief that these values occur. An appropriate metric for degree of belief is probability. We discuss various approaches for noise-resilient logic gate design, and propose a novel design taxonomy based on implementation of the MRF model by a new type of binary decision diagram (BDD), called a cyclic BDD. In this approach, logic gates and circuits are designed using 2-to-1 bi-directional switches. Such circuits are often modeled using Shannon expansions with the corresponding graph-based implementation, BDDs. Simulation experiments are reported to show the noise immunity of the proposed structures. Audiences who may benefit from this lecture include graduate students taking classes on advanced computing device design, and academic and industrial researchers. Table of Contents: Introduction to probabilistic computation models / Nanoscale circuits and fluctuation problems / Estimators and Metrics / MRF Models of Logic Gates / Neuromorphic models / Noise-tolerance via error correcting / Conclusion and future work
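
For readers unfamiliar with the Shannon expansion mentioned above, the following minimal Python sketch shows how any Boolean function can be evaluated using only 2-to-1 switches, one decision variable per level, as in an ordinary (acyclic, deterministic) BDD. It does not model the cyclic BDD or the probabilistic MRF behaviour that the lecture proposes, and the names `mux`, `shannon_eval`, `majority`, and `matches_direct` are hypothetical.

```python
from itertools import product


def mux(select, when_zero, when_one):
    """2-to-1 switch: pass when_one if select is 1, otherwise when_zero."""
    return when_one if select else when_zero


def shannon_eval(func, num_vars, inputs, fixed=()):
    """Evaluate func(*inputs) by Shannon expansion, one variable per level.

    f(x, ...) = mux(x, f(0, ...), f(1, ...)); the cofactors f(0, ...) and
    f(1, ...) are expanded recursively until every variable is fixed.
    """
    if len(fixed) == num_vars:
        return func(*fixed)  # leaf: a constant 0 or 1
    low = shannon_eval(func, num_vars, inputs, fixed + (0,))
    high = shannon_eval(func, num_vars, inputs, fixed + (1,))
    return mux(inputs[len(fixed)], low, high)


def majority(a, b, c):
    """Example function: 1 when at least two of the three inputs are 1."""
    return int(a + b + c >= 2)


def matches_direct(func, num_vars):
    """Check the switch-based evaluation against func on every input."""
    return all(
        shannon_eval(func, num_vars, bits) == func(*bits)
        for bits in product((0, 1), repeat=num_vars)
    )


if __name__ == "__main__":
    print(matches_direct(majority, 3))
```

This sketch expands the full decision tree, so it uses a number of switches exponential in the number of variables; a real BDD shares identical subgraphs to keep the structure compact.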