While Bayesian networks have flourished in academia over the past three decades, their application in research has developed more slowly. One reason has been the sheer difficulty of generating Bayesian networks for practical research and analytics use. For many years, researchers had to create their own software to utilize Bayesian networks. Needless to say, this made Bayesian networks inaccessible to the vast majority of scientists.
The launch of BayesiaLab 1.0 in 2002 was a major initiative by a newly formed French company to address this challenge. The development team, led by Dr. Lionel Jouffe and Dr. Paul Munteanu, designed BayesiaLab with research practitioners in mind—rather than fellow computer scientists. First and foremost, practitioner orientation is reflected in the graphical user interface of BayesiaLab, which allows researchers to work interactively with Bayesian networks in their native form using graphs, as opposed to working with computer code. At the time of writing, BayesiaLab is approaching its sixth major release and has developed into a software platform that provides a comprehensive “laboratory” environment for many research questions.
However, the point-and-click convenience of BayesiaLab does not relieve one of the duty of understanding the fundamentals of Bayesian networks when conducting sound research. With BayesiaLab making Bayesian networks accessible to a much broader audience than ever, demand for the corresponding training has grown tremendously. We recognized the need for a book that supports a self-guided exploration of this field. This book aims to provide a practice-oriented introduction to both Bayesian networks and BayesiaLab.
This book reflects the inherently visual nature of Bayesian networks. Hundreds of illustrations and screenshots provide a tutorial-style explanation of BayesiaLab’s core functions. Particularly important steps are repeatedly shown in the context of different examples. The key objective is to provide the reader with step-by-step instructions for transitioning from Bayesian network theory to fully functional implementations in BayesiaLab.
The fundamentals of the Bayesian network formalism are linked to numerous disciplines, including computer science, probability theory, information theory, logic, machine learning, and statistics. Also, in terms of applications, Bayesian networks can be utilized in virtually all disciplines. Hence, we meander across many fields of study with the examples presented in this book. Ultimately, we will show how all of them relate to the Bayesian network paradigm. At the same time, we present BayesiaLab as the technology platform, allowing the reader to move immediately from theory to practice. Our goal is to use practical examples to reveal the Bayesian network theory and simultaneously teach the BayesiaLab technology.
The three short chapters in Part 1 of the book aim to provide a basic familiarity with Bayesian networks and BayesiaLab, from which the reader should feel comfortable jumping into any of the subsequent chapters. Part 1 could serve as an executive summary for a cursory observer of this field.
Chapter 1 provides motivation for using Bayesian networks from the perspective of analytical modeling.
Chapter 2 is adapted from Pearl and Russell (2000) and introduces the Bayesian network formalism and semantics.
Chapter 3 presents a brief overview of the BayesiaLab software platform and its core functions.
The chapters in Part 2 are mostly self-contained tutorials, which can be studied out of sequence. However, beyond Chapter 8, we assume a certain degree of familiarity with BayesiaLab’s core functions.
In Chapter 4, we discuss how to encode causal knowledge in a Bayesian network for subsequent probabilistic reasoning. In fact, this is the field in which Bayesian networks gained prominence in the 1980s in the context of building expert systems.
Chapter 5 introduces data and information theory as a foundation for subsequent chapters. BayesiaLab’s data handling techniques, such as the Data Import Wizard, including Discretization, are presented in this context. Furthermore, we describe a number of information-theoretic measures that will subsequently be required for machine learning and network analysis.
Chapter 6 introduces BayesiaLab’s Supervised Learning algorithms for predictive modeling in the context of a classification task in the field of cancer diagnostics.
Chapter 7 demonstrates BayesiaLab’s Unsupervised Learning algorithms for knowledge discovery from financial data.
Chapter 8 builds on these machine-learning methods and shows a prototypical research workflow for creating a Probabilistic Structural Equation Model for a market research application.
Chapter 9 deals with missing values, which are typically not of principal research interest but adversely affect most studies. BayesiaLab leverages the conceptual advantages of machine learning and Bayesian networks to reliably impute missing values.
Chapter 10 closes the loop by returning to the topic of causality, which we first introduced in Chapter 4. We examine approaches for identifying and estimating causal effects from observational data. Simpson’s Paradox serves as an example for this study.
Chapter 11 showcases a practical implementation of the causal concepts introduced in the previous chapter. Marketing mix modeling and optimization serve as a prototypical example.
Chapter 12 illustrates the complexity of the seemingly straightforward concepts of attribution and contribution. Bayesian networks can clarify their meaning and make their computation possible.
BayesiaLab methods, functions, and components are capitalized and shown in bold type, e.g., Data Import Wizard. This helps distinguish between natural language expressions, such as "parameter estimation" in general and the Parameter Estimation function in BayesiaLab in particular. Furthermore, concepts that are central to BayesiaLab, such as Entropy and Mutual Information, are emphasized in the same fashion, even though they are not exclusive to the BayesiaLab software.
Hyperlinks are blue, e.g., BayesiaLab User Guide.
User interactions with menus, e.g., the Main Menu or any of the Context Menus, are highlighted with a gray background: Main Menu > Learning > Supervised Learning > Markov Blanket.
Keyboard shortcuts are marked with a gray background, e.g., Ctrl+S.
By Stefan Conrady and Dr. Lionel Jouffe
We released our first book on Bayesian networks and BayesiaLab at the 3rd Annual BayesiaLab Conference in Fairfax, Virginia, in October of 2015. Among BayesiaLab users, it soon became known simply as "the book" and served as their principal reference. For students of Bayesian networks, it emerged as a very popular textbook, a kind of Bayesian Networks 101.
Beyond the hardcopy, which remains available on Amazon, we have been offering our book as a free PDF, which has been downloaded over 30,000 times since its launch.
However, with the rapid development of new features in BayesiaLab, it's been impossible to keep the book up to date with current screenshots, etc. The BayesiaLab user interface has also undergone a major redesign, and even the BayesiaLab-specific terminology has evolved since 2015. While working intermittently on drafts of a long-promised second edition, we turned our book into a "living document" that can be updated in sync with the software and the user manual.
The book's new version is now embedded on our website. As a result, all the cross-references in the original book have been converted into hyperlinks that can instantly take you to datasets, networks, videos, and other related learning resources.
Despite the ease by which you can jump from the book to the manual and tutorials, you can still follow the linear structure of a traditional book. Countless readers said they enjoyed reading it cover-to-cover, just like a novel.
With Professor Judea Pearl receiving the prestigious 2011 A.M. Turing Award, Bayesian networks have presumably received more public recognition than ever before. Judea Pearl’s achievement of establishing Bayesian networks as a new paradigm is fittingly summarized by Stuart Russell (2011):
“[Judea Pearl] is credited with the invention of Bayesian networks, a mathematical formalism for defining complex probability models, as well as the principal algorithms used for inference in these models. This work not only revolutionized the field of artificial intelligence but also became an important tool for many other branches of engineering and the natural sciences. He later created a mathematical framework for causal inference that has had significant impact in the social sciences.”
There are numerous paths we could take to motivate the use of Bayesian networks. A selection of quotes illustrates that we could approach Bayesian networks from many different perspectives, such as machine learning, probability theory, or knowledge management.
“Bayesian networks are as important to AI and machine learning as Boolean circuits are to computer science.” (Stuart Russell in Darwiche, 2009)
“Bayesian networks are to probability calculus what spreadsheets are for arithmetic.” (Conrady and Jouffe, 2015)
“Currently, Bayesian Networks have become one of the most complete, self-sustained and coherent formalisms used for knowledge acquisition, representation and application through computer systems.” (Bouhamed et al., 2015)
In this first chapter, however, we approach Bayesian networks from the viewpoint of analytical modeling. Given today’s enormous interest in analytics, we wish to relate Bayesian networks to traditional analytic methods from the field of statistics and, furthermore, compare them to more recent innovations in data mining. This context is particularly important given the attention that Big Data and related technologies receive these days. Their dominance in terms of publicity perhaps drowns out some other important methods of scientific inquiry, whose relevance becomes evident by employing Bayesian networks.
Once we have established how Bayesian networks fit into the “world of analytics,” Chapter 2 explains the mathematical formalism that underpins the Bayesian network paradigm. For an authoritative account, Chapter 2 is largely based on a technical report by Judea Pearl. While employing Bayesian networks for research has become remarkably easy with BayesiaLab, we need to emphasize the importance of theory. A solid understanding of this theory will allow researchers to correctly employ Bayesian networks.
Finally, Chapter 3 concludes the first part of this book with an overview of the BayesiaLab software platform. We show how the theoretical properties of Bayesian networks translate into a capable research tool for many fields of study, ranging from bioinformatics to marketing science and beyond.
Following the ideas of Breiman (2001) and Shmueli (2010), we create a "map of analytic modeling" that is defined by two axes:
The x-axis reflects the Modeling Purpose, ranging from Association/Correlation to Causation. Labels on the x-axis indicate a conceptual progression, including Description, Prediction, Explanation, Simulation, and Optimization.
The y-axis represents the Model Source, i.e., the source of the model specification. Model Source ranges from Theory (bottom) to Data (top). Theory is also tagged with Parametric as the predominant modeling approach. Additionally, it is tagged with Human Intelligence, hinting at the origin of Theory. On the opposite end of the y-axis, Data is associated with Machine Learning and Artificial Intelligence. It is also tagged with Algorithmic in contrast to Parametric modeling.
Needless to say, this map displays a highly simplified view of the world of analytics. Despite this caveat, we will use this map and its coordinate system to position different modeling approaches.
Many of today’s predictive modeling techniques are algorithmic and would fall mostly into Quadrant 2. In Quadrant 2, a researcher would be primarily interested in the predictive performance of a model, i.e., Y is of interest.
Neural networks are a typical example of machine-learning techniques applied in this context. Such models often lack theory. However, they can be excellent “statistical devices” for producing predictions.
In Quadrant 4, the researcher is interested in identifying a model structure that best reflects the underlying “true” data-generating process, i.e., we are looking for an explanatory model. Thus, in the relationship Y = f(X), the function f is of greater interest than the prediction Y.
Traditional statistical techniques that have an explanatory purpose and are used in epidemiology and the social sciences would mostly belong in Quadrant 4. Regressions are the best-known models in this context. Extending further into the causal direction, we would progress into the field of operations research, including simulation and optimization.
Despite the diverging objectives of predictive modeling versus explanatory modeling, i.e., predicting Y versus understanding f, the respective methods are not necessarily incompatible. In our map, this is suggested by the blue boxes that gradually fade out as they cross the boundaries and extend beyond their “home” quadrant. However, the best-performing modeling approaches rarely serve predictive and explanatory purposes equally well. In many situations, the optimal fit-for-purpose models remain very distinct from each other. In fact, Shmueli (2010) has shown that a structurally “less true” model can yield better predictive performance than the “true” explanatory model.
We should also point out that recent advances in machine learning and data mining have mostly occurred in Quadrant 2 and disproportionately benefited predictive modeling. Unfortunately, most machine-learned models are remarkably difficult to interpret in terms of their structural meaning, so new theories are rarely generated this way. For instance, the well-known Netflix Prize competition produced well-performing predictive models but yielded little explanatory insight into the structural drivers of choice behavior.
Conversely, in Quadrant 4, it remains difficult to machine-learn explanatory models. Unlike Quadrant 2, the availability of ever-increasing amounts of data is not necessarily an advantage for discovering theory through machine learning.
Concerning the horizontal division between Theory and Data on the Model Source axis, Bayesian networks have a special characteristic. Bayesian networks can be built from human knowledge, i.e., from Theory, or they can be machine-learned from Data. Thus, they can use the entire spectrum as Model Source.
Also, due to their graphical structure, machine-learned Bayesian networks are visually interpretable, therefore promoting human learning and theory building. As indicated by the bi-directional arc in the following diagram, Bayesian networks allow human learning and machine learning to work in tandem, i.e., Bayesian networks can be developed from a combination of human and artificial intelligence.
Beyond crossing the boundaries between Theory and Data, Bayesian networks also have special qualities concerning causality. Under certain conditions and with specific theory-driven assumptions, Bayesian networks facilitate causal inference. In fact, Bayesian network models can cover the entire range from Association/Correlation to Causation, spanning the entire x-axis of the map below. In practice, this means that we can add causal assumptions to an existing non-causal network and, thus, create a causal Bayesian network. This is particularly important when we try to simulate an intervention in a domain, such as estimating the effects of a treatment. Working with a causal model is imperative in this context, and Bayesian networks help us make that transition.
As a result, Bayesian networks are a versatile modeling framework suitable for many problem domains. The mathematical formalism underpinning the Bayesian network paradigm will be presented in the next chapter.
This chapter is based mainly on Pearl and Russell (2000) and was adapted with permission.
Probabilistic Graphical Models based on Directed Acyclic Graphs (DAG) have a long and rich tradition, beginning with the work of geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such models are known as directed graphical models; within cognitive science and artificial intelligence, such models are known as Bayesian networks. The name honors the Rev. Thomas Bayes (1702-1761), whose rule for updating probabilities in the light of new evidence is the foundation of the approach.
Bayesian Networks are also known as Bayesian Belief Networks (BBN for short) or Bayes Nets. All these variations are entirely equivalent. However, in this book, we exclusively use the term Bayesian Network.
The initial development of Bayesian networks in the late 1970s was motivated by the necessity of modeling top-down (semantic) and bottom-up (perceptual) combinations of evidence for inference. The capability for bidirectional inference, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks. They became the method of choice for uncertain reasoning in artificial intelligence and expert systems, replacing earlier, ad hoc rule-based schemes.
Bayesian networks are models that consist of two parts: a qualitative part and a quantitative part.
The qualitative part is a Directed Acyclic Graph (DAG) that specifies the dependencies between variables.
Nodes represent variables of interest (e.g., the temperature of a device, the gender of a patient, a feature of an object, or the occurrence of an event). Such nodes can correspond to symbolic/categorical variables, numerical variables with discrete values, or discretized continuous variables.
The quantitative part is based on local probability distributions for specifying the probabilistic relationships between nodes.
The local probability distributions can be either marginal for nodes without parents (Root Nodes) or conditional for nodes with parents. In the latter case, the dependencies are quantified by Conditional Probability Tables (CPT) for each node given its parents in the graph.
Once fully specified, a Bayesian network compactly represents a Joint Probability Distribution (JPD) and, thus, can be used for computing the posterior probabilities of any subset of variables given evidence about any other subset.
Perhaps the most important aspect of Bayesian networks is that they are direct representations of the world, not of reasoning processes. The arrows in the diagram represent real causal connections and not the flow of information during reasoning (as in rule-based systems and neural networks). Reasoning processes can operate on Bayesian networks by propagating information in any direction. For example, if the sprinkler is on, the pavement is probably wet (prediction, simulation). If someone slips on the pavement, that will also provide evidence that it is wet (abduction, reasoning to a probable cause, or diagnosis). On the other hand, if we see that the pavement is wet, that will make it more likely that the sprinkler is on or that it is raining (abduction); but if we then observe that the sprinkler is on, that will reduce the likelihood that it is raining (explaining away). It is the latter form of reasoning, explaining away, that is especially difficult to model in rule-based systems and neural networks in a natural way because it seems to require the propagation of information in two directions.
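These reasoning patterns, including explaining away, can be made concrete with a brute-force enumeration sketch. The probabilities below are hypothetical, chosen only to illustrate the patterns; this is not a BayesiaLab computation:

```python
from itertools import product

# Minimal sprinkler/rain/wet-pavement network with hypothetical probabilities:
# Sprinkler and Rain are independent causes; Wet is their common effect.
P_sprinkler = {True: 0.3, False: 0.7}
P_rain = {True: 0.2, False: 0.8}
P_wet = {  # P(Wet=True | Sprinkler, Rain)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint(s, r, w):
    """Chain-rule factorization: P(s) * P(r) * P(w | s, r)."""
    pw = P_wet[(s, r)]
    return P_sprinkler[s] * P_rain[r] * (pw if w else 1 - pw)

def posterior_rain(evidence):
    """P(Rain=True | evidence), computed by enumerating the joint distribution."""
    num = den = 0.0
    for s, r, w in product([True, False], repeat=3):
        assignment = {"Sprinkler": s, "Rain": r, "Wet": w}
        if all(assignment[k] == v for k, v in evidence.items()):
            p = joint(s, r, w)
            den += p
            if r:
                num += p
    return num / den

p1 = posterior_rain({"Wet": True})                     # wet pavement raises P(Rain) above its 0.2 prior
p2 = posterior_rain({"Wet": True, "Sprinkler": True})  # the sprinkler "explains away" the rain
print(p1, p2)
```

With these numbers, observing the wet pavement roughly doubles the probability of rain, while additionally observing that the sprinkler is on pushes it back down, which is exactly the explaining-away pattern described above.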
Entities that live in a changing environment must keep track of variables whose values change over time.
Dynamic Bayesian networks (DBN) are a generalization of Hidden Markov Models (HMM) and Kalman Filters (KF). Every HMM and KF can be represented with a DBN. Furthermore, the DBN representation of an HMM is much more compact and, thus, much easier to understand. The nodes in the HMM represent the states of the system, whereas the nodes in the DBN represent the dimensions of the system. For example, the HMM representation of the valve system shown in the following graph is made of 26 nodes and 36 arcs versus 9 nodes and 11 arcs in the DBN (Weber and Jouffe, 2003).
Any complete probabilistic model of a domain must — either explicitly or implicitly — represent the Joint Probability Distribution (JPD), i.e., the probability of every possible event as defined by the combination of the values of all the variables.
The global semantics of Bayesian networks specifies that the full Joint Probability Distribution (JPD) is given by the product rule (or chain rule):
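In symbols, writing parents(X_i) for the parent set of X_i in the DAG, the standard form of this factorization is:

```latex
P(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{parents}(X_i)\bigr)
```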
In the five-node sprinkler network of Pearl and Russell (2000), where X1 denotes the season, X2 the sprinkler, X3 the rain, X4 the wet pavement, and X5 a slippery pavement, this yields P(x1, x2, x3, x4, x5) = P(x1) P(x2 | x1) P(x3 | x1) P(x4 | x2, x3) P(x5 | x4).
It becomes clear that the number of parameters grows linearly with the size of the network, i.e., the number of variables. In contrast, the size of the Joint Probability Distribution (JPD) itself grows exponentially. Given a discrete representation of the CPD with a Conditional Probability Table (CPT), the size of a local CPD grows exponentially with the number of parents. Savings can be achieved using compact CPD representations—such as noisy-OR models, trees, or neural networks.
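These growth rates are easy to verify with a short sketch; the helper below is hypothetical, not a BayesiaLab function:

```python
def cpt_size(states, parent_states):
    """Number of free parameters in one node's CPT: one distribution
    (states - 1 free entries) per parent configuration."""
    n_configs = 1
    for s in parent_states:
        n_configs *= s
    return n_configs * (states - 1)

# A binary node's CPT grows exponentially with its number of binary parents:
for k in range(5):
    print(k, cpt_size(2, [2] * k))  # 1, 2, 4, 8, 16

# The network's total parameter count grows only linearly in the number of
# nodes when the in-degree stays bounded (here: 10 binary nodes, each with
# at most 2 binary parents).
total = sum(cpt_size(2, [2, 2]) for _ in range(10))
print(total)  # 40
```

By contrast, the full JPD over 10 binary variables would require 2^10 - 1 = 1,023 parameters, which illustrates the compactness of the factored representation.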
The collection of independence assertions formed in this way suffices to derive the global assertion of the product rule and vice versa. The local semantics is most useful for constructing Bayesian networks because selecting as parents all the direct causes (or direct relationships) of a given variable invariably satisfies the local conditional independence conditions. The global semantics leads directly to a variety of algorithms for reasoning.
From the product rule, one can express the probability of any desired proposition in terms of the conditional probabilities specified in the network. For example, the probability that the Sprinkler is on given that the Pavement is slippery is:
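Assuming the node numbering of Pearl and Russell's (2000) sprinkler network (X1 = season, X2 = sprinkler, X3 = rain, X4 = wet pavement, X5 = slippery pavement), this posterior can be written as:

```latex
P(X_2 = \mathrm{on} \mid X_5 = \mathrm{true})
= \frac{\sum_{x_1, x_3, x_4} P(x_1, X_2{=}\mathrm{on}, x_3, x_4, X_5{=}\mathrm{true})}
       {\sum_{x_1, x_2, x_3, x_4} P(x_1, x_2, x_3, x_4, X_5{=}\mathrm{true})}
```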
These expressions can often be simplified in ways that reflect the structure of the network itself. The first algorithms proposed for probabilistic calculations in Bayesian networks used a local distributed message-passing architecture, typical of many cognitive activities. Initially, this approach was limited to tree-structured networks but was later extended to general networks in Lauritzen and Spiegelhalter’s (1988) method of junction tree propagation. A number of other exact methods have been developed and can be found in recent textbooks.
It is easy to show that reasoning in Bayesian networks subsumes the satisfiability problem in propositional logic and, therefore, exact inference is NP-hard. Monte Carlo simulation methods can be used for approximate inference (Pearl, 1988), giving gradually improving estimates as sampling proceeds. Unlike junction-tree methods, these methods use local message propagation on the original network structure. Alternatively, variational methods provide bounds on the true probability.
Causal networks are more properly defined, then, as Bayesian networks in which the correct probability model — after intervening to fix any node’s value — is given simply by deleting links from the node’s parents. For example, Fire → Smoke is a causal network, whereas Smoke → Fire is not, even though both networks are equally capable of representing any Joint Probability Distribution (JPD) of the two variables.
Causal networks model the environment as a collection of stable component mechanisms. These mechanisms may be reconfigured locally by interventions, with corresponding local changes in the model. This, in turn, allows causal networks to be used very naturally for prediction by an agent that is considering various courses of action.
In pure Bayesian approaches, Bayesian networks are designed from expert knowledge and include hyperparameter nodes. Data (usually scarce) is used as pieces of evidence for incrementally updating the distributions of the hyperparameters (Bayesian Updating).
It is also possible to machine learn the structure of a Bayesian network, and two families of methods are available for that purpose. The first one, using constraint-based algorithms, is based on the probabilistic semantics of Bayesian networks. Links are added or deleted according to the results of statistical tests, which identify marginal and conditional independencies. The second approach, using score-based algorithms, is based on a metric that measures the quality of candidate networks with respect to the observed data. This metric trades off network complexity against the degree of fit to the data, which is typically expressed as the likelihood of the data given the network.
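The score-based family can be illustrated with the BIC score, a close relative of the MDL criterion used by BayesiaLab. The sketch below compares two candidate structures on synthetic data; all names and the data itself are hypothetical:

```python
import math
from collections import Counter

def bic(data, model):
    """BIC score = MLE log-likelihood - (k/2) * log N, where k is the number
    of free parameters; higher is better. `model` lists (child, parents)."""
    N = len(data)
    ll, k = 0.0, 0
    for child, parents in model:
        counts = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
        totals = Counter(tuple(r[p] for p in parents) for r in data)
        for (cfg, _val), n in counts.items():
            ll += n * math.log(n / totals[cfg])  # log-likelihood at the MLE
        states = len({r[child] for r in data})
        configs = 1
        for p in parents:
            configs *= len({r[p] for r in data})
        k += configs * (states - 1)              # complexity penalty term
    return ll - 0.5 * k * math.log(N)

# Synthetic data in which B strongly depends on A (B == A in 80% of records).
data = [{"A": a, "B": a if i % 5 else 1 - a} for i, a in enumerate([0, 1] * 50)]
independent = [("A", []), ("B", [])]
dependent   = [("A", []), ("B", ["A"])]
print(bic(data, independent), bic(data, dependent))
```

The structure with the arc A → B earns a higher score: its better fit to the data outweighs the cost of the extra CPT parameters, which is precisely the trade-off described above.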
As a substrate for learning, Bayesian networks have the advantage that it is relatively easy to encode prior knowledge in network form by fixing portions of the structure, forbidding relations, or using prior distributions over the network parameters. Such prior knowledge can allow a system to learn accurate models from much less data than is required for clean sheet approaches.
One of the most exciting prospects in recent years has been the possibility of using Bayesian networks to discover causal structures in raw statistical data—a task previously considered impossible without controlled experiments. Consider, for example, the following intransitive pattern of dependencies among three events: A and B are dependent, B and C are dependent, yet A and C are independent. If you asked a person to supply an example of three such events, the example would invariably portray A and C as two independent causes and B as their common effect, namely A → B ← C. For instance, A and C could be the outcomes of two fair coins, and B represents a bell that rings whenever either coin comes up heads.
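The coins-and-bell example can be checked with a short simulation. The sketch below verifies both the marginal independence of A and C and their dependence once B is observed:

```python
import random

random.seed(0)

# Simulate the two-coins-and-a-bell example: A and C are fair coins, and
# the bell B rings whenever either coin comes up heads (A -> B <- C).
samples = []
for _ in range(200_000):
    a = random.random() < 0.5
    c = random.random() < 0.5
    samples.append((a, a or c, c))

def prob(pred, given=lambda s: True):
    """Conditional relative frequency P(pred | given) in the sample."""
    sel = [s for s in samples if given(s)]
    return sum(pred(s) for s in sel) / len(sel)

p_c          = prob(lambda s: s[2])                           # ~ 0.5
p_c_given_a  = prob(lambda s: s[2], lambda s: s[0])           # ~ 0.5: A tells us nothing about C
p_c_given_b  = prob(lambda s: s[2], lambda s: s[1])           # ~ 2/3: dependent given the bell
p_c_given_ab = prob(lambda s: s[2], lambda s: s[1] and s[0])  # ~ 0.5: A "explains away" C
print(p_c, p_c_given_a, p_c_given_b, p_c_given_ab)
```

The pattern is exactly the intransitive one described above: A and C are independent, each is dependent on B, and conditioning on B renders A and C dependent on each other.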
Fitting this dependence pattern with a scenario where B is the cause and A and C are the effects is mathematically feasible but unnatural because it must entail fine-tuning the probabilities involved. The desired dependence pattern will be destroyed as soon as the probabilities change slightly.
Such thought experiments tell us that certain patterns of dependency, which are totally void of temporal information, are conceptually characteristic of certain causal directionalities and not others. When put together systematically, such patterns can be used to infer causal structures from raw data and to guarantee that any alternative structure compatible with the data must be less stable than the one(s) inferred; namely, slight fluctuations in parameters will render that structure incompatible with the data.
Despite recent advances, causal discovery is an active research area with countless unresolved questions. Thus, no generally accepted causal discovery algorithms are currently available for applied researchers. As a result, all causal networks presented in this book are constructed from expert knowledge or machine learning and then validated as causal by experts. The assumptions necessary for a causal interpretation of a Bayesian network will be discussed in Chapter 10.
For machine learning with BayesiaLab, concepts derived from information theory, such as entropy and mutual information, are particularly important and should be understood by the researcher. However, these measures are not nearly as familiar to most scientists as common statistical measures, e.g., covariance and correlation.
We present a straightforward research task to introduce these presumably unfamiliar information-theoretic concepts. The objective is to establish the predictive importance of a range of variables concerning a target variable. The domain of this example is residential real estate, and we wish to examine the relationships between home characteristics and sales prices. In this context, it is natural to ask questions related to variable importance, such as, which is the most important predictive variable pertaining to home value? By attempting to answer this question, we can explain what entropy and mutual information mean in practice and how BayesiaLab computes these measures. In this process, we also demonstrate a number of BayesiaLab’s data-handling functions.
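As a preview, both measures can be computed from raw frequencies in a few lines. The variables below are a hypothetical stand-in for the real-estate data, not the Ames dataset itself:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) in bits, estimated from sample frequencies."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Toy stand-in: discretized home size vs. price bracket.
size  = ["small", "small", "large", "large", "small", "large", "small", "large"]
price = ["low",   "low",   "high",  "high",  "low",   "high",  "high",  "low"]

print(entropy(price))                   # 1 bit for a 50/50 split
print(mutual_information(size, price))  # > 0: size is informative about price
```

A mutual information of zero would mean that knowing a home's size reduces our uncertainty about its price bracket not at all; the larger the value, the more important the predictor.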
The dataset for this chapter’s exercise describes the sale of individual residential properties in Ames, Iowa, from 2006 to 2010. It contains a total of 2,930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous). This dataset was first used by De Cock (2011) as an educational tool for statistics students. The objective of their study was the same as ours, i.e., modeling sale prices as a function of the property attributes.
To make this dataset more convenient for demonstration purposes, we reduced the total number of variables to 49. This pre-selection was fairly straightforward as numerous variables essentially do not apply to homes in Ames, e.g., variables relating to pool quality and pool size (there are practically no pools) or roof material (it is the same for virtually all homes).
De Cock, Dean. "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project." Journal of Statistics Education, 19(3), 2011.
While the conceptual advantages of Bayesian networks had been known in the world of academia for some time, leveraging these properties for practical research applications was very difficult for non-computer scientists before BayesiaLab’s first release in 2002.
BayesiaLab is a powerful desktop application (Windows/macOS/Unix/Linux) with a sophisticated graphical user interface, which provides scientists with a comprehensive “laboratory” environment for machine learning, knowledge modeling, diagnosis, analysis, simulation, and optimization. With BayesiaLab, Bayesian networks have become practical for gaining deep insights into problem domains. BayesiaLab leverages the inherently graphical structure of Bayesian networks for exploring and explaining complex problems. The screenshot below shows a typical research project.
BayesiaLab is the result of nearly twenty years of research and software development by Dr. Lionel Jouffe and Dr. Paul Munteanu. In 2001, their research efforts led to the formation of Bayesia S.A.S., headquartered in Laval in northwestern France. Today, the company is the world’s leading supplier of Bayesian network software, serving hundreds of major corporations and research organizations around the world.
As conceptualized in the diagram below, BayesiaLab is designed around a prototypical workflow with a Bayesian network model at the center. BayesiaLab supports the research process from model generation to analysis, simulation, and optimization. The entire process is fully contained in a uniform “lab” environment, which allows scientists to move back and forth between different elements of the research task.
The following "map of analytic modeling and reasoning" shows how our claim of “universal modeling capability” translates into specific functions provided by BayesiaLab, which are placed as blue boxes on the map.
Subject matter experts often express their causal understanding of a domain through diagrams with arrows indicating causal directions. This visual representation of causes and effects has a direct analog in the network graph in BayesiaLab. Nodes (representing variables) can be added and positioned on BayesiaLab’s Graph Panel with a mouse click, and arcs (representing relationships) can be “drawn” between nodes. As in the following network graph, the causal direction can be encoded by orienting the arcs from cause to effect.
BayesiaLab contains all “parameters” describing probabilistic relationships between variables in conditional probability tables (CPT), which means that no functional forms are utilized. Given this nonparametric, discrete approach, BayesiaLab can conveniently handle nonlinear relationships between variables. However, this CPT-based representation requires a preparation step for dealing with continuous variables, namely discretization. This consists of manually or automatically defining a discrete representation of all continuous values. BayesiaLab offers several tools for discretization, which are accessible in the Data Import Wizard, in the Node Editor (shown below), and in a standalone Discretization function. Univariate, bivariate, and multivariate discretization algorithms are available in this context.
Parameter Estimation with BayesiaLab is at the intersection of theory-driven and data-driven modeling. For a network that was generated either from expert knowledge or through machine learning, BayesiaLab can use the observations contained in an associated dataset to populate the CPT via Maximum Likelihood Estimation.
In general, Bayesian networks are nonparametric models. However, a Bayesian network can also serve as a parametric model if an expert uses equations for defining local CPDs and, additionally, specifies hyperparameters, i.e., nodes that explicitly represent parameters that are used in the equations.
In this case, instead of serving for parameter estimation via Maximum Likelihood, the associated dataset provides pieces of evidence for incrementally updating, via probabilistic inference, the distributions of the hyperparameters.
Despite our repeated emphasis on the relevance of human expert knowledge, especially for identifying causal relations, much of this book is dedicated to acquiring knowledge from data through machine learning. BayesiaLab features a comprehensive array of highly optimized learning algorithms that can quickly uncover structures in datasets. The optimization criteria in BayesiaLab’s learning algorithms are based on information theory (e.g., the Minimum Description Length). With that, no assumptions regarding the variable distributions are made. These algorithms can be used for all kinds and all sizes of problem domains, sometimes including thousands of variables with millions of potentially relevant relationships.
In statistics, “unsupervised learning” is typically understood to be a classification or clustering task. To make a clear distinction, we emphasize “structural” in “Unsupervised Structural Learning,” which covers a number of important algorithms in BayesiaLab.
Unsupervised Structural Learning means that BayesiaLab can discover probabilistic relationships between many variables without having to specify input or output nodes. One might say that this is a quintessential form of knowledge discovery, as no assumptions are required to perform these algorithms on unknown datasets.
Supervised Learning in BayesiaLab has the same objective as many traditional modeling methods, i.e., to develop a model for predicting a target variable. Note that numerous statistical packages also offer “Bayesian Networks” as a predictive modeling technique. However, in most cases, these packages are restricted in their capabilities to one type of network, i.e., the Naive Bayes network. BayesiaLab offers a much greater number of Supervised Learning algorithms to search for the Bayesian network that best predicts the target variable while also considering the complexity of the resulting network.
Clustering in BayesiaLab covers both Data Clustering and Variable Clustering. The former applies to the grouping of records (or observations) in a dataset; the latter performs a grouping of variables according to the strength of their mutual relationships.
The third variation of this concept is of particular importance in BayesiaLab: Multiple Clustering can be characterized as a nonlinear, nonparametric, and nonorthogonal factor analysis. Multiple Clustering often serves as the basis for developing Probabilistic Structural Equation Models (Quadrant 3/4) with BayesiaLab.
The inherent ability of Bayesian networks to explicitly model uncertainty makes them suitable for a broad range of real-world applications. In the Bayesian network framework, diagnosis, prediction, and simulation are identical computations. They all consist of observational inference conditional upon evidence:
Inference from cause to effect: simulation or prediction.
Inference from effect to cause: diagnosis or abduction.
This distinction, however, only exists from the perspective of the researcher, who would presumably see the symptom of a disease as the effect and the disease itself as the cause. Hence, carrying out inference based on observed symptoms is interpreted as a “diagnosis.”
One of the central benefits of Bayesian networks is that they compute inference “omnidirectionally.” Given an observation with any type of evidence on any of the network’s nodes (or a subset of nodes), BayesiaLab can compute the posterior probabilities of all other nodes in the network, regardless of arc direction. Both exact and approximate observational inference algorithms are implemented in BayesiaLab. We briefly illustrate evidence-setting and inference with the expert system network shown below:
Hard Evidence
Hard Evidence has no uncertainty regarding the state of the variable (node), e.g., P(Smoker=True)=100%.
Probabilistic Evidence
Probabilistic Evidence (or Soft Evidence) is defined by marginal probability distributions: P(Bronchitis=True)=66.67%.
Numerical Evidence
Numerical Evidence for numerical variables or for categorical/symbolic variables that have associated numerical values. BayesiaLab computes a marginal probability distribution to generate the specified expected value: E(Age)=39.
Likelihood Evidence
Likelihood Evidence (or Virtual Evidence) is defined by the likelihood of each state, ranging from 0%, i.e., the state is impossible, to 100%, i.e., the evidence does not reduce the probability of the state. To be valid as evidence, the sum of the likelihoods must be greater than 0. Also, note that the upper boundary for the sum of the likelihoods equals the number of states. Setting the same likelihood for all states corresponds to no evidence.
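Hard Evidence and Likelihood Evidence are two ends of the same update rule: the posterior is proportional to the prior times the per-state likelihood. A minimal sketch, with an illustrative prior distribution (not taken from an actual BayesiaLab model):

```python
# Posterior ∝ prior × likelihood, renormalized.
# The prior below is invented for illustration.
prior = {"True": 0.3, "False": 0.7}  # e.g., marginal P(Bronchitis)

def apply_likelihood(prior, likelihood):
    """Combine a marginal distribution with per-state likelihoods."""
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    norm = sum(unnormalized.values())  # must be > 0 for valid evidence
    return {s: v / norm for s, v in unnormalized.items()}

# Hard evidence: all but one state receive likelihood 0.
hard = apply_likelihood(prior, {"True": 1.0, "False": 0.0})
# Equal likelihoods on all states correspond to no evidence at all.
none = apply_likelihood(prior, {"True": 0.5, "False": 0.5})
```

Here, `hard` assigns probability 1 to the observed state, while `none` reproduces the prior unchanged.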
Many research activities focus on estimating the size of an effect, e.g., to establish the treatment effect of a new drug or to determine the sales boost from a new advertising campaign. Other studies attempt to decompose observed effects into their causes, i.e., they perform attribution.
BayesiaLab performs simulations to compute effects, as parameters as such do not exist in this nonparametric framework. As all the domain dynamics are encoded in discrete CPTs, effect sizes only manifest themselves when different conditions are simulated. Total Effects Analysis, Target Mean Analysis, and several other functions offer ways to study effects, including nonlinear and variable interactions.
BayesiaLab’s ability to perform inference over all possible states of all nodes in a network also provides the basis for searching for node values that optimize a target criterion. BayesiaLab’s Target Dynamic Profile and Target Optimization are a set of tools for this purpose.
BayesiaLab provides a range of functions for systematically utilizing the knowledge contained in a Bayesian network. They make a network accessible as an expert system that can be queried interactively by an end-user or through an automated process.
The Adaptive Questionnaire guides the user through the optimum sequence for seeking evidence. BayesiaLab determines dynamically, given the evidence already gathered, the next best piece of evidence to obtain in order to maximize the information gain with respect to the target variable while minimizing the cost of acquiring such evidence. In a medical context, for instance, this would allow for the optimal “escalation” of diagnostic procedures from “low-cost/small-gain” evidence (e.g., measuring the patient’s blood pressure) to “high-cost/large-gain” evidence (e.g., performing an MRI scan). The Adaptive Questionnaire will be presented in the context of an example of tumor classification in Chapter 6.
The WebSimulator is a platform for publishing interactive models and Adaptive Questionnaires via the web, which means that any Bayesian network model built with BayesiaLab can be shared privately with clients or publicly with a broader audience. Once a model is published via the WebSimulator, end users can try out scenarios and examine the dynamics of that model.
Batch Inference is available for automatically performing inference on a large number of records in a dataset. For example, Batch Inference can be used to produce a predictive score for all customers in a database. With the same objective, BayesiaLab’s optional Export function can translate predictive network models into static code that can run in external programs. Modules are available that can generate code for R, SAS, PHP, VBA, and JavaScript.
Developers can also access many of BayesiaLab’s functions—outside the graphical user interface—using the Bayesia Engine API. The Bayesia Modeling Engine allows constructing and editing networks. The Bayesia Inference Engine can access network models programmatically for performing automated inference, e.g., as part of a real-time application with streaming data. The Bayesia Engine API was recently augmented with a model learning capability, which allows programmatic access to BayesiaLab's learning algorithms. This functionality is ideally suited for machine-learning models from streaming data.
The Bayesia Engine API is implemented as pure Java class libraries (jar files), which can be integrated into any software project.
While generating a Bayesian network, either by expert knowledge modeling or through machine learning, is all about a computer acquiring knowledge, a Bayesian network can also be a remarkably powerful tool for humans to extract or “harvest” knowledge. Given that a Bayesian network can serve as a high-dimensional representation of a real-world domain, BayesiaLab allows us to interactively—even playfully—engage with this domain to learn about it. Through visualization, simulation, and analysis functions, plus the graphical nature of the network model itself, BayesiaLab becomes an instructional device that can effectively retrieve and communicate the knowledge contained within the Bayesian network. As such, BayesiaLab becomes a bridge between artificial intelligence and human intelligence.
Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated case of continuous probability distributions. In the discrete case, Bayes’ theorem relates the conditional and marginal probabilities of events A and B, provided that the probability of B is not equal to zero:

P(A|B) = P(B|A) × P(A) / P(B)
In Bayes’ theorem, each probability has a conventional name: P(A) is the prior probability (or “unconditional” or “marginal” probability) of A. It is “prior” in the sense that it does not take into account any information about B; however, event B need not occur after event A. In the nineteenth century, the unconditional probability P(A) in Bayes’ rule was called the “antecedent” probability; in deductive logic, the antecedent set of propositions and the inference rule imply consequences. Sir Ronald A. Fisher called the unconditional probability “a priori.”
P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B;
P(B|A) is the conditional probability of B, given A. It is also called the likelihood;
P(B) is the prior or marginal probability of B and acts as a normalizing constant;
P(B|A)/P(B) is the Bayes factor or likelihood ratio.
Bayes' theorem in this form describes how the conditional probability of event A given B is related to the converse conditional probability of B given A.
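To make the arithmetic concrete, here is a small numeric sketch of Bayes’ theorem in Python. The probabilities describe a hypothetical disease A and symptom B and are invented for illustration:

```python
# Hypothetical numbers: a disease with 1% prevalence, a symptom seen in
# 90% of patients with the disease and in 10% of patients without it.
p_a = 0.01                # P(A): prior probability of the disease
p_b_given_a = 0.90        # P(B|A): likelihood of the symptom given the disease
p_b_given_not_a = 0.10    # P(B|not A)

# P(B), the normalizing constant, via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.0833 — still small, despite the strong symptom
```

Note how the low prior keeps the posterior below 10% even with a fairly sensitive symptom.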
Directed arcs represent statistical (informational) or causal dependencies between the nodes. The directions are used to define "kinship" relations, i.e., parent-child relationships. For example, with an arc from X to Y, X is the parent node of Y, and Y is the child node.
The above illustration shows a simple Bayesian network consisting of only two nodes and one arc. It represents the Joint Probability Distribution (JPD) of the variables X and Y in a population of students (Snee, 1974).
In this case, the conditional probabilities of Y, given the values of its parent node X, are provided in a Conditional Probability Table (CPT). It is important to point out that this Bayesian network does not contain any causal assumptions, i.e., we do not know the causal order between the variables. Thus, the interpretation of this network should be merely statistical (informational).
The above graph illustrates another simple yet typical Bayesian network. In contrast to the statistical relationships in the non-causal example, this graph describes the causal relationships among the seasons of the year (X1), whether it is raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5). Here, the absence of a direct link between X1 and X5, for example, captures our understanding that there is no direct influence of season on slipperiness. The influence is mediated by the wetness of the pavement (if freezing were possible, a direct link could be added).
Dynamic Bayesian networks capture this process by representing multiple copies of the state variables, one for each time step. A set of variables X_t and X_{t+1} denotes the world state at times t and t+1, respectively. A set of evidence variables E_t denotes the observations available at time t. The sensor model is encoded in the conditional probability distributions for the observable variables, given the state variables. The transition model relates the state at time t+1 to the state at time t. Keeping track of the world means computing the current probability distribution over world states given all past observations, i.e., P(X_t | E_1, …, E_t).
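Keeping track of the world in this way is a recursive “predict, then correct” computation. A minimal sketch for a two-state weather model follows; the transition and sensor probabilities are invented for illustration:

```python
# A minimal forward-filtering sketch for a two-state dynamic model.
states = ["rain", "dry"]
transition = {"rain": {"rain": 0.7, "dry": 0.3},   # P(X_{t+1} | X_t)
              "dry":  {"rain": 0.3, "dry": 0.7}}
sensor = {"rain": 0.9, "dry": 0.2}                 # P(umbrella observed | X_t)

def filter_step(belief, umbrella_observed):
    """One recursive update of P(X_t | E_1..E_t): predict, then correct."""
    predicted = {s: sum(belief[p] * transition[p][s] for p in states)
                 for s in states}
    likelihood = {s: sensor[s] if umbrella_observed else 1 - sensor[s]
                  for s in states}
    unnormalized = {s: likelihood[s] * predicted[s] for s in states}
    norm = sum(unnormalized.values())
    return {s: v / norm for s, v in unnormalized.items()}

belief = {"rain": 0.5, "dry": 0.5}
for obs in [True, True]:          # two consecutive umbrella observations
    belief = filter_step(belief, obs)
print(round(belief["rain"], 3))   # 0.883
```

Each call folds one new observation into the running belief state, which is exactly the bookkeeping a dynamic Bayesian network automates.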
There are exponentially many such events, yet Bayesian networks achieve compactness by factoring the Joint Probability Distribution (JPD) into local, conditional distributions for each variable given its parents. If x_i denotes some value of the variable X_i and pa_i denotes some set of values for the parents of X_i, then P(x_i | pa_i) denotes this conditional probability distribution.
For example, in the graph below, P(x4 | x2, x3) is the probability of Wetness given the values of Sprinkler and Rain.
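The factored representation is easy to sketch in code. All CPT numbers below are invented for illustration; the structure follows the Season/Rain/Sprinkler/Wet/Slippery graph:

```python
from itertools import product

# Hypothetical CPTs; binary states except Season.
P_season = {"dry": 0.6, "wet": 0.4}               # P(X1)
P_rain = {"dry": 0.1, "wet": 0.7}                 # P(X2=True | X1)
P_sprinkler = {"dry": 0.5, "wet": 0.1}            # P(X3=True | X1)
P_wet = {(True, True): 0.99, (True, False): 0.9,  # P(X4=True | X2, X3)
         (False, True): 0.9, (False, False): 0.0}
P_slippery = {True: 0.8, False: 0.0}              # P(X5=True | X4)

def bernoulli(p, value):
    return p if value else 1.0 - p

def joint(season, rain, sprinkler, wet, slippery):
    """P(x1..x5) as the product of the local conditional distributions."""
    return (P_season[season]
            * bernoulli(P_rain[season], rain)
            * bernoulli(P_sprinkler[season], sprinkler)
            * bernoulli(P_wet[(rain, sprinkler)], wet)
            * bernoulli(P_slippery[wet], slippery))

# The factorization defines a proper JPD: all entries sum to 1.
total = sum(joint(s, r, sp, w, sl)
            for s in P_season
            for r, sp, w, sl in product([True, False], repeat=4))
```

Five small local tables thus stand in for a full table of 2 × 2^4 joint entries, and the saving grows exponentially with the number of variables.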
The Joint Probability Distribution (JPD) representation with Bayesian networks also translates into local semantics, which asserts that each variable is independent of its non-descendants in the network given its parents. For example, the parents of X4 in the following graph are X2 and X3, and they render X4 independent of the remaining non-descendant, X1:

P(x4 | x1, x2, x3) = P(x4 | x2, x3)
Most probabilistic models, including general Bayesian networks, describe a Joint Probability Distribution (JPD) over possible observed events but say nothing about what will happen if a certain intervention occurs. For example, what if I turn the Sprinkler on instead of just observing that it is turned on? What effect does that have on the Season or the connection between Wet and Slippery? A causal network, intuitively speaking, is a Bayesian network with the added property that the parents of each node are its direct causes. In such a network, the result of an intervention is obvious: the Sprinkler node is set to On (X3 = On), and the causal link between the Season (X1) and the Sprinkler (X3) is removed. All other causal links and conditional probabilities remain intact, so the new model is:

P(x1, x2, x4, x5 | do(X3 = On)) = P(x1) P(x2 | x1) P(x4 | x2, X3 = On) P(x5 | x4)
Notice that this differs from observing that X3 = On, which would result in a new model that included the term P(X3 = On | x1). This mirrors the difference between seeing and doing: after observing that the Sprinkler is on, we wish to infer that the Season is dry, that it probably did not rain, and so on. An arbitrary decision to turn on the Sprinkler should not result in any such beliefs.
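The difference between seeing and doing can be sketched on just the Season and Sprinkler fragment of the network; all probabilities are invented for illustration:

```python
# Two-node fragment Season -> Sprinkler, with hypothetical probabilities.
P_season = {"dry": 0.6, "wet": 0.4}        # P(Season)
P_sprinkler_on = {"dry": 0.5, "wet": 0.1}  # P(Sprinkler=On | Season)

# Seeing: P(Season | Sprinkler=On) via Bayes' theorem.
unnorm = {s: P_season[s] * P_sprinkler_on[s] for s in P_season}
norm = sum(unnorm.values())
seeing = {s: v / norm for s, v in unnorm.items()}

# Doing: do(Sprinkler=On) severs the link Season -> Sprinkler,
# so the Season distribution is unchanged: P(Season | do(On)) = P(Season).
doing = dict(P_season)

print(round(seeing["dry"], 3))  # 0.882: seeing the sprinkler on suggests a dry season
print(doing["dry"])             # 0.6: intervening tells us nothing about the season
```

Observation updates our belief about the cause; intervention does not.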
Given a qualitative Bayesian network structure, the conditional probability tables, P(x_i | pa_i), are typically estimated with the maximum-likelihood approach from the observed frequencies in the dataset associated with the network.
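Maximum-likelihood estimation of a CPT amounts to computing relative frequencies. A minimal sketch for a hypothetical Rain/Wet pair of variables, with invented counts:

```python
from collections import Counter

# Invented observations of (rain, wet) pairs.
data = ([(True, True)] * 27 + [(True, False)] * 3 +
        [(False, True)] * 10 + [(False, False)] * 60)
counts = Counter(data)

def mle_wet_given_rain(rain):
    """Maximum-likelihood estimate of P(Wet=True | Rain=rain):
    the observed relative frequency within the matching rows."""
    n_rain = counts[(rain, True)] + counts[(rain, False)]
    return counts[(rain, True)] / n_rain

print(mle_wet_given_rain(True))   # 27/30 = 0.9
print(mle_wet_given_rain(False))  # 10/70 ≈ 0.143
```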
In the , we described the application of Bayesian networks for evidential reasoning. All available knowledge was manually encoded in the Bayesian network in that example. In this chapter, we additionally use data for defining Bayesian networks. This provides the basis for the following chapters, which will present applications that utilize machine-learning for generating Bayesian networks entirely from data.
In , we presented our principal motivation for using Bayesian networks, namely their universal suitability across the entire “map” of analytic modeling: Bayesian networks can be modeled from pure theory, and they can be learned from data alone; Bayesian networks can serve as predictive models, and they can represent causal relationships.
The quantitative nature of relationships between variables, plus many other attributes, can be managed in BayesiaLab’s Node Editor. In this way, BayesiaLab facilitates the straightforward encoding of one’s understanding of a domain. Simultaneously, BayesiaLab enforces internal consistency so that impossible conditions cannot be encoded accidentally. will present a practical example of causal knowledge modeling, followed by probabilistic reasoning.
In addition to having individuals directly encode their explicit knowledge in BayesiaLab, the Bayesia Expert Knowledge Elicitation Environment (BEKEE) is available for acquiring the probabilities of a network from a group of experts. BEKEE offers a web-based interface for systematically eliciting explicit and tacit knowledge from multiple stakeholders.
BayesiaLab offers a range of sophisticated methods for missing values processing. During network learning, BayesiaLab performs missing values processing automatically “behind the scenes.” More specifically, the Structural EM algorithm or the Dynamic Imputation algorithms are applied after each network modification during learning, i.e., after every arc addition, suppression, and inversion. Bayesian networks provide a few fundamental advantages for dealing with missing values. In , we will focus exclusively on this topic.
We should highlight the Markov Blanket algorithm for its speed, which is particularly helpful when dealing with many variables. The Markov Blanket algorithm can be an efficient variable selection algorithm in this context. An example of Supervised Learning using this algorithm, and the closely related Augmented Markov Blanket algorithm, will be presented in .
Beyond observational inference, BayesiaLab can also perform causal inference for computing the impact of intervening on a subset of variables instead of merely observing these variables. Pearl’s Do-Operator and Jouffe’s Likelihood Matching are available for this purpose. We will provide a detailed discussion of causal inference in .
Using these functions in combination with Direct Effects is particularly interesting when searching for the optimum combination of variables with a nonlinear relationship with the target, plus co-relations between them. A typical example would be searching for the optimum mix of marketing expenditures to maximize sales. BayesiaLab’s Target Optimization will search, within the specified constraints, for those scenarios that optimize the target criterion. An example of Target Dynamic Profile will be presented in .
The Node Names displayed by default are taken directly from the column header of the imported dataset. To keep the Graph Panel uncluttered, we will keep these "short" names as the formal Node Names. On the other hand, we may want to have longer, more descriptive names available when interpreting the network or presenting it to an audience.
BayesiaLab offers three levels of "node names" for each node:
The Node Name uniquely identifies a node and is displayed by default.
A Long Name can be displayed instead of the Node Name on the Graph Panel, on the Monitors in the Monitor Panel, on reports, and in the context of many analysis functions.
A Node Comment provides additional space for supplemental information about a node. For instance, if nodes represent survey responses, the Node Comment could accommodate the verbatim survey question.
Long Names can be added to a network in two ways:
One by one for each node via the Properties tab of the Node Editor (Node Context Menu > Edit > Properties
).
Using a Dictionary to provide Long Names for multiple nodes at once.
Given that we want to apply Long Names to 49 nodes, using a Dictionary will be much more convenient. The format of a Dictionary is rather straightforward:
We define a plain text file that includes one Node Name per line. Spaces and special characters in the Node Name require backslash "\" as an escape character.
Each Node Name is followed by a delimiter (“=”, tab, or space) and then by the Long Name.
Here is a preview of the Dictionary:
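For illustration, entries in such a Dictionary file might look like the following (hypothetical entries, not the actual contents of “AmesLongNames.txt”):

```
SalePrice=Sale Price
Lot\ Area=Lot Area (square feet)
Garage\ Type=Type of Garage
```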
You can download the complete Dictionary file here:
To attach this Dictionary, select Main Menu > Data > Associate Dictionary > Node > Long Names
.
Next, we select the Dictionary file, “AmesLongNames.txt”.
Upon loading the Dictionary file, the appearance of the network does not change. Only if an error occurred would a warning triangle appear in the lower right corner of the Graph Window. Also, any error details would be available in the Console.
We now have the option of turning on the Long Names for individual nodes or all nodes. For our purposes, we want to see the Long Names on all nodes:
Select all nodes, e.g., using Ctrl+A.
Node Context Menu > Properties > Rendering Properties > Show Long Name
.
Check the Show Long Name box in the pop-up window:
Click OK.
Instead of the "short" Node Names, BayesiaLab now displays the Long Names for all nodes.
As the first step, we start BayesiaLab’s Data Import Wizard by selecting Main Menu > Data > Open Data Source > Text File
.
Then, we select the file “AmesHousePriceData.csv”, a comma-delimited, flat text file, which you can download here:
This brings up the first screen of the Data Import Wizard, which previews the to-be-imported dataset.
For this example, the coding options for Missing Values and Filtered Values are particularly important. By default, BayesiaLab lists commonly used codes that indicate an absence of data, e.g., #NUL! or NR (non-response). In the Ames dataset, a blank field (“ ”) indicates a Missing Value, and “FV” stands for Filtered Value. These are recognized automatically. If other codes were used, we could add them to the respective lists on this screen.
Clicking Next, we proceed to the screen that allows us to define variable types.
BayesiaLab scans all variables in the database and provides a best guess regarding the variable type. Variables identified as Continuous are shown in turquoise, and those identified as Discrete are highlighted in pastel red.
In BayesiaLab, a Continuous variable contains a wide range of numerical values (discrete or continuous), which need to be transformed into a more limited number of discrete states. Some other variables in the database only have very few distinct numerical values to begin with, e.g., [1,2,3,4,5], and BayesiaLab automatically recognizes such variables as Discrete. For them, the number of numerical states is small enough that creating bins of values is unnecessary. Also, variables containing text values are automatically considered Discrete.
For this dataset, however, we need to make a number of adjustments to the suggested data types. For instance, we set all numerical variables to Continuous, including those highlighted in red that were originally identified as Discrete. As a result, all columns in the data preview of the Data Import Wizard are now shown in turquoise.
Given that our database contains some missing values, we need to select the type of Missing Values Processing in the next step. Instead of using ad hoc methods, such as pairwise or listwise deletion, BayesiaLab can leverage more sophisticated techniques and provide estimates (or temporary placeholders) for such missing values—without discarding any original data.
We will discuss Missing Values Processing in detail in Chapter 9. For this example, however, we leave the default setting of Structural EM.
At this point, however, we must introduce a very special type of missing value for which we must not generate any estimates. We are referring to so-called Filtered Values. These are “impossible” values that do not or cannot exist given a specific set of evidence, as opposed to values that do exist but are not observed. For example, for a home that does not have a garage, there cannot be any value for the variable Garage Type, such as Attached to Home, Detached from Home, or Basement Garage. If there is no garage, there cannot be a garage type. As a result, it makes no sense to calculate an estimate of a Filtered Value. In a database, unfortunately, a Filtered Value typically looks identical to a “true” missing value, i.e., one that does exist but is not observed. The database typically contains the same code, such as a blank, NULL, N/A, etc., for both cases.
Therefore, instead of “normal” missing values, which can be left as-is in the database, we must mark Filtered Values with a specific code, e.g., “FV.” The Filtered Value declaration should be done during data preparation before importing any data into BayesiaLab. BayesiaLab will then add a Filtered State (marked with “*”) to the discrete states of the variables with Filtered Values and utilize a special approach for actively disregarding such Filtered States so that they are not taken into account during machine learning or for estimating effects.
As the next step in the Data Import Wizard, all Continuous values must be discretized (or binned). We show a sequence of screenshots to highlight the necessary steps. The initial view of the Discretization and Aggregation step appears.
By default, the first column is highlighted, which happens to be SalePrice, the variable of principal interest in this example. Instead of selecting any available automatic discretization algorithms, we pick Manual from the Type drop-down menu, which brings up the Cumulative Distribution Function (CDF) of the SalePrice variable.
By clicking Density Function, we can bring up the Probability Density Function (PDF) of SalePrice.
Either view allows us to examine the distribution and identify any salient points. We stay on the current screen to set the thresholds for each discretization bin. In many instances, we would use an algorithm to define bins automatically unless the variable will serve as the target variable. In that case, we usually rely on available expert knowledge to define the binning. In this example, we wish to have evenly-spaced, round numbers for the interval boundaries. We add boundaries by right-clicking on the plot (right-clicking on an existing boundary removes it again). Furthermore, we can fine-tune a threshold’s position by entering a precise value in the Threshold Value field. We use {75000, 150000, 225000, 300000} as the interval boundaries.
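Once these boundaries are set, each raw SalePrice value maps to one of five bins. A quick sketch of that mapping follows; note that the exact boundary convention (which side of an interval is closed) may differ from BayesiaLab’s:

```python
import bisect

# Interval boundaries chosen above for SalePrice
thresholds = [75000, 150000, 225000, 300000]

def discretize(price):
    """Return the index of the bin (0..4) that contains `price`."""
    return bisect.bisect_left(thresholds, price)

print(discretize(125000))  # 1: the 75,000-150,000 bin
```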
Now that we have manually discretized the target variable SalePrice (column highlighted), we still need to discretize the remaining continuous variables. However, we will take advantage of an automatic discretization algorithm for those variables.
We click Select All Continuous. BayesiaLab automatically excludes SalePrice from this selection because we have already discretized it.
Numerous automatic discretization algorithms are available, but for the purpose of this example, we only consider the bivariate Tree discretization algorithm.
Please see the main entry for Discretization in this library for a detailed description of all available algorithms.
As its name suggests, the Tree discretization algorithm machine-learns a decision tree that uses the to-be-discretized variable for representing the conditional probability distributions of the target variable given the to-be-discretized variable. Once the Tree is learned, it is analyzed to extract the most useful thresholds. This is the method of choice in the context of Supervised Learning, i.e., when planning to machine-learn a model to predict the target variable.
At the same time, we do not recommend using Tree in the context of Unsupervised Learning. The Tree algorithm creates bins that are biased toward the designated target variable. Naturally, emphasizing one particular variable would run counter to the intent of Unsupervised Learning.
Note that if the to-be-discretized variable is independent of the target variable, it will be impossible to build a tree, and BayesiaLab will prompt the selection of a univariate discretization algorithm.
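The principle behind Tree discretization, choosing thresholds that are maximally informative about the target, can be sketched with a single-split decision “stump” that maximizes information gain. This is a simplification of the actual algorithm, shown on toy data:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in {l: labels.count(l) for l in set(labels)}.values())

def best_threshold(values, labels):
    """One split of a decision 'stump': the threshold on `values` that
    maximizes information gain with respect to the target labels."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain, best = gain, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best

# Toy data: the target flips around value 10, so the threshold lands there.
values = [1, 2, 3, 9, 11, 18, 20, 25]
labels = ["low", "low", "low", "low", "high", "high", "high", "high"]
print(best_threshold(values, labels))  # 10.0
```

A full tree repeats this search recursively, yielding several thresholds per variable.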
In this example, we focus our analysis on SalePrice, which can be considered a type of Supervised Learning. Therefore, we discretize all continuous variables with the Tree algorithm, using SalePrice as the Target variable. Note the Target must either be a Discrete variable or a Continuous variable that has already been manually discretized, which is the case for SalePrice.
Clicking Finish completes the import process.
The import process concludes with a pop-up window that offers to display the Import Report.
Clicking Yes brings up the Import Report, which can be saved in HTML format. It lists the discretization intervals of the Continuous variables, the States of the Discrete variables, and the discretization method used for each variable.
Once we close out this report, we can see the result of the import process. All the imported variables are now represented as nodes on the Graph Panel. The dashed borders of some nodes indicate that the corresponding variables were discretized during data import.
The lack of warning icons on any nodes indicates that all their parameters, i.e., their marginal probability distributions, were automatically estimated upon data import.
To verify, we open the Node Editor of SalePrice (Node Context Menu > Edit > Probability Distribution > Probabilistic
) and check the node’s marginal distribution.
Clicking on the Occurrences tab shows the observations per cell, which were used for the Maximum Likelihood Estimation of the marginal distribution.
The following animation shows all the above steps in a continuous workflow.
This chapter presents a workflow for encoding expert knowledge and subsequently performing omnidirectional probabilistic inference in the context of a real-world reasoning problem. While Chapter 1 provided a general motivation for using Bayesian networks as an analytics framework, this chapter highlights the perhaps unexpected relevance of Bayesian networks for reasoning in everyday life. The example shows that “common-sense” reasoning can be rather tricky. On the other hand, encoding “common-sense knowledge” in a Bayesian network turns out to be uncomplicated. We want to demonstrate that reasoning with Bayesian networks can be as straightforward as doing arithmetic with a spreadsheet.
It is presumably fair to state that reasoning in complex environments creates cognitive challenges for humans. Adding uncertainty to our observations of the problem domain, or even considering the uncertainty regarding the structure of the domain itself, makes matters worse. When uncertainty blurs so many premises, it can be particularly difficult to find a common reasoning framework for a group of stakeholders.
If we had hard observations from our domain in the form of data, it would be quite natural to build a traditional analytic model for decision support. However, the real world often yields only fragmented data or no data at all. It is not uncommon that we merely have the opinions of individuals who are more or less familiar with the problem domain.
In the business world, spreadsheets are typically used to model the relationships between variables in a problem domain. Also, in the absence of hard observations, it is reasonable that experts provide assumptions instead of data. Any such expert knowledge is typically encoded as single-point estimates and formulas. However, using single values and formulas instantly oversimplifies the problem domain: firstly, the variables, and the relationships between them, become deterministic; secondly, the left-hand side versus right-hand side nature of formulas restricts inference to only one direction.
Since spreadsheet cells and formulas are deterministic and work only with single-point values, they are well suited for encoding “hard” logic but not at all for “soft” probabilistic knowledge that includes uncertainty. As a result, any uncertainty has to be addressed with workarounds, often in the form of trying out multiple scenarios or by working with simulation add-ons.
The lack of omnidirectional inference, however, may be the bigger issue in spreadsheets. As soon as we create a formula linking two cells in a spreadsheet, e.g., B1=function(A1), we preclude any evaluation in the opposite direction, from B1 to A1.
Assuming that A1 is the cause and B1 is the effect, we can indeed use a spreadsheet for inference in the causal direction, i.e., perform a simulation. However, even if we were certain about the causal direction between them, unidirectionality would remain a concern. For instance, if we were only able to observe the effect B1, we could not infer the cause A1, i.e., we could not perform a diagnosis from effect to cause. The one-way nature of spreadsheet computations prevents this.
Bayesian networks are probabilistic by default and handle uncertainty “natively.” A Bayesian network model can work directly with probabilistic inputs and probabilistic relationships and deliver correctly computed probabilistic outputs. Also, whereas traditional models and spreadsheets are of the form y=f(x), Bayesian networks do not have to distinguish between independent and dependent variables. Rather, a Bayesian network represents the entire joint probability distribution of the system under study. This representation facilitates omnidirectional inference, which we typically require for reasoning about a complex problem domain, such as the example in this chapter.
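To make this concrete, here is a minimal Python sketch (with made-up numbers, not tied to any example in this chapter) of how a joint probability distribution supports queries in any direction, whereas a spreadsheet formula B1=function(A1) only computes forward:

```python
# A two-variable joint distribution P(A, B) over binary states.
# The numbers are hypothetical, for illustration only.
joint = {
    ("a0", "b0"): 0.30, ("a0", "b1"): 0.10,
    ("a1", "b0"): 0.15, ("a1", "b1"): 0.45,
}

def query(target_index, target_state, evidence_index=None, evidence_state=None):
    """P(target | evidence), computed directly from the joint distribution."""
    num = den = 0.0
    for states, p in joint.items():
        if evidence_index is not None and states[evidence_index] != evidence_state:
            continue
        den += p
        if states[target_index] == target_state:
            num += p
    return num / den

# Forward ("simulation"): P(B = b1 | A = a1)
print(query(1, "b1", 0, "a1"))   # 0.45 / 0.60 = 0.75
# Backward ("diagnosis"): P(A = a1 | B = b1) -- not expressible as a spreadsheet formula
print(query(0, "a1", 1, "b1"))   # 0.45 / 0.55 ≈ 0.818
```

Because the entire joint distribution is represented, neither variable is designated as "dependent"; the same table answers questions in both directions.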
While most other examples in this book resemble proper research topics, we present a rather casual narrative to introduce probabilistic reasoning with Bayesian networks. It is a common situation taken straight from daily life, for which a “common-sense interpretation” may appear more natural than our proposed formal approach. As we shall see, dealing formally with informal knowledge provides a robust basis for reasoning under uncertainty.
Most travelers will be familiar with the following hypothetical situation or something fairly similar: You are traveling from Singapore to Los Angeles and need to make a flight connection in Tokyo. Your first flight segment from Singapore to Tokyo is significantly delayed, and you arrive in Tokyo with barely enough time to make the connection. You have to run from Terminal 1, where you just landed, to Terminal 2, where your flight to Los Angeles will depart. The boarding process is already underway by the time you get to the departure gate for Los Angeles.
Out of breath, you check in with the gate agent, who informs you that the luggage you checked in at Singapore may or may not make the connection. She apologetically states there is only a 50/50 chance you will get your bag upon arrival at your destination airport, Los Angeles.
Once you have landed in Los Angeles, you head straight to the baggage claim and wait for the first pieces of luggage to appear on the baggage carousel. Bags come down the chute onto the carousel at a steady rate. After five minutes of watching fellow travelers retrieve their luggage, you wonder what the chances are that you will ultimately get your bag. You reason that if the bag had indeed made it onto the plane, it would be increasingly likely to appear among the remaining pieces to be unloaded. However, you do not know for sure that your piece was actually on the plane. Then, you think, you better get in line to file a claim at the baggage office. Is that reasonable? How should you update your expectations about getting your bag as you wait?
As you contemplate your next move, you see a colleague picking up his suitcase. As it turns out, your colleague was traveling on the very same itinerary as you, i.e., Singapore – Tokyo – Los Angeles. His luggage made it, so you conclude that you better wait at the carousel for the very last piece to be delivered. How does the observation of your colleague’s suitcase change your belief in the arrival of your bag? Does all that even matter? After all, the bag either made the connection or not. The fact that you now observe something after the fact cannot influence what happened earlier, right?
This problem domain can be modeled with a causal Bayesian network using only a few common-sense assumptions. We demonstrate how to combine different pieces of available, but uncertain, knowledge into a network model. Our objective is to calculate the correct degree of belief in the arrival of your luggage as a function of time and your own observations.
Per our narrative, we obtain the first piece of information from the gate agent in Tokyo who manages the departure to Los Angeles. She says there is a 50/50 chance that your bag is on the plane. More formally, we express this as:
P(Your Bag on Plane = True) = 0.5
We encode this probabilistic knowledge in a Bayesian network by creating a node. In BayesiaLab, we click the Node Creation Mode icon and then point to the desired position on the Graph Panel.
Once the node is in place, we update its name to “Your Bag on Plane” by double-clicking the default name N1. Then, by double-clicking the node itself, we open BayesiaLab’s Node Editor. Under the tab Probability Distribution > Probabilistic, we define the probability that Your Bag on Plane=True, which is 50%, as per the gate agent’s statement. Given that these probabilities do not depend on any other variables, we speak of marginal probabilities. Note that in BayesiaLab, probabilities are always expressed as percentages:
Assuming there is no other opportunity for losing luggage within the destination airport, your chance of ultimately receiving your bag should be identical to the probability of your bag being on the plane, i.e., on the flight segment to your final destination airport. More simply, if it is on the plane, then you will get it:
P(Your Bag on Carousel = True | Your Bag on Plane = True) = 1
P(Your Bag on Carousel = False | Your Bag on Plane = True) = 0
Conversely, the following must hold too:
P(Your Bag on Carousel = False | Your Bag on Plane = False) = 1
P(Your Bag on Carousel = True | Your Bag on Plane = False) = 0
We now encode this knowledge into our network. We add a second node, Your Bag on Carousel, and then click the Arc Creation Mode icon. Next, we click and hold the cursor on Your Bag on Plane, drag the cursor to Your Bag on Carousel, and finally release. This produces a simple, manually specified Bayesian network:
The yellow warning triangle indicates that probabilities need to be defined for the node Your Bag on Carousel. Unlike the previous instance, where we only had to enter marginal probabilities, we now need to define the probabilities of the states of the node Your Bag on Carousel conditional on the states of Your Bag on Plane. In other words, we need to fill the Conditional Probability Table to quantify this parent-child relationship. We open the Node Editor and enter the values from the equations above.
Now we add another piece of contextual information that has not been mentioned yet in our story. From the baggage handler who monitors the carousel, you learn that 100 pieces of luggage in total were on your final flight segment, from the hub to the destination. After you wait for one minute, 10 bags have appeared on the carousel, and they keep coming out at a very steady rate. However, yours is not among the first ten that were delivered in the first minute. At the current rate, it would now take 9 more minutes for all bags to be delivered to the baggage carousel.
Given that your bag was not delivered in the first minute, what is your new expectation of ultimately getting your bag? How about after the second minute of waiting? Quite obviously, we need to introduce a time variable into our network. We create a new node Time and define discrete time intervals [0,...,10] to serve as its states.
By default, all new nodes initially have two states, True and False. We can see this by opening the Node Editor and selecting the States tab:
By clicking the Generate States button, we create the states we need for our purposes. Here, we define 11 states, starting at 0 and increasing in steps of 1:
The Node Editor now shows the newly-generated states:
Beyond defining the states of Time, we also need to define their marginal probability distribution. For this, we select the tab Probability Distribution > Probabilistic. Naturally, no time interval is more probable than another one, so we should apply a uniform distribution across all states of Time. BayesiaLab provides a convenient shortcut for this purpose. Clicking the Normalize button places a uniform distribution across all cells, i.e., 9.091% per cell.
Once Time is defined, we draw an arc from Time to Your Bag on Carousel. By doing so, we introduce a causal relationship, stating that Time influences the status of your bag.
The warning triangle once again indicates that we need to define further probabilities concerning Your Bag on Carousel. We open the Node Editor to enter these probabilities into the Conditional Probability Table:
Note that the probabilities of the states True and False now depend on two parent nodes. For the upper half of the table, it is still quite simple to establish the probabilities. If the bag is not on the plane, it will not appear on the baggage carousel under any circumstance, regardless of Time. Hence, we set False to 100 (%) for all rows in which Your Bag on Plane=False.
However, given that Your Bag on Plane=True, the probability of seeing it on the carousel depends on the time elapsed. Now, what is the probability of seeing your bag at each time step? Assuming that all luggage is shuffled extensively through the loading and unloading processes, there is a uniform probability that the bag is anywhere in the pile of luggage to be delivered to the carousel. As a result, there is a 10% chance that your bag is delivered in the first minute, i.e., within the first batch of 10 out of 100 luggage pieces. Over a period of two minutes, there is a 20% probability that the bag arrives, and so on. Only when the last batch of 10 bags remains undelivered can we be certain that your bag is in the final batch, i.e., there is a 100% probability of the state True in the tenth minute. We can now fill out the Conditional Probability Table in the Node Editor with these values. Note that we only need to enter the values in the True column and then highlight the remaining empty cells. Clicking Complete prompts BayesiaLab to automatically fill in the False column to achieve a row sum of 100%:
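For readers who prefer to verify the table programmatically, the following Python sketch (an illustration, not part of BayesiaLab) reconstructs this Conditional Probability Table from the "uniform delivery" assumption:

```python
# Build the Conditional Probability Table for Your Bag on Carousel,
# conditional on Your Bag on Plane and Time (minutes 0..10).
# If the bag is on the plane, P(True) grows linearly with time: t/10.
# If it is not on the plane, P(True) is 0 regardless of Time.
cpt = {}
for on_plane in (True, False):
    for t in range(11):
        p_true = t / 10 if on_plane else 0.0
        cpt[(on_plane, t)] = {"True": p_true, "False": 1.0 - p_true}

# Every row must sum to 100%, as BayesiaLab enforces via its Complete button.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in cpt.values())

print(cpt[(True, 1)])   # {'True': 0.1, 'False': 0.9}
print(cpt[(True, 10)])  # {'True': 1.0, 'False': 0.0}
```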
Now we have a fully specified Bayesian network, which we can evaluate immediately.
BayesiaLab’s Validation Mode provides the tools for using the Bayesian network we built for omnidirectional inference. We switch to the Validation Mode via the corresponding icon in the lower left-hand corner of the main window, or via the keyboard shortcut F5:
Upon switching to this mode, we double-click on all three nodes to bring up their associated Monitors, which show the nodes’ current marginal probability distributions. We find these Monitors inside the Monitor Panel on the right-hand side of the main window:
If we filled the Conditional Probability Table correctly, we should now be able to validate at least the trivial cases straight away, e.g. for Your Bag on Plane=False.
Inference from Cause to Effect: Your Bag on Plane=False
We perform inference by setting such evidence via the corresponding Monitor in the Monitor Panel. We double-click the bar that represents the State False:
The setting of the evidence turns the node and the corresponding bar in the Monitor green:
The Monitor for Your Bag on Carousel shows the result. The small gray arrows overlaid on top of the horizontal bars furthermore indicate how the probabilities have changed by setting this most recent piece of evidence:
Indeed, your bag could not possibly be on the carousel because it was not on the plane in the first place. The inference we performed here is indeed trivial, but it is reassuring to see that the Bayesian network properly “plays back” the knowledge we entered earlier.
Omnidirectional Inference: Your Bag on Carousel=False, Time=1
The next question, however, typically goes beyond our intuitive reasoning capabilities. We wish to infer the probability that your bag made it onto the plane, given that we are now in minute 1, and the bag has not yet appeared on the carousel. This inference is tricky because we now have to reason along multiple paths in our network.
The first path is from Your Bag on Carousel to Your Bag on Plane. This type of reasoning from effect to cause is more commonly known as diagnosis. More formally, we can write:
P(Your Bag on Plane = True | Your Bag on Carousel = False)
The second reasoning path is from Time via Your Bag on Carousel to Your Bag on Plane. Once we condition on Your Bag on Carousel, i.e. by observing the value, we open this path, and information can flow from one cause, Time, via the common effect, Your Bag on Carousel, to the other cause, Your Bag on Plane. Hence, we speak of “inter-causal reasoning” in this context. The specific computation task is:
P(Your Bag on Plane = True | Your Bag on Carousel = False, Time = 1)
How do we go about computing this probability? We do not attempt to perform this computation ourselves. Rather, we rely on the Bayesian network we built and BayesiaLab’s exact inference algorithms. However, before we can perform this inference computation, we need to remove the previous piece of evidence, i.e., Your Bag on Plane=False. We do this by right-clicking the relevant node and then selecting Remove Evidence from the Contextual Menu. Alternatively, we can remove all evidence by clicking the Remove All Observations icon.
Then, we set the new observations via the Monitors in the Monitor Panel. The inference computation then happens automatically.
Given that you do not see your bag in the first minute, the probability that your bag made it onto the plane is now no longer at the marginal level of 50% but is reduced to 47.37%.
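We can check this result by hand with Bayes' theorem. The Python sketch below reproduces the computation under the assumptions encoded in the network (50/50 prior, uniform delivery over ten minutes):

```python
# Bayes' theorem applied to the carousel scenario: prior 50/50 from the gate
# agent; if the bag is on the plane, it appears on the carousel uniformly
# within 10 minutes. A verification sketch, not BayesiaLab's algorithm.
def p_on_plane_given_not_seen(t):
    """P(Your Bag on Plane = True | Your Bag on Carousel = False, Time = t)."""
    prior = 0.5
    likelihood_true = 1 - t / 10   # P(not seen by minute t | bag on plane)
    likelihood_false = 1.0         # P(not seen by minute t | bag not on plane)
    numerator = prior * likelihood_true
    return numerator / (numerator + (1 - prior) * likelihood_false)

print(round(p_on_plane_given_not_seen(1), 4))  # 0.4737, matching BayesiaLab
```

Evaluating the function for larger values of t shows how the belief declines minute by minute, all the way to 0 at minute ten.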
Continuing with this example, how about if the bag has not shown up in the second minute, in the third minute, etc.? We can use one of BayesiaLab’s built-in visualization functions to analyze this automatically. To prepare the network for this type of analysis, we first need to set a Target Node, which, in our case, is Your Bag on Plane. Upon right-clicking this node, we select Set as Target Node. Alternatively, we can double-click the node, or one of its states in the corresponding Monitor, while holding T.
Upon setting the Target Node, Your Bag on Plane is marked with a bullseye symbol. Also, the corresponding Monitor is now highlighted in red. Before we continue, however, we need to remove the evidence from the Time Monitor. We do so by right-clicking the Monitor and selecting Remove Evidence from the Contextual Menu.
Then, we select Analysis > Visual > Influence Analysis on Target Node.
The resulting graph shows the probabilities of receiving your bag as a function of the discrete time steps. To see the progression of the True state, we select the corresponding tab at the top of the window.
Continuing with our narrative, you now notice a colleague of yours in the baggage claim area. As it turns out, your colleague was traveling on the same itinerary as you, i.e., Singapore – Tokyo – Los Angeles, so he had to make the same tight connection. Unlike you, he has already retrieved his bag from the carousel. You assume that his luggage being on the airplane is not independent of your luggage being on the same plane, so you take this as a positive sign. How do we formally integrate this assumption into our existing network?
To encode any new knowledge, we first need to switch back to the Modeling Mode (F4). Then, we duplicate the existing nodes Your Bag on Plane and Your Bag on Carousel by copying and pasting them into the same Graph Panel using the common shortcuts, Ctrl+C and Ctrl+V.
In the copy process, BayesiaLab prompts us for a Copy Format, which would only be relevant if we intended to paste the selected portion of the network into another application, such as PowerPoint. As we paste the copied nodes into the same Graph Panel, the format does not matter.
Upon pasting, by default, the new nodes have the same names as the original ones plus the suffix “[1]”.
Next, we reposition the nodes on the Graph Panel and rename them to show that the new nodes relate to your colleague’s situation, rather than yours. To rename the nodes we double-click the Node Names and overwrite the existing label.
The next assumption is that your colleague’s bag is subject to exactly the same forces as your luggage. More specifically, the successful transfer of your and his luggage is a function of how many bags could be processed in Tokyo given the limited transfer time. To model this, we introduce a new node and name it Transit.
We create 7 states of ten-minute intervals for this node, which reflect the amount of time available for the transfer, i.e., from 0 to 60 minutes.
Furthermore, we set the probability distribution for Transit. For expository simplicity, we apply a uniform distribution using the Normalize button.
Now that the Transit node is defined, we can draw the arcs connecting it to Your Bag on Plane and Colleague’s Bag on Plane.
The yellow warning triangles indicate that the Conditional Probability Tables of Your Bag on Plane and Colleague’s Bag on Plane have yet to be filled. Thus, we need to open the Node Editor and set these probabilities. We will assume that the probability of your bag making the connection is 0% given a Transit time of 0 minutes and 100% with a Transit time of 60 minutes. Between those values, the probability of a successful transfer increases linearly with time.
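The linear assumption can be written down compactly. The following Python snippet (an illustration with the chapter's numbers) generates the conditional probabilities and confirms that, averaged over a uniform Transit prior, the marginal probability of a successful transfer is again 50%, consistent with the gate agent's statement:

```python
# Linear transfer assumption: P(bag makes the connection | Transit = m)
# rises from 0% at m = 0 minutes to 100% at m = 60 minutes.
p_bag_on_plane = {m: m / 60 for m in range(0, 70, 10)}

print(p_bag_on_plane[0], p_bag_on_plane[30], p_bag_on_plane[60])  # 0.0 0.5 1.0

# With a uniform prior over the seven Transit states, the marginal
# probability that a bag is on the plane works out to 50%.
marginal = sum(p_bag_on_plane.values()) / 7
print(round(marginal, 10))  # 0.5
```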
The very same function also applies to your colleague’s bag, so we enter the same conditional probabilities for the node Colleague’s Bag on Plane by copying and pasting the previously entered table.
Now that the probabilities are defined, we switch to the Validation Mode (F5); our updated Bayesian network is ready for inference again.
We simulate a new scenario to test this new network. For instance, we move to the fifth minute and set evidence that your bag has not yet arrived.
Given these observations, the probability of Your Bag on Plane=True is now 33.33%. Interestingly, the probability of Colleague’s Bag on Plane has also changed. As evidence propagates omnidirectionally through the network, our two observed nodes do indeed influence Colleague’s Bag on Plane. A further iteration of the scenario in our story is that we observe Colleague’s Bag on Carousel=True, also in the fifth minute.
Given the observation of Colleague’s Bag on Carousel, even though we have not yet seen Your Bag on Carousel, the probability of Your Bag on Plane increases to 56.52%. Indeed, this observation should change your expectation quite a bit. The small gray arrows on the blue bars inside the Monitor for Your Bag on Plane indicate the impact of this observation.
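Both numbers can be checked by brute-force enumeration of the joint distribution. The Python sketch below (an illustration of what exact inference computes, not BayesiaLab's actual junction-tree algorithm) sums over all combinations of Transit and the two "on plane" variables:

```python
from itertools import product

# Brute-force inference over the joint distribution of the extended network.
transit_probs = [m / 60 for m in range(0, 70, 10)]  # P(bag on plane | Transit)
T = 5  # we are in minute five

def posterior_your_bag_on_plane(colleague_bag_seen):
    """P(Your Bag on Plane = True | Your Bag on Carousel = False, ...)."""
    num = den = 0.0
    for q, yours, colleagues in product(transit_probs, (0, 1), (0, 1)):
        p = 1 / 7                               # uniform prior over Transit states
        p *= q if yours else 1 - q              # Your Bag on Plane
        p *= q if colleagues else 1 - q         # Colleague's Bag on Plane
        p *= 1 - T / 10 if yours else 1.0       # your bag NOT on carousel by minute 5
        if colleague_bag_seen:
            p *= T / 10 if colleagues else 0.0  # colleague's bag already delivered
        den += p
        if yours:
            num += p
    return num / den

print(round(posterior_your_bag_on_plane(False), 4))  # 0.3333, as in BayesiaLab
print(round(posterior_your_bag_on_plane(True), 4))   # 0.5652
```

Observing the colleague's suitcase flows "upward" to Transit and back "down" to Your Bag on Plane, which is exactly the omnidirectional propagation described above.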
After removing the evidence from the Time Monitor, we can perform Influence Analysis on Target again in order to see the probability of Your Bag on Plane=True as a function of Time, given Your Bag on Carousel=False and Colleague’s Bag on Carousel=True. To focus our analysis on Time alone, we select the Time node and then select Analysis > Visual > Influence Analysis on Target.
As before, we select the True tab in the resulting window to see the evolution of probabilities given Time.
This chapter provided a brief introduction to knowledge modeling and evidential reasoning with Bayesian networks in BayesiaLab. Bayesian networks can formally encode available knowledge, deal with uncertainties, and perform omnidirectional inference. As a result, we can properly reason about a problem domain despite many unknowns.
We start with a pair of nodes, namely Neighborhood and SalePrice. As opposed to LotArea, which is a discretized Continuous variable, Neighborhood is categorical, and, as such, it has been automatically treated as Discrete in BayesiaLab. This is the reason the node corresponding to Neighborhood has a solid border. We now add an arc between these two nodes to explicitly represent the dependency between them:
Counting all records, we obtain the marginal count of each state of Neighborhood.
Given that our Bayesian network structure says that Neighborhood is the parent node of SalePrice, we now count the states of SalePrice conditional on Neighborhood. This is simply a cross-tabulation.
Once we translate these counts into probabilities (by normalizing by the total number of occurrences for each row in the table), this table becomes a CPT. Together, the network structure (qualitative) and the CPTs (quantitative) comprise the Bayesian network.
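As a sketch of this counting-and-normalizing step, with made-up counts that are not taken from the Ames data:

```python
# Row-normalizing a cross-tabulation of counts into a CPT.
# The counts below are hypothetical, for illustration only.
counts = {
    "NAmes":   {"SalePrice<=75000": 5,  "75000<SalePrice<=150000": 150, "SalePrice>150000": 45},
    "OldTown": {"SalePrice<=75000": 30, "75000<SalePrice<=150000": 60,  "SalePrice>150000": 10},
}

# Divide each count by its row total to obtain conditional probabilities.
cpt = {
    neighborhood: {state: n / sum(row.values()) for state, n in row.items()}
    for neighborhood, row in counts.items()
}

print(cpt["NAmes"]["75000<SalePrice<=150000"])  # 150 / 200 = 0.75
```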
In practice, however, we do not need to bother with these individual steps. Rather, BayesiaLab can automatically learn all marginal and conditional probabilities from the associated database. We select Main Menu > Learning > Parameter Estimation
to perform this task.
This means that by knowing Neighborhood, we reduce our uncertainty regarding SalePrice by 32% on average. By knowing SalePrice, we reduce our uncertainty regarding Neighborhood by 14% on average. These values are readily interpretable. However, we need to know this for all nodes to determine which node is most important.
Furthermore, we can see icons that indicate the presence of Missing Values and Filtered Values in the respective nodes.
Now we can see the real benefit of bringing all variables as nodes into BayesiaLab. To calculate the Mutual Information, all the terms of the equation can easily be computed with BayesiaLab once we have a fully specified network.
The yellow warning triangle reminds us that the Conditional Probability Table (CPT) of SalePrice given Neighborhood has not been defined yet. In the previous chapter, we defined the CPT based on existing knowledge. Here, on the other hand, we have an associated database, so BayesiaLab can use it to estimate the CPT via Maximum Likelihood Estimation, i.e., BayesiaLab “counts” the (co-)occurrences of the states of the variables in our data. The table below shows the first 10 records of the variables SalePrice and Neighborhood from the Ames dataset.
Upon completing the Parameter Estimation, the warning triangle has disappeared, and we can verify the results by double-clicking SalePrice to open the Node Editor. Under the tab Probability Distribution > Probabilistic
we can see the probabilities of the states of SalePrice given Neighborhood. The CPT presented in the Node Editor is indeed identical to the table shown above.
This model now provides the basis for computing the Mutual Information between Neighborhood and SalePrice. BayesiaLab computes Mutual Information on demand and can display its value in numerous ways. For instance, in Validation Mode (F5), we can select Main Menu > Analysis > Visual > Arc > Overall > Mutual Information.
The value of the Mutual Information is now represented graphically in the thickness of the arc. This does not give us much insight because we only have a single arc in this network. So, we click the Show Arc Comments icon in the Toolbar to show the numerical values.
The top number in the Arc Comment box shows the actual value, i.e., 0.6462 bits. We should also point out that Mutual Information is a symmetric measure. As such, the amount of Mutual Information that Neighborhood provides on SalePrice is the same as the amount of MI that SalePrice provides with regard to Neighborhood. This means that knowing the SalePrice reduces the uncertainty with regard to Neighborhood, even though that may not be of interest.
Without context, however, the value of Mutual Information is not meaningful. Hence, BayesiaLab provides an additional measure, i.e., the Relative Mutual Information, which gives us a sense of how much the entropy of SalePrice was reduced. Previously, we computed the marginal entropy of SalePrice to be 1.85. Dividing the Mutual Information by the Marginal Entropy of SalePrice tells us by what proportion our uncertainty is reduced:
Conversely, the red number shows the Relative Mutual Information with regard to the parent node, Neighborhood. Here, we divide the Mutual Information, which is the same in both directions, by the Marginal Entropy of Neighborhood:
Now that we have the Ames dataset represented internally in BayesiaLab, we need to become familiar with how BayesiaLab can quantify the probabilistic properties of these nodes and their relationships.
In traditional statistical analysis, we would presumably examine correlation and covariance between the variables to establish their relative importance, especially regarding the target variable Sale Price. In this chapter, we take an alternative approach based on information theory. Instead of computing the correlation coefficient, we consider how the uncertainty of the states of a to-be-predicted variable is affected by observing a predictor variable.
Beyond our common-sense understanding of uncertainty, there is a more formal quantification of uncertainty in information theory: Entropy. More specifically, we use Entropy to quantify the uncertainty manifested in the probability distribution of a variable or of a set of variables. In the context of our example, the uncertainty relates to the to-be-predicted home price.
It is fair to say that we would need detailed information about a property to predict its value reasonably. However, in the absence of any specific information, would we be entirely uncertain about its value? Probably not. Even if we did not know anything about a particular house, we would have some contextual knowledge, i.e., that the house is in Ames, Iowa, rather than in midtown Manhattan, and that the property is a private home rather than a shopping mall. That knowledge significantly reduces the range of possible values. True uncertainty would mean that a value of $0.01 is as probable as a value of $1 million or $1 billion. That is clearly not the case here. So, how uncertain are we about the value of a random home in Ames prior to learning anything about that particular home? The answer is that we can compute the entropy from the marginal probability distribution of home values in Ames. Since we have the Ames dataset already imported into BayesiaLab, we can display a histogram of SalePrice by bringing up its Monitor.
This Monitor reflects the discretization intervals that we defined during the data import. It is now easy to see the frequency of prices in each price interval, i.e., the marginal distribution of SalePrice. For instance, only about 2% of homes sold had a price of $75,000 or less. On the basis of this probability distribution, we can now compute the Entropy. The definition of Entropy for a discrete distribution is:

H(X) = −Σx P(x) × log2(P(x))
Entering the values displayed in the Monitor, we obtain a Marginal Entropy of approximately 1.85 bits.
In information theory, the unit of information is the “bit,” which is why we use the base-2 logarithm. On its own, the calculated Entropy value of 1.85 bits may not be a meaningful measure. To understand how much or how little uncertainty this value represents, we compare it to two easily interpretable measures, i.e., “no uncertainty” and “complete uncertainty.”
No uncertainty means that the probability of one bin (or state) of SalePrice is 100%. This could be, for instance, P(SalePrice<=75000)=1.
We now compute the entropy of this distribution once again:

H(SalePrice) = −1 × log2(1) − 4 × (0 × log2(0)) = 0

Here, 0 × log2(0) is taken as 0, given that p × log2(p) tends to 0 as p approaches 0.
This means that “no uncertainty” has zero Entropy.
What about the opposite end of the spectrum, i.e., complete uncertainty? Maximum uncertainty exists when all possible states of a distribution are equally probable, i.e., when we have a uniform distribution:

P(SalePrice = si) = 1/5, for each of the five states si
Once again, we calculate the entropy:

H(SalePrice) = −5 × (1/5) × log2(1/5) = log2(5) ≈ 2.3219
The value 5 in the logarithm of the simplified equation reflects the number of states. This means that the Entropy is a function of the variable discretization. In addition to the previously computed Marginal Entropy of 1.85, we now have the values 0 and 2.3219 for “no uncertainty” and “complete uncertainty,” respectively.
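These two reference points are easy to reproduce in code. The Python sketch below (a plain illustration, independent of BayesiaLab) computes the entropy of a discrete distribution:

```python
from math import log2

# Entropy of a discrete distribution, in bits. Reproduces the two reference
# points: zero Entropy for certainty, log2(5) for a uniform distribution
# over five states.
def entropy(probs):
    """H = -sum p * log2(p), with 0 * log2(0) taken as 0."""
    h = -sum(p * log2(p) for p in probs if p > 0)
    return h if h > 0 else 0.0

print(entropy([1.0, 0.0, 0.0, 0.0, 0.0]))  # 0.0 -> "no uncertainty"
print(round(entropy([0.2] * 5), 4))        # 2.3219 -> "complete uncertainty"
```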
How do such entropy values help us to establish the importance of predictive variables? If there is no uncertainty regarding a variable, one state of this variable has to have a 100% probability, and predicting that particular state must be correct. This would be like predicting the presence of clouds during rain. On the other hand, if the probability distribution of the target variable is uniform, e.g., the outcome of a fair coin toss, a random prediction has to be correct with a probability of 50%.
In the context of house prices, knowing the marginal distribution of SalePrice and assuming this distribution is still true when we make the prediction, predicting SalePrice>=150000 would have a 41.28% probability of being correct, even if we knew nothing else. However, we would expect that observing an attribute of a specific home would reduce our uncertainty concerning its SalePrice and increase our probability of making a correct prediction for this particular home. In other words, conditional upon learning an attribute of a home, i.e., by observing a predictive variable, we expect a lower uncertainty for the target variable, SalePrice.
For instance, the moment we learn of a particular home that LotArea=200,000 (measured in square feet; 200,000 ft² ≈ 4.6 acres ≈ 18,581 m² ≈ 1.86 ha), and assuming, again, that the estimated marginal distribution is still true when we are making the prediction, we can be certain that SalePrice>300000. This means that upon learning the value of this home’s LotArea, the entropy of SalePrice goes from 1.85 to 0. Learning the size reduces our entropy by 1.85 bits. Alternatively, we can say that we gain information amounting to 1.85 bits.
The information gain or entropy reduction from learning about LotArea of this house is obvious. Observing a different home with a more common lot size, e.g., LotArea=10,000, would presumably provide less information and, thus, have less predictive value for that home.
However, we wish to know how much information we would gain on average—considering all values of LotArea along with their probabilities—by generally observing it as a predictive variable for SalePrice. Knowing this “average information gain” would reflect the predictive importance of observing the variable LotArea.
To compute this, we need two quantities: first, the marginal entropy of the target variable, H(SalePrice), and second, the conditional entropy of the target variable given the predictive variable, H(SalePrice | LotArea).
The difference between the marginal entropy of the target variable and the conditional entropy of the target given the predictive variable is formally known as Mutual Information, denoted by I. In our example, the Mutual Information between SalePrice and LotArea is the marginal entropy of SalePrice minus the conditional entropy of SalePrice given LotArea:

I(SalePrice; LotArea) = H(SalePrice) − H(SalePrice | LotArea)
More generally, the Mutual Information I between variables X and Y is defined by:

I(X; Y) = Σx Σy P(x, y) × log2( P(x, y) / (P(x) × P(y)) )

which is equivalent to:

I(X; Y) = H(X) − H(X | Y)

and furthermore also equivalent to:

I(X; Y) = H(Y) − H(Y | X)
This allows us to compute the Mutual Information between a target variable and any possible predictors. As a result, we can find out which predictor provides the maximum information gain and, thus, has the greatest predictive importance.
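The equivalence of these definitions can be verified numerically. The following Python sketch uses a small joint distribution with hypothetical numbers (not from the Ames data) and checks that the direct definition agrees with the entropy-difference form:

```python
from math import log2

# Mutual Information from a small joint distribution (hypothetical numbers),
# verifying the identity I(X;Y) = H(X) - H(X|Y).
joint = {("x0", "y0"): 0.30, ("x0", "y1"): 0.10,
         ("x1", "y0"): 0.15, ("x1", "y1"): 0.45}

def marginal(index):
    dist = {}
    for states, p in joint.items():
        dist[states[index]] = dist.get(states[index], 0.0) + p
    return dist

px, py = marginal(0), marginal(1)

# Direct definition: I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mi = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

# Equivalent form: marginal entropy minus conditional entropy
h_x = -sum(p * log2(p) for p in px.values())
h_x_given_y = -sum(p * log2(p / py[y]) for (x, y), p in joint.items())

print(round(mi, 4))                          # ≈ 0.1815 bits
assert abs(mi - (h_x - h_x_given_y)) < 1e-9  # both definitions agree
```

The symmetry of the definition also makes plain why the Mutual Information between a predictor and the target is the same in both directions.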
Rather than computing the relationships individually for each pair of nodes, we ask BayesiaLab to estimate a Naive Bayes network. A Naive Bayes structure is a network with only one parent, the Target Node, i.e., the only arcs in the graph are those directly connecting the Target Node to a set of nodes. By designating SalePrice as the Target Node, we can automatically compute its Mutual Information with all other available nodes.
For the node SalePrice, we select Node Context Menu > Set as Target Node. Alternatively, we can double-click the node while pressing T.
Strictly speaking, we are not learning a network in the true sense of machine learning. Rather, we are specifying a naive structure, i.e., arcs from the Target Node to all other nodes, and then estimating the parameters.
Due to its simplicity, the Naive Bayes network is presumably the most commonly used Bayesian network. As a result, we find it implemented in many software packages. For instance, the so-called Bayesian anti-spam systems are based on this model.
However, it is important to note that the Naive Bayes network is merely the first step towards embracing the Bayesian network paradigm.
Now we have a network that allows us to compute the Mutual Information between the Target Node and all other nodes.
The different levels of Mutual Information are now reflected in the thickness of the arcs.
However, given the grid layout of the nodes and the overlapping arcs, it is difficult to establish a rank order of the nodes in terms of Mutual Information. To address this, we adjust the layout by selecting Main Menu > View > Layout > Radial Layout.
Also, having run the Radial Layout while the Arc Mutual Information function was still active, the arcs and nodes are ordered clockwise from strongest to weakest Mutual Information.
The following standalone graphic highlights the order of the arcs and nodes in this Naive Bayes network:
This illustration shows that Neighborhood provides the highest amount of Mutual Information and, at the opposite end of the range, RoofMtl (Roof Material) the least.
As an alternative to this visualization, we can run a report via Main Menu > Analysis > Report > Relationship:
Wouldn’t this report look the same if computed based on correlation? In fact, the rightmost column in this Relationship Analysis Report shows Pearson’s Correlation for reference. As we can see, the order would be different if we chose Pearson’s Correlation as the main metric.
So, what have we gained over correlation? One of the key advantages of Mutual Information is that it can be computed—and interpreted—between numerical and categorical variables without any variable transformation. For instance, we can directly compute the Mutual Information between Neighborhood and SalePrice. The question regarding the most important predictive variable can now be answered: it is Neighborhood.
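Another advantage is that Mutual Information captures nonlinear dependencies that correlation misses. The following sketch, using deliberately artificial data, shows a perfectly deterministic but non-monotonic relationship: Pearson's Correlation is exactly zero, while Mutual Information clearly detects the dependency:

```python
import math
from collections import Counter

# Illustrative data: y = x**2 on symmetric x values. The relationship is
# fully deterministic, yet linear correlation cannot see it.
xs = [-2, -1, 0, 1, 2] * 20
ys = [x * x for x in xs]

def pearson(a, b):
    """Pearson's correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

def mutual_information(a, b):
    """Mutual Information (in bits) between two discrete variables."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

print(round(pearson(xs, ys), 6))            # 0.0 -> correlation misses the link
print(round(mutual_information(xs, ys), 4)) # > 1 bit -> MI detects it
```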
Now that we have established the central role of Entropy and Mutual Information, we can apply these concepts in the next chapters for machine learning and network analysis.
Our modeling process begins with importing the dataset. You can download this dataset in CSV format via the link below or from data.world.
Note that this dataset from the Wisconsin Breast Cancer Database differs from the one we used in the original, printed edition of this book.
We start the Data Import Wizard with Main Menu > Data > Open Data Source > Text File.
Next, we select the file WBCD2.CSV. Then, the Data Import Wizard guides us through the required steps.
In Step 1 of the Data Import Wizard, we click Define Learning/Test Sets and specify that a Test Set should be set aside from the dataset to be imported.
We specify a random sample of 20% of the entire dataset to serve as a Test Set. The remaining 80% will serve as the Learning Set.
If you follow this tutorial and want to replicate the exact numerical values we present here, please check Fixed Seed under Options and set its value to 31. This ensures that the random number generator produces the same Learning Set and Test Set split that we use here.
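Conceptually, the fixed-seed split can be sketched as follows. Only the seed value (31) and the 80/20 ratio come from this tutorial; the shuffling and rounding conventions below are assumptions for illustration, not BayesiaLab's actual implementation:

```python
import random

# A minimal sketch of a reproducible 80/20 Learning/Test split,
# mirroring (outside BayesiaLab) what the Data Import Wizard does.
rows = list(range(569))            # 569 records, as in the WBCD dataset

rng = random.Random(31)            # fixed seed -> reproducible split
shuffled = rows[:]
rng.shuffle(shuffled)

test_n = len(shuffled) // 5        # 20% Test Set -> 113 records
test_set = shuffled[:test_n]
learning_set = shuffled[test_n:]   # remaining 80% -> 456 records

print(len(learning_set), len(test_set))  # 456 113
```

With the seed fixed, every run reproduces the same partition, which is what makes the numerical results in this tutorial replicable.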
In Step 2 of the Data Import Wizard, BayesiaLab suggests a data type for each variable.
It identifies the diagnosis variable as Discrete, and all the feature variables are interpreted as Continuous. These default assignments are all correct.
We only need to correct the variable id, which BayesiaLab initially considers Continuous. However, id is a code to identify each patient, so we must specify it as a Row Identifier.
In Step 3 of the Data Import Wizard, no action is required. Our dataset has no missing values, so applying any Missing Values Processing is unnecessary.
However, given that many datasets do contain missing values, we devoted an entire chapter to dealing with that problem. Please see Chapter 9: Missing Values Processing.
In Step 4 of the Data Import Wizard, we need to discretize the Continuous variables in the dataset. Even though we could specify a discretization method for each Continuous variable separately, we want to apply the same algorithm to all.
So, we click Select All Continuous, and all Continuous variables are highlighted in the data table. The Discretization Type and all related options will now apply to all the selected nodes.
Given that we are building a model to predict the target variable diagnosis, it makes perfect sense to discretize all the continuous feature variables with that objective in mind.
Thus, we choose the Tree algorithm from the drop-down menu in the Multiple Discretization panel. The Tree algorithm attempts to find discretization thresholds so that each feature variable's information gain with regard to the target variable is maximized.
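The idea behind this objective can be sketched with a single, illustrative split. BayesiaLab's Tree algorithm works recursively and differs in its details, but each split likewise maximizes information gain with respect to the target. The data below are toy values, not records from the dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """One split of a tree-style discretization: pick the threshold that
    maximizes information gain about the target (here: the diagnosis)."""
    order = sorted(zip(values, labels))
    xs = [v for v, _ in order]
    ys = [l for _, l in order]
    base, n = entropy(ys), len(ys)
    best = (0.0, None)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                       # no threshold between equal values
        left, right = ys[:i], ys[i:]
        cond = (i / n) * entropy(left) + ((n - i) / n) * entropy(right)
        gain = base - cond                 # information gain of this split
        if gain > best[0]:
            best = (gain, (xs[i - 1] + xs[i]) / 2)
    return best

# Toy data: small radii tend to be Benign, large ones Malignant.
radius = [9.5, 10.1, 11.0, 12.0, 14.0, 15.0, 16.3, 17.9]
target = ["B", "B", "B", "B", "M", "M", "M", "M"]
gain, threshold = best_threshold(radius, target)
print(round(gain, 3), threshold)  # 1.0 13.0 (a perfect split on toy data)
```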
Note that the Tree algorithm requires a Target, which has to be a Discrete variable or a Continuous variable that has already been discretized. In our context, diagnosis is Discrete and, therefore, available from the Target dropdown menu.
Note that the discretization algorithm only uses the records from the Learning Set for creating the discretization threshold. If a Learning/Test Set split is specified in Step 1 of the Data Import Wizard, BayesiaLab automatically restricts the discretization algorithms to the Learning Set.
While you can also create a Learning/Test Set split after completing the Data Import Wizard, it would compromise the Test Set. Such a Test Set would no longer be properly out-of-sample as it would have contributed to the discretization.
Bayesian networks are non-parametric probabilistic models. Therefore, there is no hypothesis with regard to the form of the relationships between variables (e.g., linear, quadratic, exponential, etc.). However, this flexibility has a cost. The number of observations necessary to quantify probabilistic relationships is higher than that required in parametric models. We use the heuristic of five observations per probability cell, which implies that the bigger the probability tables, the larger the number of observations must be.
Two parameters affect the size of a probability table: the number of parents and the number of states of the parent and child nodes. A machine-learning algorithm usually determines the number of parents based on the strength of the relationships and the number of available observations. The number of states, however, is our choice, which we can set by means of Discretization (for Continuous variables) and Aggregation (for Discrete variables).
We can use our heuristic of five observations per probability cell to help us with the selection of the number of discretization Intervals:
We usually look for an odd number of states to be able to capture non-linear relationships. Given that we have a relatively small learning set of only 456 observations, we should estimate how many parents would be allowed based on this heuristic and a discretization with 3 states:
No parent: 3×5=15
One parent: 3×3×5=45
Two parents: 3×3×3×5=135
Three parents: 3×3×3×3×5=405
Four parents: 3×3×3×3×3×5=1,215
Considering a discretization with 5 states, we would obtain the following:
No parent: 5×5=25
One parent: 5×5×5=125
Two parents: 5×5×5×5=625
By using this heuristic, we hypothesize about the size of the biggest CPT of the to-be-learned Bayesian network and multiply this value by 5. Experience tells us that this is a rather practical heuristic, which typically helps us find a structure. However, this is by no means a guarantee that we will find a precise quantification of the probabilistic relationships.
Indeed, our heuristic is based on the hypothesis that all the cells of the CPT are equally likely to be sampled. Of course, such an assumption cannot hold as the sampling probability of a cell depends on its probability, i.e., either a marginal probability if the node does not have parents, or if it does have parents, a joint probability defined by the parent states and the child state.
Given our 456 observations and the scenarios listed above, we select a discretization scheme with a maximum of 3 states. This is a maximum in the sense that the Tree discretization algorithm could return 2 states if 3 were not needed. This happens to be the case for fractal_dimension_se and texture_se.
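The heuristic from the two lists above can be wrapped in a small helper. The function and its name are ours, for illustration only, and are not part of BayesiaLab:

```python
# The five-observations-per-probability-cell heuristic as a function:
# with s states per node and p parents, the child's CPT has s**(p+1) cells,
# so we require at least 5 * s**(p+1) observations.
def max_parents(n_obs, states, obs_per_cell=5):
    """Largest number of parents the heuristic allows for a child node,
    assuming the parents and the child all have `states` states."""
    p = 0
    while obs_per_cell * states ** (p + 2) <= n_obs:
        p += 1
    return p

# With the 456-record Learning Set from this tutorial:
print(max_parents(456, 3))  # 3  (three parents need 405 obs; four need 1,215)
print(max_parents(456, 5))  # 1  (one parent needs 125 obs; two need 625)
```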
Upon clicking Finish, BayesiaLab imports and discretizes the entire dataset and concludes with Step 5 of the Data Import Wizard by offering an Import Report.
Clicking Yes brings up the Import Report.
It is interesting to see that all the variables have indeed been discretized with the Tree algorithm and that all Discretization Intervals are variable-specific.
This means that all the variables are marginally dependent on the Target Node (and vice versa). This is promising: the more dependent variables we have, the easier it should be to learn a good model for predicting the Target Node.
Upon closing the Import Report, we see a representation of the newly imported database as a fully unconnected Bayesian network in the Graph Window.
In the dataset, the variable diagnosis contained the codes B and M, representing Benign and Malignant, respectively. For reading the analysis reports, however, it will be easier to work with a proper State Name instead of an abbreviation.
By double-clicking the node diagnosis, we open the Node Editor and then go to the State Names tab. There, we associate States B and M with new State Names:
B → Benign
M → Malignant
With all variables represented as nodes in the Graph Window, we are ready to proceed to Supervised Learning in this tutorial.
The objective of what we call Supervised Learning is no different from that of predictive modeling. We wish to find regularities (a model) between the target variable and potential predictors from observations (e.g., historical data). Such a model will allow us to infer a distribution of the target variable from new observations. If the target variable is Continuous, the predicted distribution produces an expected value. For a Discrete target variable, we perform classification. The latter will be the objective of the example in this chapter.
As part of their studies in the late 1980s and 1990s, the research team generated what became known as the Wisconsin Breast Cancer Database, which contains measurements of hundreds of FNA samples and the associated diagnoses. Several versions of this database have been extensively studied, even outside the medical field. Statisticians and computer scientists have proposed a wide range of techniques for this classification problem and have continuously raised the benchmark for predictive performance.
The objective of this chapter is to show how Bayesian networks, in conjunction with machine learning, can be used for classification. Furthermore, we wish to illustrate how Bayesian networks can help researchers generate a deeper understanding of the underlying problem domain. Beyond merely producing predictions, we can use Bayesian networks to precisely quantify the importance of individual variables and employ BayesiaLab to help identify the most efficient path towards diagnosis.
“Most breast cancers are detected by the patient as a lump in the breast. The majority of breast lumps are benign, so it is the physician’s responsibility to diagnose breast cancer, that is, to distinguish benign lumps from malignant ones. There are three available methods for diagnosing breast cancer: mammography, FNA with visual interpretation, and surgical biopsy. The reported sensitivity (i.e., ability to correctly diagnose cancer when the disease is present) of mammography varies from 68% to 79%, of FNA with visual interpretation from 65% to 98%, and of surgical biopsy close to 100%.
Therefore mammography lacks sensitivity, FNA sensitivity varies widely, and surgical biopsy, although accurate, is invasive, time-consuming, and costly. The goal of the diagnostic aspect of our research is to develop a relatively objective system that diagnoses FNAs with an accuracy that approaches the best achieved visually.”
The Wisconsin Breast Cancer Database was created through the clinical work of Dr. William H. Wolberg at the University of Wisconsin Hospitals in Madison.
The dataset we are using for this tutorial contains 569 patient records, which contain a diagnosis plus features that were computed from digital images of fine-needle aspirates (FNA) of breast masses. More specifically, these features characterize the cell nuclei contained in the tissue samples.
ID number
Diagnosis (M=malignant, B=benign)
radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension ("coastline approximation" - 1)
For each feature, the mean, standard error, and "worst" or largest (mean of the three largest values) were computed. For this tutorial, however, we only use the mean values as variables.
The diagnosis variable was established via subsequent biopsies or long-term monitoring of the tumor. It consists of two classes: 357 benign cases (62.7%) and 212 malignant cases (37.2%).
Note that this dataset from the Wisconsin Breast Cancer Database is different from the one we used in the original, printed edition of this book.
The following topics explain each step of the Supervised Learning workflow on the basis of this example.
Data Import and Discretization
Supervised Learning: Markov Blanket
Supervised Learning: Augmented Markov Blanket
Supervised Learning: Structural Coefficient Analysis
Inference: Automatic Evidence-Setting
Inference: Adaptive Questionnaire
BayesiaLab offers an extension to the Markov Blanket algorithm, namely the Augmented Markov Blanket, which performs an Unsupervised Learning algorithm on the nodes in the Markov Blanket. This relaxes the constraint of requiring orthogonal child nodes. Thus, it helps identify any influence paths between the predictor variables and potentially improves the predictive performance. Adding such arcs would be similar to automatically creating interaction terms in a regression analysis.
As expected, the resulting network is slightly more complex than the Markov Blanket.
BayesiaLab offers a tool for formally comparing network structures, which we can apply to the Augmented Markov Blanket we just learned and the previously learned Markov Blanket.
We can use Main Menu > Tools > Compare > Structure to highlight the differences between both networks.
Given that the addition of two arcs is immediately visible, this function may seem like overkill for our example. However, in more complex situations, Structure Comparison can be rather helpful.
By default, the current network appears as the Reference Network, and for the Comparison Network, we select the previously learned Markov Blanket.
Clicking Compare brings up the Structure Comparison Report.
This report provides a list of arcs common to both networks and another list of those removed in the Comparison Network.
Clicking Structure Comparison shows a Synthesis Structure that visualizes these differences.
The arcs that exist in the Reference Structure, i.e., Augmented Markov Blanket, but do not exist in the Comparison Structure, i.e., the Markov Blanket, are highlighted in red.
Please see Direct Structural Network Comparison for a much more detailed explanation of this tool.
Given that the Augmented Markov Blanket algorithm has only added a single additional arc to the network, compared to what the Markov Blanket algorithm produced, we may not expect a dramatic difference in predictive performance.
However, any improvements in terms of reducing the False Negatives would be welcome. So, we run the Network Performance Analysis again: Main Menu > Analysis > Network Performance > Target.
As the analysis starts, BayesiaLab prompts us to specify the Target Evaluation Setting. Again, we select Evaluate All States and proceed.
Using the previously defined Test Set for evaluating our model, we obtain the performance report:
Now, we can compare the new performance metrics of the Augmented Markov Blanket with the ones previously obtained with the Markov Blanket.
Most notable is the change in False Negatives, which drops from 6 to 1, i.e., a reduction from 15% to 2.5% in the False Negative Rate.
If this performance were to hold, it could turn this model from moderately useful to a valuable diagnostic tool.
Recognizing the potential of the Augmented Markov Blanket algorithm, we proceed to the K-Folds Cross-Validation: Main Menu > Tools > Cross-Validation > Targeted Evaluation > K-Fold.
The steps are identical to what we did for the Markov Blanket, so we move straight to the report.
As it turns out, the Cross-Validation Report does not confirm the excellent False Negative Rate that the evaluation with regard to the Test Set suggested.
Nevertheless, comparing the Cross-Validation results, the Augmented Markov Blanket algorithm delivers an improvement over the Markov Blanket.
With the apparent advantage of the Augmented Markov Blanket model, we will now try to fine-tune this model further in pursuit of a performance gain.
Given our objective of predicting the state of the variable diagnosis, i.e., Benign versus Malignant, we define diagnosis as the Target Node.
Upon defining the Target Node, all Supervised Learning algorithms become available under Main Menu > Learning > Supervised Learning.
Upon learning the Markov Blanket for diagnosis and after having applied the Automatic Layout (shortcut P), the resulting Bayesian network appears as follows:
We can see that the obtained network is a Naive structure on a subset of nodes.
This means that diagnosis has a direct probabilistic relationship with concave_points, fractal_dimension, texture, and perimeter.
All other nodes remain unconnected. The lack of their connections with the Target Node implies that these nodes are independent of the Target Node, given the nodes in the Markov Blanket.
As we are not equipped with specific domain knowledge about the nodes, we will not further interpret these relationships but rather run an initial test regarding the Network Performance. We want to know how well this Markov Blanket model can predict the states of the diagnosis variable, i.e., Benign versus Malignant. This test is available via Main Menu > Analysis > Network Performance > Target.
As the analysis starts, BayesiaLab prompts us to specify the Target Evaluation Setting. In the given context, we select Evaluate All States and proceed.
Using the previously defined Test Set for evaluating our model, we obtain the initial performance results, including metrics such as Total Precision, R, R2, etc.
In the context of this example, the table in the center of the report, the so-called Confusion Matrix, is of special interest.
The Confusion Matrix features three tabs, Occurrences, Reliability, and Precision, which are illustrated below:
Of the 40 Malignant cases in the Test Set, 34 were identified correctly (True Positive Rate: 85%), and 6 were incorrectly predicted (False Negative Rate: 15%).
Of the 73 Benign cases in the Test Set, 68 were correctly identified as Benign (True Negative Rate: 93.15%), and 5 were incorrectly identified as Malignant (False Positive Rate: 6.85%).
The Overall Precision, which is reported at the top of the report window, is computed as the total number of correct predictions (True Positives + True Negatives) divided by the total number of cases in the Test Set, i.e., (68+34)÷113=90.265%.
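These figures can be verified directly from the Confusion Matrix counts reported above:

```python
# Recomputing the report's headline metrics from the Confusion Matrix.
# Counts are taken from the Test Set evaluation in this section:
# 40 Malignant cases (34 TP, 6 FN) and 73 Benign cases (68 TN, 5 FP).
tp, fn = 34, 6    # Malignant cases: correctly / incorrectly predicted
tn, fp = 68, 5    # Benign cases: correctly / incorrectly predicted

overall_precision = (tp + tn) / (tp + tn + fp + fn)
false_negative_rate = fn / (tp + fn)   # share of Malignant cases missed
false_positive_rate = fp / (tn + fp)   # share of Benign cases flagged

print(f"{overall_precision:.3%}")      # 90.265%
print(f"{false_negative_rate:.0%}")    # 15%
print(f"{false_positive_rate:.2%}")    # 6.85%
```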
An Overall Precision of around 90% is encouraging, but we must remember that we randomly selected the Test Set.
To mitigate any sampling artifacts that may occur in such a one-off Test Set, we can systematically learn networks on a series of different subsets and then aggregate the test results.
For this purpose, we perform a K-Folds Cross-Validation, which will iteratively select K different Learning Sets and Test Sets and then learn the corresponding networks and test their respective performance.
With this approach, we need to remove the original Learning Set and Test Set split. Right-clicking on the database icon in the lower right corner of the Graph Window brings up a menu. Here, we select Remove Learning/Test Split.
Then, K-Folds Cross-Validation can be started via Main Menu > Tools > Resampling > Target Evaluation > K-Fold:
We use the same learning algorithm as before, i.e., the Markov Blanket, and choose K=10 as the number of sub-samples to be analyzed.
Of the total dataset of 569 cases, each of the ten iterations (folds) will set aside a Test Set of approximately 57 randomly drawn samples and use the remaining 512 as the Learning Set. This means that BayesiaLab learns one network per Learning Set and then tests its performance on the respective Test Set. It is important to ensure that the Shuffle Samples option is checked.
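Conceptually, the fold construction looks like this. The shuffling shown is illustrative and not BayesiaLab's internal procedure:

```python
import random

# A minimal sketch of a 10-fold split: shuffle once, then let each fold
# serve as the Test Set while the other nine folds form the Learning Set.
def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)      # the "Shuffle Samples" option
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        learn = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield learn, test

sizes = [(len(l), len(t)) for l, t in k_fold_indices(569, 10)]
print(sizes[0])   # (512, 57): nine folds for learning, one for testing
```

Every record is used for testing exactly once, so the aggregated results cover the entire dataset.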
The summary, including the synthesized results, is shown below. These results confirm the good performance of this model.
The Total Precision is 92.97%, with a False Negative Rate of 11.32%. This means that 24 of the 212 Malignant cases were incorrectly predicted as Benign.
Clicking Comprehensive Report produces a summary with additional analysis options.
It is helpful to click the Network Comparison button to understand what exactly is happening during the K-Folds Cross-Validation.
It brings up a Synthesis Structure of all the networks learned during the K-Folds Cross-Validation.
Black arcs in the Synthesis Structure above indicate that these arcs were present in the Reference Structure (below), i.e., the network that was learned on the basis of the original Learning Set.
The thickness of the arcs in the Synthesis Structure reflects how often these links were found in the course of the K-Folds Cross-Validation. The blue-colored arc indicates that the link was only found in some folds but that it was not part of the Reference Structure. The thickness of the blue arc is also proportional to the number of folds in which that arc was added.
The first structure after the Synthesis Structure is the Reference Structure, which was the current network when we started the K-Folds Cross-Validation.
After the Reference Network, we arrive at Comparison Structure 0. This network structure was learned in 1 out of 10 folds.
Comparison Network 1 was found 1 out of 10 times.
Comparison Network 2 was found 3 out of 10 times.
Comparison Network 3 was found 4 out of 10 times.
Comparison Network 4 was found 1 out of 10 times.
So, the first network we learned from the original Learning Set, the Reference Structure, was only found in 1 of the 10 networks learned during the 10-Fold Cross-Validation.
Given the relatively small sample size of the original Learning Set (456), it is unsurprising that larger sample sizes, i.e., 512 records in each fold of the 10-Fold Cross-Validation, would lead to alternative structures.
Performing the K-Fold Cross-Validation shows that the Overall Precision of a Markov Blanket model can approach 93%.
However, a False Negative Rate of over 10% may prevent such a model from being useful for clinical purposes. In the context of diagnosing cancer, a False Negative means missing a malignant case.
As a result, we proceed to another algorithm to evaluate its potential for improved diagnostic performance: Supervised Learning: Augmented Markov Blanket.
The special status of the Target Node is highlighted by the bullseye symbol. We can now proceed to learn the Naive Bayes network: select Main Menu > Learning > Supervised Learning > Naive Bayes.
We switch to Validation Mode and select Main Menu > Analysis > Visual > Overall > Arc > Mutual Information.
This generates a circular arrangement of all nodes with the Target Node, SalePrice, in the center. Clicking the Stretch icon repeatedly, we expand the network to make it fit into the available screen space widthwise.
To further improve interpretability, we select Main Menu > View > Hide Information. Alternatively, we click the Hide Information icon in the Toolbar. This removes the information icons from the arcs; their presence indicates that further information is available to display, e.g., the numerical value of the Mutual Information of each arc.
Additionally, there is a tag on the database icon in the lower right corner of the Graph Window: the icon confirms that we have a Learning/Test Set split in place.
In an earlier chapter, we defined the qualitative and quantitative parts of a Bayesian network from existing (human) knowledge. A subsequent chapter described how we can define the qualitative part of a Bayesian network manually and then use data to estimate the quantitative part. In this chapter, we use BayesiaLab to generate both the structure and the parameters of a network automatically from data. This means we introduce machine learning for building Bayesian networks. The only guidance (or constraint) we provide is defining the variable of interest, i.e., the target of the machine-learning process. Hence, we speak of Supervised Learning (later, we will remove that constraint as well and perform Unsupervised Learning).
Given the sheer amount of medical knowledge in existence today, plus advances in artificial intelligence, so-called medical expert systems have emerged, which are meant to support physicians in performing medical diagnoses. In this context, several papers by Wolberg, Street, Heisey, and Mangasarian became much-cited examples. For instance, one of these papers proposed an automated method for the classification of Fine-Needle Aspirates (FNA) through image processing and machine learning, with the objective of achieving greater accuracy in distinguishing between malignant and benign cells for the diagnosis of breast cancer. At the time of their study, the practice of visual inspection of FNA yielded inconsistent diagnostic accuracy. The proposed new approach would increase this accuracy reliably to over 95%. This research was quickly translated into clinical practice and has since been applied with continued success.
To provide further background regarding this example, we quote :
You can download this dataset in CSV format via the link below or from .
Given the performance of the Markov Blanket algorithm in the previous section (), we are now looking for improvements by considering alternatives within the group of Supervised Learning algorithms.
After returning to the ( or ) we start this learning algorithm via Main Menu > Learning > Supervised Learning > Augmented Markov Blanket
.
Note that we are using the original Learning/Test Sets split again. The symbol tagged onto the database icon reminds us that the Learning Set and Test Set split is in place.
In the next section, , we explore how adjusting the Structural Coefficient can bring us closer to the performance limits of the model.
We need to specify this explicitly so that the Supervised Learning algorithm can focus on the characterization of the Target Node rather than on a representation of the entire joint probability distribution of the learning set.
Beyond distinguishing between predictors (connected nodes) and non-predictors (disconnected nodes), we can further examine each node's relationship with the Target Node diagnosis by highlighting the Mutual Information of the arcs connecting them.
This function is accessible in Validation Mode (F5) by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information.
Each arc's thickness is now proportional to the Mutual Information of the nodes it connects. Furthermore, an icon indicates that additional information, i.e., the Arc Comments, is available to be displayed.
So, we select Main Menu > View > Show Arc Comments. Alternatively, clicking the Show Arc Comment button in the Toolbar achieves the same result.
This allows us to examine the Mutual Information between all nodes and the Target Node diagnosis, which enables us to gauge the relative importance of the nodes.
The top value shown in the box attached to each arc is the absolute value of the Mutual Information. Below it, the percentage refers to the Relative Mutual Information.
We can scroll through all the networks discovered during the K-Folds Cross-Validation using the record selector icons .
| Markov Blanket | Augmented Markov Blanket |
| --- | --- |
| Overall Precision: 90.224% | Overall Precision: 94.669% |
| False Negative Rate: 15% | False Negative Rate: 2.5% |
| Markov Blanket | Augmented Markov Blanket |
| --- | --- |
| Overall Precision: 92.97% | Overall Precision: 94.2% |
| False Negative Rate: 11.32% | False Negative Rate: 9.43% |
Unsupervised Structural Learning is perhaps the purest form of knowledge discovery, as no hypotheses are constraining the exploration of possible relationships between variables. BayesiaLab offers a wide range of algorithms for that purpose. Making this technology easily accessible can potentially transform how researchers approach high-dimensional problem domains.
This chapter was written based on version 5.4 of BayesiaLab running on a Mac. All screenshots reflect that configuration.
We find the mass of data available from financial markets to be an ideal proving ground for experimenting with knowledge discovery algorithms that generate Bayesian networks. Comparing machine-learned knowledge with our personal understanding of the stock market can perhaps allow us to validate BayesiaLab’s “discoveries.” For instance, any structure that is discovered by BayesiaLab’s algorithms should be consistent with an equity analyst’s understanding of fundamental relationships between stocks.
In this chapter, we will utilize Unsupervised Learning algorithms to automatically generate Bayesian networks from daily stock returns recorded over a six-year period. We will examine 459 stocks from the S&P 500 index, for which observations are available over the entire timeframe. We selected the S&P 500 as the basis for our study, as the companies listed on this index are presumably among the best-known corporations worldwide, so even a casual observer should be able to critically review the machine-learned findings. In other words, we are trying to machine learn the obvious, as any mistakes in this process would automatically become self-evident. Quite often, experts’ reaction to such machine-learned findings is, “Well, we already knew that.” Indeed, that is the very point we want to make, as machine learning can—within seconds—catch up with human expertise accumulated over the years and then rapidly expand beyond what is already known.
The power of such algorithmic learning will be still more apparent in entirely unknown domains. However, if we were to machine learn the structure of a foreign equity market for expository purposes, we would probably not be able to judge the resulting structure as plausible or not.
In addition to generating human-readable and interpretable structures, we want to illustrate how we can immediately use machine-learned Bayesian networks as “computable knowledge” for automated inference and prediction. Our objective is to gain both a qualitative and quantitative understanding of the stock market by using Bayesian networks. In the quantitative context, we will also show how BayesiaLab can carry out inference with multiple pieces of uncertain and even conflicting evidence. The ability of Bayesian networks to perform computations under uncertainty makes them suitable for a wide range of real-world applications.
The S&P 500 is a free-float capitalization-weighted index of the prices of 500 large-cap common stocks actively traded in the United States, which has been published since 1957. The stocks included in the S&P 500 are those of large publicly held companies that trade on either of the two largest American stock market exchanges: the New York Stock Exchange and the NASDAQ. For our case study, we have tracked the daily closing prices of all stocks included in the S&P 500 index from January 3, 2005, through December 30, 2010, only excluding those stocks that were not traded continuously over the entire study period. This leaves a total of 459 stock prices with 1,510 observations each. The top three panels show the S&P 500 Index, plus the stock prices for Apple Inc. and Yahoo! Inc. Note that the plot of the S&P 500 Index is only shown for reference; the index will not be included in the analysis.
Rather than treating the time series in levels, we difference the stock prices and compute the daily returns. More specifically, we will take differences in the logarithms of the levels, which is a good approximation of the daily stock return in percent. After this transformation, 1,509 observations remain. The bottom three panels display the returns.
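The differencing step can be sketched in a few lines, assuming the prices live in a pandas Series (the ticker and the price values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for one ticker over six trading days
prices = pd.Series([41.0, 41.5, 40.8, 42.1, 42.0, 43.3], name="PG")

# Daily log return: difference of the logarithms of consecutive closing prices
log_returns = np.log(prices).diff().dropna()

# For small moves, log returns closely approximate simple percentage returns
simple_returns = prices.pct_change().dropna()
print(np.allclose(log_returns, simple_returns, atol=0.01))  # True for small moves
```

Note that differencing consumes one observation, which is why 1,509 returns remain from 1,510 price observations.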
Up to this point, the differences in model structure and the corresponding performance were a result of the learning algorithm, i.e., Markov Blanket vs. Augmented Markov Blanket.
Now we explore how different levels of network complexity could potentially improve the Augmented Markov Blanket model. In other words, could a more complex network provide better performance without risking over-fitting?
To modify a network's complexity, we now introduce the Structural Coefficient.
Throughout this chapter, we abbreviate "Structural Coefficient" as "SC."
This parameter allows changing the internal number of observations N′ and, thus, determines a kind of “significance threshold” for network learning. Consequently, it influences the degree of complexity of the induced networks. The internal number of observations is defined as:
N′ = N / SC,
where N is the number of samples in the dataset.
By default, SC is set to 1, which reliably prevents the learning algorithms from overfitting the model to the data. However, in studies with relatively few observations, the analyst’s judgment is needed as to whether a downward adjustment of this parameter can be justified. Reducing SC means increasing N′, which is akin to increasing the number of observations in the dataset via resampling.
On the other hand, increasing SC beyond 1 means reducing N′, which can help manage the complexity of networks learned from large datasets. Conceptually, reducing N′ is equivalent to sampling a subset of the dataset.
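The effect of SC on the internal observation count can be sketched as follows, assuming the internal count is N′ = N / SC, consistent with the behavior described in the text (BayesiaLab's exact internal bookkeeping may differ):

```python
def internal_observations(n_samples: int, sc: float) -> float:
    """Internal number of observations N' = N / SC — a sketch of the
    relationship described in the text."""
    if sc <= 0:
        raise ValueError("SC must be positive; SC=0 would imply an infinite dataset")
    return n_samples / sc

# SC = 1 (default): the learning algorithm sees the dataset as-is
print(internal_observations(1509, 1.0))   # 1509.0
# SC = 0.5: N' doubles, so weaker relationships pass the significance threshold
print(internal_observations(1509, 0.5))   # 3018.0
# SC = 2: N' halves, yielding sparser networks on large datasets
print(internal_observations(1509, 2.0))   # 754.5
```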
We now perform a Structural Coefficient Analysis on the basis of the Augmented Markov Blanket, which generates several metrics that help trade off complexity and fit: Main Menu > Tools > Multi-Run > Structural Coefficient Analysis.
Note that we use the original Learning/Test Set split again, which allows us to directly compare the in-sample and out-of-sample predictive performance as a function of varying SC levels.
BayesiaLab prompts us to specify the range of SC values to be examined and the number of iterations to be performed. It is worth noting that the minimum SC value should not be set to 0, or even close to 0, without careful consideration.
An SC value of 0 would create a fully connected network, which can take a very long time to learn, depending on the number of variables, or even exceed the memory capacity of the computer running BayesiaLab. Technically, SC=0 implies an infinite dataset, which results in all relationships between nodes becoming significant.
Setting the Number of Iterations determines the interval steps to be taken within the specified range of the Structural Coefficient. We choose 10 iterations over an SC range between 0.1 and 1, which gives us increments of 0.1. With more complex models and more data, we might be more conservative and start with a narrower range, e.g., 0.5 to 1.
Clicking OK opens up a report that shows the range of changes due to modifying the Structural Coefficient.
We only show a portion of the report here and omit a discussion of its elements. For a thorough explanation of this report, please see Structural Coefficient Analysis.
Instead, we focus on the Curve function, which can be activated by clicking on the corresponding button at the bottom of the report. This tool can plot the metrics that we specified earlier in the settings as we started the Structural Coefficient Analysis.
Our objective is to determine the correct level of network complexity for reliably high predictive performance while avoiding the over-fitting of the data. By clicking Curve, we can plot several different metrics for this purpose.
Selecting Structure/Target Precision Ratio provides a helpful measure for making trade-offs between predictive performance versus network complexity.
This plot can be best interpreted when following the curve from right to left. Moving to the left along the x-axis lowers the Structural Coefficient, which, in turn, results in a more complex Structure.
It becomes problematic when the Structure value increases faster than the Precision value, i.e., when we increase complexity without improving Precision.
Typically, the “elbow” of the L-shaped curve identifies this critical point. Here, a visual inspection suggests that the “elbow” is around SC=0.4. The portion of the curve further to the left on the x-axis, i.e., SC<0.4, shows that the structure is increasing without improving precision, which suggests overfitting. Hence, SC=0.4 could be a good value to examine further.
Another sign of overfitting is when the predictive performance of a model starts to diverge between the Learning Set and the Test Set. This means that the out-of-sample performance is no longer comparable to the in-sample performance.
This is precisely what we can observe with the Target Precision Curves for both the Learning Set and the Test Set. For SC>0.5, the curves are parallel, which means that in-sample and out-of-sample performance move in sync.
However, as the SC value drops below 0.5, the Learning Set performance increases while the Test Set performance drops, i.e., the curves diverge. The Target Precision for the Learning Set keeps increasing, while the Target Precision for the Test Set drops.
Having considered the curves in both of the above plots, we choose SC=0.5 for further evaluation.
The SC value can be set by right-clicking on the background of the Graph Panel and then selecting Edit Structural Coefficient from the Contextual Menu, or via the menu: Main Menu > Edit > Edit Structural Coefficient.
The SC value can then be set with a slider or by typing in a numerical value.
As expected, this produces a more complex network.
The key question is, will this increase in complexity deliver a performance advantage over the previously learned models?
So, we perform K-Folds Cross-Validation again, this time using the Augmented Markov Blanket at SC=0.5. The right panel in the overview below shows the results.
For comparison, we also show the performance of the earlier models we learned, i.e., the Markov Blanket (SC=1) in the left panel and the Augmented Markov Blanket (SC=1) in the center panel.
Among a myriad of other available measures, we have typically referenced the Overall Precision for evaluation purposes. In this regard, the latest Augmented Markov Blanket (SC=0.5) does not show an improvement, i.e., the Overall Precision remains at 94.2%.
So, is there any benefit to the added complexity? It would depend on the context. Here, the objective is to distinguish between benign and malignant cell samples. Presumably, a false negative would be the worst forecast error. It would label a malignant sample as benign and perhaps cause a delay in a patient's treatment.
Focusing on the False Negative Rate of the three models, we see an improvement from 11.32% (left) to 9.43% (center) to 8.49% (right). This means that the best model in this regard reduces the number of False Negatives by about one-quarter.
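Rates like these can be reproduced from raw confusion counts; the counts in this sketch are hypothetical, chosen only to match the quoted percentages:

```python
# Sketch: deriving the False Negative Rate from raw confusion counts.
# The counts below are hypothetical, not taken from the tutorial's dataset.
def false_negative_rate(tp: int, fn: int) -> float:
    """FN rate = FN / (FN + TP): share of actual positives predicted negative."""
    return fn / (fn + tp)

# e.g., 12 malignant samples missed out of 106 actual malignant cases
print(round(false_negative_rate(tp=94, fn=12), 4))  # 0.1132
# ...versus 9 missed out of the same 106 malignant cases
print(round(false_negative_rate(tp=97, fn=9), 4))   # 0.0849
```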
There are numerous other approaches available in BayesiaLab to help improve the model further, e.g., choosing a different learning algorithm, learning structural priors, reviewing the discretization, etc.
However, for the purposes of this tutorial, we conclude our model optimization efforts here and continue on the basis of the Augmented Markov Blanket (SC=0.5) in the next section: Model Inference.
Early in this chapter, we used the Augmented Markov Blanket algorithm to machine-learn a predictive model for classifying cell samples.
Subsequently, we optimized the model with the Structural Coefficient Analysis workflow.
We can now use the validated model for analysis and inference.
In this and the next two sections of this chapter, we look at different ways of performing inference:
Automatic-Evidence Setting
Adaptive Questionnaire
Target Interpretation Tree
Before proceeding to Automatic Evidence-Setting, we bring up all the Monitors connected to the Target Node in the Monitor Panel.
Since we have a Target Node, we can right-click inside the Monitor Panel and select Sort > Target Correlation from the Monitor Panel Context Menu.
Alternatively, we can do the same via Main Menu > Monitor > Sort > Target Correlation.
The Monitor of the Target Node is placed first in the Monitor Panel, followed by the other Monitors according to their “correlation” with the Target Node, from highest to lowest.
Note that “correlation” is not meant literally in this context. Rather, the sort order of the Monitors is determined by Mutual Information.
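For reference, the Mutual Information between a node and the Target Node can be computed from their joint probability table. This numpy sketch uses a made-up joint distribution of a binary feature and the binary target:

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """Mutual information I(X;Y) in bits from a joint probability table."""
    px = joint.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

# Hypothetical joint distribution P(feature, diagnosis)
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
print(round(mutual_information(joint), 4))  # ≈ 0.3973 bits
```

Sorting the Monitors then amounts to ranking the nodes by this quantity, from highest to lowest.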
Given that we have a predictive model in place, we can use BayesiaLab to review its individual predictions record by record.
This feature is called Automatic Evidence-Setting, which can be accessed here: Main Menu > Inference > Automatic Evidence-Setting.
In earlier releases of BayesiaLab, this function was called Interactive Inference.
The first record in the dataset is displayed in the screenshot below as record #0.
Additionally, the Row Identifier, 842302, is displayed in the Status Bar to the right of the Progress Bar at the bottom of the Graph Window.
The Monitors display the values of the variables in that record, i.e., the set of evidence or observations.
Given these observations, the model predicts a 92.36% probability that the diagnosis is malignant (the Monitor of the Target Node features a green background).
With such a high probability, diagnosis=malignant is the rational prediction.
As it turns out, it is indeed the correct prediction for record #0 (Row Identifier 842302). The actual value recorded in the dataset is represented by a light blue bar, meaning diagnosis=malignant was the ground truth in this case.
As we know from all the validation steps, the model performs well with an Overall Precision above 90%. Hence, most predictions are clear, just like this one.
However, exceptions exist, such as record #138 (Row Identifier 868826). Here, the probability of diagnosis=benign is approximately 51%. Given this probability, the model predicts diagnosis=benign. However, this turns out to be incorrect. Here, the actual observation is diagnosis=malignant, which is again highlighted by a light blue bar.
In situations in which individual cases are under review, e.g., when diagnosing a patient, BayesiaLab can provide diagnostic support by means of the Adaptive Questionnaire.
This approach helps prioritize what variable to investigate or what pieces of evidence to collect in order to reduce the uncertainty regarding a target variable of interest.
Whenever you have a Bayesian network with a Target Node, regardless of whether the network was machine-learned or created from expert knowledge, you can launch the Adaptive Questionnaire.
Importantly, the Adaptive Questionnaire seeks the optimal sequencing of evidence for a specific case or instance rather than creating a set of rules that apply in general.
For creating a generalized set of priorities, please see Target Interpretation Tree in this chapter.
The Adaptive Questionnaire can be started via Main Menu > Inference > Adaptive Questionnaire.
For a Target Node with more than two states, it can be helpful to specify a Target State for the Adaptive Questionnaire.
Setting a Target State allows BayesiaLab to compute the Binary Mutual Information with respect to the designated state.
However, as the Target Node in our example is binary, setting a Target State is superfluous here.
Furthermore, we can set the cost of collecting observations via the Cost Editor, which can be started by clicking the Edit Observations Costs button.
This is helpful if certain variables are more costly to observe or require more effort to obtain than others. So, Costs do not necessarily have to represent a financial cost. For instance, we could make Costs proportional to the difficulty of collecting observations.
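The trade-off between information gain and observation cost can be sketched with a simple ranking. The scoring rule used here (Mutual Information divided by Cost) and all numbers are illustrative assumptions; BayesiaLab's internal criterion may differ:

```python
# Sketch: prioritizing observations by information gain per unit cost.
candidates = {
    # name: (mutual information with target in bits, observation cost)
    "blood_pressure": (0.15, 1.0),   # cheap bedside measurement
    "blood_panel":    (0.30, 4.0),   # lab work, moderate cost
    "mri_scan":       (0.55, 25.0),  # expensive imaging
}

ranking = sorted(candidates,
                 key=lambda v: candidates[v][0] / candidates[v][1],
                 reverse=True)
print(ranking)  # cheap-but-informative measurements come first
```

Under this scoring, the MRI ranks last despite carrying the most information, because its cost outweighs the gain.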
In analyzing Fine Needle Aspirates, all image attributes are obtained simultaneously. As a result, this particular domain is not ideal for demonstrating the Adaptive Questionnaire.
A better example would be a diagnostic process, in which a clinician collects observations from a patient in a targeted way. We can imagine that a physician starts the diagnosis process by collecting easy-to-obtain data, such as blood pressure, before proceeding to more elaborate (and more expensive) diagnostic techniques, such as performing an MRI.
Here, we simulate using the Adaptive Questionnaire as if we could choose the order of collecting evidence.
After starting the Adaptive Questionnaire, BayesiaLab presents the Monitor of the Target Node and displays its marginal probability. That Monitor is highlighted in green.
Furthermore, the Monitors are automatically sorted in descending order with regard to the Target Node by taking into account the Mutual Information (or Binary Mutual Information, if applicable) and the Cost of obtaining the evidence:
Given this order, it would be ideal to collect the value of as the first observation.
Let us suppose that we can do that and obtain as the first piece of evidence.
Upon setting that state in the Monitor of , the Monitor Panel is updated as follows:
In the Monitor of the Target Node, we see that the probability of has increased to 69.08%.
The order of Monitors is resorted according to Mutual Information and Cost:
The Monitor of the node we just observed drops to the bottom of the list. Given that we already know its value, no further information can be gained from it.
Also, the distributions of all the not-yet-observed nodes have changed, with increasing substantially.
The small gray arrows inside the Monitors indicate how much the probabilities have changed.
Note that we are not merely seeing the next-in-line Monitor "moving up." Rather, the entire list is recomputed, given the most recent piece of evidence.
For instance, when we started the Adaptive Questionnaire, was two spots ahead of . Given the last observation, however, has become more important than area.
So, given the above ordering, would be the next best evidence to obtain.
In real-world applications, it is possible that the ideal evidence is not available and, therefore, must be skipped. We simulate such a situation by observing instead of .
We find and enter that evidence in the corresponding Monitor:
The probability of decreases to 56.23%.
The order of the remaining unobserved nodes is now:
For this iteration, we follow the top recommendation, i.e., , and observe .
The probability of decreases further to 15.70%.
The order of the remaining unobserved nodes is now:
For this iteration, we follow the recommendation and observe .
The probability of increases to 19.04%.
At this point, only remains unobserved.
For the last node, , we obtain .
In this hypothetical example, the last observation appears to have a rather substantial impact on the diagnosis.
The Monitor of the Target Node now reports that has a probability of 93.05%.
The Adaptive Questionnaire is a highly practical tool for seeking the optimal next piece of evidence when trying to determine the state of a Target Node.
We used the Adaptive Questionnaire via the Graphical User Interface in BayesiaLab in this example. For situations when end-users do not have access to the BayesiaLab software, you can publish an Adaptive Questionnaire via the WebSimulator. This allows anyone to interact with an Adaptive Questionnaire through a web browser.
Finally, BayesiaLab can produce a static version of the Adaptive Questionnaire, which can be used entirely offline. This tool is the Target Interpretation Tree, which we discuss in the next section.
We use BayesiaLab’s Data Import Wizard to load all 459 time series into memory from the data file SP500.csv. BayesiaLab automatically detects the column headers, which contain the ticker symbols that we will use as variable names (a ticker symbol is an abbreviation used to uniquely identify publicly traded stocks).
The next step identifies the variable types contained in the dataset and, as expected, BayesiaLab finds 459 Continuous variables.
There are no missing values in this database, so the next step of the Data Import Wizard can be skipped entirely. We still show this step below for reference, although all options are grayed out.
While we can defer a discussion of Missing Values Processing, for now, we must carefully consider our options in the next step of the Data Import Wizard. Here, we need to discretize all Continuous variables, which means all 459 variables in our case. In the context of Unsupervised Learning, we do not have a specific target variable. Hence, we have to choose one of the univariate discretization algorithms. Following the recommendations presented in Chapter 6, we choose K-Means. Furthermore, given the number of available observations, we aim for a discretization with 5 bins, as per the heuristic discussed in Chapter 6 under Discretization Intervals.
While helpful, any such heuristics should not be considered conclusive. Only once a model is learned can we properly evaluate the adequacy of the selected discretization. In BayesiaLab, the Discretization function also remains available anytime after completing the Data Import Wizard, which makes experimentation with different discretization methods and intervals very easy. In the Modeling Mode, we can start a new discretization with Main Menu > Learning > Discretization.
Before proceeding with the automatic discretization of all variables, we examine the type of density functions that we can find in this dataset. We use the built-in plotting function for this purpose, which is available in the next step of the Data Import Wizard.
After selecting Manual from the Discretization drop-down menu and then clicking Switch View, we obtain the probability density function of the first variable A.
Without formally testing it for normality, we judge that the distribution of A (Agilent Technologies Inc.) does resemble the shape of a Normal distribution. In fact, the distributions of all variables in this dataset appear fairly similar, which further supports our selection of the K-Means algorithm. We click Finish to perform the discretization. A progress bar is shown to report the progress.
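For intuition, univariate K-Means discretization can be sketched without BayesiaLab. This numpy-only toy implementation clusters synthetic, roughly Normal returns into five bins; the initialization and stopping rules are simplified assumptions, not BayesiaLab's algorithm:

```python
import numpy as np

def kmeans_bins(values: np.ndarray, k: int = 5, iters: int = 50) -> np.ndarray:
    """Univariate K-Means discretization sketch: cluster the values into k
    groups and return the k-1 bin boundaries between sorted cluster centers."""
    centers = np.quantile(values, np.linspace(0.1, 0.9, k))  # spread initial centers
    for _ in range(iters):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = values[labels == j].mean()
    centers.sort()
    return (centers[:-1] + centers[1:]) / 2  # boundaries midway between centers

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=1509)  # synthetic daily log returns
boundaries = kmeans_bins(returns, k=5)
print(len(boundaries))  # 4 cut points define 5 bins
```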
After completing the Data Import Wizard, the variables are presented in the Graph Panel. The original variable names, which were stored in the first row of the dataset, become our Node Names.
At this point, it is practical to add Node Comments to associate full company names with the short ticker symbols. We use a Dictionary file for that purpose: SP500_Names.txt.
This file can be loaded into BayesiaLab via Data > Associate Dictionary > Node > Comments.
As the name implies, selecting Main Menu > View > Display Node Comments reveals the full company names.
Node Comments can be displayed for either all nodes or only for selected ones.
This gives us an opportunity to compare the variables’ statistics with our understanding of the domain. At first glance, mean values of near-zero for all distributions might suggest that stock prices remained “flat” throughout the observation period. For the S&P 500 index, this was actually true. However, it could not be true for all individual stocks, given that the Apple stock, for instance, increased ten-fold in value between 2005 and 2010. The seemingly low returns are due to the fact that we are studying daily returns rather than annual returns. On that basis, even the rapid appreciation of the Apple stock translates into an average daily return of “only” 0.2%. A “sanity check” of this kind is the prudent thing to do before proceeding to machine learning.
The computational complexity of BayesiaLab’s Unsupervised Learning algorithms exhibits quadratic growth as a function of the number of nodes. However, the Maximum Weight Spanning Tree (MWST) is constrained to learning a tree structure (one parent per node), which makes it much faster than the other algorithms. More specifically, the MWST algorithm includes only one procedure with quadratic complexity, namely the initialization procedure that computes the matrix of bivariate relationships.
Given the number of variables in this dataset, we decide to use the MWST. Performing the MWST algorithm with a file of this size should only take a few seconds. Moreover, using BayesiaLab’s layout algorithms, the tree structures produced by MWST can be easily transformed into easy-to-interpret layouts. Thus, MWST is a practical first step for knowledge discovery. Furthermore, this approach can be useful for verifying that there are no coding problems, e.g., with variables that are entirely unconnected. Given the quick insights that can be gleaned from it, we recommend using MWST at the beginning of most studies.
In addition to its one-parent constraint, MWST is also unique in that it is the only learning algorithm in BayesiaLab that allows us to choose the scoring method for learning, i.e., Minimum Description Length (MDL) or Pearson’s Correlation. Unless we are certain about the linearity of the yet-to-be-learned relationships between variables, Minimum Description Length is the better choice and, hence, the default setting.
At first glance, the resulting network does not appear simple and tree-like at all.
Let us suppose we are interested in Procter & Gamble (PG). First, we look for the corresponding node using the Search function (Ctrl+F). Note that we can search for the full company name if we check Include Comments. Furthermore, we can use a combination of wildcards in the search, e.g., “*” as a placeholder for a character string of any length or “?” for a single character.
Selecting PG from the listing search results makes the corresponding node flash for a few seconds so it can be found among the hundreds of nodes on the screen.
Once located, we can zoom in to see PG and numerous adjacent nodes.
As it turns out, the “neighborhood” of Procter & Gamble contains many familiar company names, mostly from the consumer packaged goods industry. Perhaps these companies appear all too obvious, and one might wonder what insight we gained at this point. The chances are that even a casual observer of the industry would have mentioned Kimberly-Clark, Colgate-Palmolive, and Johnson & Johnson as businesses operating in the same field as Procter & Gamble. Therefore, one might argue similar stock price movements should be expected.
The key point here is that—without any prior knowledge of this domain—a computer algorithm automatically extracted a structure that is consistent with the understanding of anyone familiar with this domain.
Beyond interpreting the qualitative structure of this network, there is a wide range of functions for gaining insight into this high-dimensional problem domain. For instance, we may wish to know which node within this network is most important. In Chapter 6, we discussed the question in the context of a predictive model, which we learned with Supervised Learning. Here, on the other hand, we learned the network with an Unsupervised Learning algorithm, which means that there is no Target Node. As a result, we need to think about the importance of a node with regard to the entire network, as opposed to a specific Target Node.
We need to introduce a number of new concepts to equip us for the discussion about node importance within a network. We draw on concepts from information theory, which we first introduced in Chapter 5 under Information-Theoretic Concepts.
BayesiaLab’s Arc Force is computed by using the Kullback-Leibler Divergence, denoted by D_KL(P||Q), which compares two joint probability distributions, P and Q, defined on the same set of variables X:
D_KL(P||Q) = ∑_X P(X) log₂ ( P(X) / Q(X) ),
where P is the Joint Probability Distribution represented by the current network B, and Q is that of the exact same network as B, except that we removed the arc under study.
It is important to note that Mutual Information and Arc Force are closely related (see Comparing Mutual Information and Arc Force). If the child node in the pair of nodes under study has no other parents, Mutual Information and Arc Force are, in fact, equivalent. However, Arc Force is the more powerful measure, as it considers the network’s Joint Probability Distribution rather than only the bivariate relationship.
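This equivalence can be checked numerically for a minimal two-node network X → Y, where removing the single arc makes the joint distribution factorize into the product of marginals; the joint table below is hypothetical:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P||Q) in bits over the same discrete state space."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

# Two-node network X -> Y: P(x, y) = P(x) * P(y | x)
p_joint = np.array([[0.30, 0.20],
                    [0.10, 0.40]])

# Removing the arc makes Y independent of X: Q(x, y) = P(x) * P(y)
px = p_joint.sum(axis=1, keepdims=True)
py = p_joint.sum(axis=0, keepdims=True)
q_joint = px @ py

arc_force = kl_divergence(p_joint, q_joint)
# With no other parents on Y, the Arc Force equals I(X;Y) for this table
print(round(arc_force, 4))  # ≈ 0.1245 bits
```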
Upon activating Arc Force, we can see that the arcs have different thicknesses. Also, an additional control panel becomes available in the menu.
The slider in this control panel allows us to set the Arc Force threshold below which arcs and nodes will be grayed out in the Graph Panel. By default, it is set to 0, which means that the entire network is visible. Using the Previous and Next buttons, we can step through all threshold levels. For instance, by starting at the maximum and then going down one step, we highlight the arc with the strongest Arc Force in this network, which is between SPG (Simon Property Group) and VNO (Vornado Realty Trust).
The Node Force can be derived directly from the Arc Force. More specifically, there are three types of Node Force in BayesiaLab:
The Incoming Node Force is the sum of the Arc Forces of all incoming arcs.
The Outgoing Node Force is the sum of the Arc Forces of all outgoing arcs.
The Total Node Force is the sum of the Arc Forces of all incoming and outgoing arcs.
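These three definitions amount to simple sums over a table of Arc Forces; the sketch below uses hypothetical arc-force values for three arcs from the stock network:

```python
# Sketch: deriving the three Node Force variants from a set of Arc Forces.
# The (source, destination): force values below are hypothetical.
arc_forces = {("BEN", "SPG"): 0.41, ("SPG", "VNO"): 0.82, ("VNO", "BXP"): 0.35}

def node_forces(node: str) -> tuple[float, float, float]:
    incoming = sum(f for (src, dst), f in arc_forces.items() if dst == node)
    outgoing = sum(f for (src, dst), f in arc_forces.items() if src == node)
    return incoming, outgoing, incoming + outgoing  # Total = Incoming + Outgoing

print(tuple(round(x, 2) for x in node_forces("SPG")))  # (0.41, 0.82, 1.23)
```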
The Node Force can be shown directly on the Bayesian network graph. Upon switching to the Validation Mode (F5), we select Analysis > Visual > Node Force.
After starting Node Force, we have another additional control panel available in the menu.
The slider in this control panel allows us to set the Node Force threshold below which nodes will be grayed out in the Graph Panel. By default, it is set to 0, meaning all nodes are visible. Conversely, by setting the threshold to the maximum, all nodes are grayed out. Using Previous and Next, we can step through the entire range of thresholds. This functionality is analogous to the control panel for Arc Force.
For example, by starting at the maximum and then going down one step, we can find the node with the strongest Node Force in this network, which is BEN (Franklin Resources), a global investment management organization.
This analysis tool also features a “local” Mapping function, which is particularly useful when dealing with big networks, such as the one in this example with hundreds of nodes. We refer to this as a “local” Mapping function in the sense of only being available in the context of Node Force Analysis, as opposed to the “general” Mapping function, which is always available within the Validation Mode as a standalone analysis tool (Main Menu > Analysis > Visual > Mapping).
Choosing Static Font Size from the Contextual Menu and then, for instance, reducing the threshold by four more steps reveals the five strongest nodes while maintaining an overview of the entire network.
Throughout this book so far, we have performed inference with various types of evidence. The nature of the problem domain routinely determined the kind of evidence we used.
We will now use this example to systematically try out all types of evidence for performing inference. With that, we depart from our habit of showing only realistic applications. One could certainly argue that not all types of evidence are plausible in the context of a Bayesian network that represents the stock market. In particular, any inference we perform here with arbitrary evidence should not be interpreted as an attempt to predict stock prices. Nevertheless, for the sake of an exhaustive presentation, even this somewhat contrived exercise shall be educational.
Within our large network of 459 nodes, we will only focus on a small subset of nodes, namely PG (Procter & Gamble), JNJ (Johnson & Johnson), and KMB (Kimberly-Clark). These nodes come from the “neighborhood” shown earlier.
We highlight PG, JNJ, and KMB to bring up their Monitors. Prior to setting any evidence, we see their marginal distributions in the Monitors. We see that the expected value (mean value) of the returns is 0.
Next, we double-click the state JNJ>0.012 to compute the posterior probabilities of PG and KMB, given this evidence. The gray arrows indicate how the distributions have changed compared to their state prior to setting the evidence. Given the evidence, the expected values of PG and KMB are now 1.2% and 0.6%, respectively.
If we also set KMB to its highest state (KMB>0.012), this would further reduce the uncertainty of PG and compute an expected value of 1.8%. This means that PG had an average daily return of 1.8% on days when this evidence was observed.
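For a single node, the quantity reported in the Monitor reduces to a conditional expectation over its discretized states; the bin values and posterior in this sketch are hypothetical, not the actual network's numbers:

```python
import numpy as np

# Sketch of the inference behind the Monitors: expected value of one node's
# return given hard evidence on another. All numbers are hypothetical.
pg_state_values = np.array([-0.012, -0.003, 0.0, 0.003, 0.012])  # bin means

# Hypothetical posterior P(PG state | evidence), shifted toward positive returns
posterior = np.array([0.05, 0.10, 0.15, 0.30, 0.40])

expected_return = float(posterior @ pg_state_values)
print(round(expected_return, 4))  # 0.0048, i.e., about +0.5% per day
```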
Given the discrete states of nodes, setting Hard Evidence is presumably intuitive to understand. However, the nature of many real-world observations calls for so-called Probabilistic Evidence or Numerical Evidence. For instance, the observations we make in a domain can include uncertainty. Also, evidence scenarios can consist of values that do not coincide with the values of nodes’ states. So, as an alternative to Hard Evidence, we can use BayesiaLab to set such evidence.
Probabilistic Evidence is a convenient way of directly encoding our assumptions about possible conditions of a domain. For example, a stock market analyst may consider a scenario with a specific probability distribution for JNJ corresponding to a hypothetical period of time (i.e., a subset of days). Given his understanding of the domain, he can assign probabilities to each state, thus encoding his belief.
After removing the prior evidence, we can set such beliefs as Probabilistic Evidence by right-clicking the JNJ Monitor and then selecting Enter Probabilities.
For the distribution of Probabilistic Evidence, the sum of the probabilities must be equal to 100%. We can adjust the Monitor’s bar chart by dragging the bars to the probability levels that reflect the scenario under consideration. By double-clicking on the percentages, we can also directly enter the desired probabilities. Note that changing the probability of any state automatically updates the probabilities of all other states to maintain the sum constraint.
To remove a degree of freedom in the sum constraint, we left-click the State Name/Value in the Monitor to the right of each bar. Doing so locks the currently set probability and turns the corresponding bar green. The probability of this state will no longer be automatically updated while the probabilities of other states are being edited. This feature is essential for defining a distribution on nodes with more than two states. Another left-click on the same State Name/Value unlocks the probability again.
Using either validation method, BayesiaLab computes a likelihood distribution that produces the requested probability distribution. By setting this distribution, BayesiaLab also performs inference automatically and updates the probabilities of the other nodes in the network.
Instead of a specific probability distribution, an observation or scenario may exist as a single numerical value, meaning we must set Numerical Evidence. For instance, a stock market analyst may wish to examine how other stocks performed given a hypothetical period of time during which the average of the daily returns of JNJ was −1%. Naturally, this requires that we set evidence on JNJ with an expected (mean) value of −0.01 (=−1%). However, this task is not as straightforward as it may sound. The challenge will become apparent as we follow the steps to set this evidence.
First, we right-click JNJ Monitor and then select Enter Target Value/Mean from the Contextual Menu.
Next, we type “−0.01” into the dialog box for Target Mean/Value. Additionally, as was the case with Probabilistic Evidence, we have to choose the type of validation, but we now have three options under Observation Type:
Fix Mean, which is the same as the purple button, except that the likelihood is dynamically computed to maintain the mean value, although the probability distribution can change as a result of setting additional evidence.
Apart from setting the validation method, we also need to choose the Distribution Estimation Method, as we need to come up with a distribution that produces the desired mean value. Needless to say, there are many distributions that could potentially produce a mean value of −0.01. However, which one is appropriate?
To make a prudent choice, we must first understand what the evidence represents. Only then can we choose from the three available algorithms for generating the Target Distribution that will produce the Target Mean/Value.
Using the MinXEnt algorithm, the Target Distribution, which produces the Target Mean/Value, is computed so that the Cross-Entropy between the original probability distribution of the node and the Target Distribution is minimized. The Monitor below shows the distribution with a mean of −0.01, which is “closest” in terms of Cross-Entropy to the original marginal distribution shown earlier.
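For illustration, a standard result says that the distribution minimizing cross-entropy to a prior under a mean constraint is an exponential tilting of that prior. The following Python sketch solves for such a distribution by bisection; it is our own illustrative reconstruction of the idea, not BayesiaLab's implementation, and the state values are hypothetical:

```python
import math

def minxent_mean(prior, values, target, iters=200):
    """MinXEnt-style Numerical Evidence: among all distributions with
    mean `target`, the one closest to `prior` in cross-entropy is an
    exponential tilting, p_i proportional to q_i * exp(lam * v_i).
    We solve for the tilting parameter lam by bisection on the mean."""
    def tilted(lam):
        m = max(lam * v for v in values)                 # stabilize exp()
        w = [q * math.exp(lam * v - m) for q, v in zip(prior, values)]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(p * v for p, v in zip(tilted(lam), values))

    lo, hi = -1e4, 1e4                                   # mean() increases with lam
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)
```

Note that setting the target equal to the prior's current mean returns (numerically) the prior itself, which matches the observation that MinXEnt leaves the distribution unchanged when the Numerical Evidence equals the current expected value.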
If we select Binary, the Target Mean/Value is generated by interpolating between values of two adjacent states, hence the name. Here, a “mix” of the values of two states, i.e., JNJ<=−0.009 and JNJ<=−0.002, produces the desired mean of −0.01.
With Value Shift, the Target Mean/Value is generated by shifting the values of each particle (or virtual observation) by the exact same amount.
As we see in the examples above, using different Target Distributions as Numerical Evidence—albeit with the same mean value—results in different probability distributions.
The Binary algorithm produces the desired value through interpolation, as in Fuzzy Logic. Among the three available methods, it generates distributions that have the lowest degree of uncertainty. Using the Binary algorithm for generating a Target Mean/Value would be appropriate if two conditions are met:
There is no uncertainty regarding the evidence, i.e., we want the evidence to represent a specific numerical value. “No uncertainty” would typically apply in situations in which we want to simulate the effects of nodes that represent variables under our control.
The desired numerical value is not directly available by setting Hard Evidence. In fact, a distribution produced by the Binary algorithm would coincide with Hard Evidence if the requested Target Value/Mean precisely matched the value of a particular state.
Given that it is impossible to set prices in the stock market directly, our stock example is clearly not well suited to illustrating the Binary algorithm. Perhaps a price elasticity model would be more appropriate. In such a model, we would want to infer sales volume based on one specific price level instead of a broad range of price levels within a distribution.
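To make the interpolation concrete, the following Python sketch computes Binary-style weights on the two adjacent state values that bracket a target mean. The state values below are hypothetical, and the function is our illustration rather than BayesiaLab's code:

```python
def binary_evidence(state_values, target):
    """Binary interpolation: put all probability mass on the two
    adjacent state values that bracket `target`, weighted so the
    expected value equals `target` exactly."""
    vs = sorted(state_values)
    for lo, hi in zip(vs, vs[1:]):
        if lo <= target <= hi:
            w = (target - lo) / (hi - lo)   # weight on the upper state
            return {lo: 1.0 - w, hi: w}
    raise ValueError("target lies outside the range of state values")
```

Because only two states carry probability mass, the resulting distribution has very low uncertainty, which is precisely why Binary is reserved for evidence we fully control.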
The other two algorithms, MinXEnt and Value Shift, generate Soft Evidence. This means that the Target Distribution they supply should be understood as posterior distribution given evidence set on a “hidden cause”, i.e., evidence on a variable not included in the model. As such, using MinXEnt or Value Shift is suitable for creating evidence that represents changing levels of measures like customer satisfaction. Unlike setting the price of a product, we cannot directly adjust the satisfaction of all customers to a specific level. This would imply setting an unrealistic distribution with low or no uncertainty.
More realistically, we would have to assume that higher satisfaction results from an enhanced product or better service, i.e., a cause from outside the model. Thus, we need to generate evidence for customer satisfaction as if a hidden cause produced it. This also means that MinXEnt and Value Shift will produce a distribution close to the marginal one if the targeted Numerical Evidence is close to the marginal value.
If the Numerical Evidence equals the current expected value, using MinXEnt (a) or Value Shift (b) will not change the distribution. Using the Binary algorithm (c), however, will return a different distribution (except in the context of a binary node).
In the examples shown so far, setting evidence typically reduced uncertainty with regard to the node of interest. Just by visually inspecting the distributions, we can tell that setting evidence generally produces “narrower” posterior probabilities.
However, this is not always the case. Occasionally, separate pieces of evidence can conflict with each other. We illustrate this by setting such evidence on JNJ and KMB. We start with the marginal distribution of all nodes.
After setting Numerical Evidence (using MinXEnt) with a Target Mean/Value of +1.5% on JNJ.
The posterior probabilities inferred from the JNJ evidence indicate that the PG distribution is more positive than before. More importantly, the uncertainty regarding PG is lower. A stock market analyst would perhaps interpret the JNJ movement as a positive signal and hypothesize about a positive trend in the CPG industry. In an effort to confirm this hypothesis, he would probably look for additional signals supporting the trend and the related expectations regarding PG and similar companies.
In the KMB Monitor, the gray arrows and “(+0.004)” indicate that the first evidence increases the expectation that KMB will also increase in value. If we observed, however, that KMB decreased by 1.5% (once again using MinXEnt), this would go against our expectations.
The result is that we now have a more uniform probability distribution for PG—rather than a narrower distribution. This increases our uncertainty about the state of PG compared to the marginal distribution.
Even though it appears that we have “lost” information by setting these two pieces of evidence, we may have a knowledge gain after all: we can interpret the uncertainty regarding PG as a higher expectation of volatility.
Beyond a qualitative interpretation of contradictory evidence, our Bayesian network model allows us to examine “conflict” beyond its common-sense meaning. A formal conflict measure can be defined by comparing the joint probabilities of the current model versus a reference model, given the same set of evidence for both.
A fully unconnected network is commonly used as the reference model, the so-called “straw model.” It is a model that considers all nodes to be marginally independent. If the joint probability of the set of evidence returned by the model under study is lower than that of the reference model, we determine that we have a conflict. Otherwise, if the joint probability is higher, we conclude that the pieces of evidence are consistent.
The conflict measures that are available in BayesiaLab are formally defined as follows:

Overall Conflict

OC(E) = log2 [ P(e1) × P(e2) × … × P(en) / P(e1, e2, …, en) ]

where E = {e1, …, en} is the current set of evidence consisting of n observations and ei is the i-th piece of evidence. A positive value indicates a conflict between the pieces of evidence.

Bayes Factor

BF(e, E) = P(e | E) / P(e)

where e is a hypothetical piece of evidence that has not yet been set or observed.

Local Conflict (Local Consistency)

LC(e, E) = log2 BF(e, E)

A positive value of LC indicates that e is consistent with the current set of evidence E.
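For illustration, these measures are straightforward to compute once the relevant probabilities are known. In the sketch below, the function names are ours, and the input probabilities would come from the model's inference engine:

```python
import math

def overall_conflict(marginals, joint):
    """log2 ratio of the evidence's probability under the straw model
    (product of marginals) to its joint probability under the current
    model; positive values indicate conflicting evidence."""
    return math.log2(math.prod(marginals) / joint)

def bayes_factor(p_e_given_E, p_e):
    """Bayes Factor of a hypothetical observation e given the current
    set of evidence E."""
    return p_e_given_E / p_e

def local_consistency(p_e_given_E, p_e):
    """log2 of the Bayes Factor; positive values mean e is consistent
    with the evidence already set."""
    return math.log2(bayes_factor(p_e_given_E, p_e))
```

With independent evidence, the joint probability equals the product of the marginals and the Overall Conflict is zero; conflict appears as soon as the model's joint probability drops below that product.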
Using these definitions, we can compute to what extent a new observation would be consistent with the current set of evidence. BayesiaLab provides us with this capability in the form of the Evidence Analysis Report, which can be generated by selecting Main Menu > Analysis > Report > Evidence Analysis.
The Evidence Analysis Report displays two closely-related metrics, Local Consistency (LC) and the Bayes Factor (BF), for each state of each unobserved node in the network, given the set of evidence. The top portion of this report is shown below. Also, as we anticipated, an Overall Conflict between the two pieces of evidence is shown at the top of the report.
Unsupervised Learning is a practical approach for obtaining a general understanding of simultaneous relationships between many variables in a database. The learned Bayesian network facilitates visual interpretation plus the computation of omnidirectional inference, which can be based on any type of evidence, including uncertain and conflicting observations. Given these properties, Unsupervised Learning with Bayesian networks becomes a universal tool for knowledge discovery in high-dimensional domains.
For this example, we need to override the default data type for the variable named Product as it is a nominal product identifier rather than a numerical value. We can change this variable’s data type by highlighting the Product column and clicking the Discrete radio button. This changes the color of the Product column to red. We also define Purchase Intent and Intensity as Discrete variables. Their number of states is suitable for our purposes.
The next step is the Discretization and Aggregation screen. Given the number of observations, it is appropriate to reduce the number of states of the ratings from the original 10 states (1–10) to a smaller number. All these variables measure satisfaction on the same scale, i.e., from 1 to 10. Following our earlier recommendations (see Discretization Intervals in Chapter 6), the best choice in this context is the Equal Distance discretization.
By clicking Select All Continuous, we highlight all to-be-discretized variables. Then, we choose the type of discretization to be applied, which is Equal Distance. Furthermore, given the number of observations, we choose 5 bins for the discretization.
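The Equal Distance discretization simply splits the observed range into intervals of identical width. A minimal sketch (our own, for illustration):

```python
def equal_distance_bins(xmin, xmax, n_bins=5):
    """Boundaries of the Equal Distance discretization: n_bins
    intervals of identical width spanning the range [xmin, xmax]."""
    width = (xmax - xmin) / n_bins
    return [xmin + i * width for i in range(n_bins + 1)]
```

For the 1–10 satisfaction scale with 5 bins, this yields interval boundaries at 1, 2.8, 4.6, 6.4, 8.2, and 10.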
Clicking Finish finalizes the import process. Upon completion, we are asked whether we want to view the Import Report.
As there is no uncertainty with regard to the outcome of the discretization, we decline and automatically obtain a fully unconnected network with 49 nodes.
Structural Equation Modeling is a statistical technique for testing and estimating causal relations using a combination of statistical data and qualitative causal assumptions. This definition of a Structural Equation Model (SEM) was articulated by the geneticist Sewall Wright (1921), the economist Trygve Haavelmo (1943), and the cognitive scientist Herbert Simon (1953), and formally defined by Judea Pearl (2000). Structural Equation Models (SEM) allow both confirmatory and exploratory modeling, meaning they are suited to both theory testing and theory development.
What we call Probabilistic Structural Equation Models (PSEMs) in BayesiaLab are conceptually similar to traditional SEMs. However, PSEMs are based on a Bayesian network structure as opposed to a series of equations. More specifically, PSEMs can be distinguished from SEMs in terms of key characteristics:
In general, specifying and estimating a traditional SEM requires a high degree of statistical expertise. Additionally, the multitude of manual steps involved can make the entire SEM workflow extremely time-consuming. The PSEM workflow in BayesiaLab, on the other hand, is accessible to non-statistician subject matter experts. Perhaps more importantly, it can be faster by several orders of magnitude. Finally, once a PSEM is validated, it can be utilized like any other Bayesian network. This means that the full array of analysis, simulation, and optimization tools is available to leverage the knowledge represented in the PSEM.
In this chapter, we present a prototypical PSEM application: key drivers analysis and product optimization based on consumer survey data. We examine how consumers perceive product attributes and how these perceptions relate to the consumers’ purchase intent for specific products.
Given the inherent uncertainty of survey data, we also wish to identify higher-level variables, i.e., “latent” variables that represent concepts that are not directly measured in the survey. We do so by analyzing the relationships between the so-called “manifest” variables, i.e., variables that are directly measured in the survey. Including such concepts helps in building more stable and reliable models than what would be possible using manifest variables only.
Our overall objective is to make surveys clearer to interpret by researchers and make them “actionable” for decision-makers. The ultimate goal is to use the generated PSEM for prioritizing marketing and product initiatives to maximize purchase intent.
This study is based on a monadic consumer survey about perfumes, which was conducted by a market research agency in France. In this study, each respondent evaluated only one perfume.
In this example, we use survey responses from 1,320 women who have evaluated a total of 11 fragrances (representative of the French market) on a wide range of attributes:
A PSEM is a hierarchical Bayesian network that can be generated through a series of machine-learning and analysis tasks:
All relationships in a PSEM are probabilistic—hence the name, as opposed to having deterministic relationships plus error terms in traditional SEMs.
PSEMs are nonparametric, which facilitates the representation of nonlinear relationships plus relationships between categorical variables.
The structure of PSEMs is partially or fully machine-learned from data.
27 ratings on fragrance-related attributes, such as Sweet, Flowery, Feminine, etc., measured on a 1–10 scale.
12 ratings with regard to imagery about someone who wears the respective fragrance, e.g. Sexy, Modern, measured on a 1–10 scale.
1 variable for Intensity, measured on a 1–5 scale. Intensity is listed separately due to a priori knowledge of its non-linearity and the existence of a “just-about-right” level.
1 variable for Purchase Intent, measured on a 1–6 scale.
1 nominal variable, Product, for product identification.
Unsupervised Learning to discover the strongest relationships between the manifest variables.
Variable Clustering, based on the learned Bayesian network, to identify groups of variables that are strongly connected.
Multiple Clustering: we consider the strong intra-cluster connections identified in the Variable Clustering step to be due to a “hidden common cause.” For each cluster of variables, we use Data Clustering—on the variables within the cluster only—to induce a latent variable representing the hidden cause.
Unsupervised Learning to find the interrelations between the newly-created latent variables and their relationships with the Target Node.
The Structural Coefficient icon now indicates that we are employing an SC value other than the default of 1.
The Structural Coefficient icon features an unbalanced scale. This symbolizes that we departed from the balanced weighting of fit and complexity. Instead, we have "put our thumb on the scale" to pursue a better fit of our model while accepting a higher complexity.
After returning to the Modeling Mode, we relearn the network using the same Augmented Markov Blanket algorithm as before.
For all types of inference with a Bayesian network model, we need to switch to Validation Mode.
As an extension of the Main Menu, a Navigation Bar and its record selectors allow us to scroll through all records in the dataset.
Once the Node Comments are loaded, a small call-out symbol appears next to each Node Name, confirming that the Dictionary was associated successfully.
Before proceeding with the first learning step, we recommend switching to the Validation Mode to verify the results of the import and discretization.
We return to Modeling Mode and select Main Menu > Learning > Unsupervised Structural Learning > Maximum Spanning Tree.
This can be addressed with BayesiaLab’s built-in layout algorithms. Selecting Main Menu > View > Automatic Layout (Shortcut: P) quickly rearranges the network to reveal the tree structure. The resulting reformatted Bayesian network can now be readily interpreted.
The Arc Force can be displayed directly on the Bayesian network graph. Upon switching to the Validation Mode, we select Main Menu > Analysis > Visual > Arc Force.
We launch the Mapping window by clicking the Mapping icon on the control panel to the right of the slider. In this network view, the size of the nodes is directly proportional to the selected type of Node Force (Incoming, Outgoing, Total). The width of the links is proportional to the Arc Force. Changing the threshold values (with the slider, for example) automatically updates the view.
There are two ways to validate the entered distribution, via the green and the purple buttons. Clicking the green button defines a static likelihood distribution. This means that any additional piece of evidence on other nodes can update the distribution we set.
Clicking the purple button “fixes” the probability distribution we entered by defining dynamic likelihoods. This means that each new piece of evidence triggers an update of the likelihood distribution in order to maintain the same probability distribution.
No Fixing, which is the same as the green button, i.e., validation with static likelihood.
Fix Probabilities, which is the same as the purple button, i.e., validation with dynamic likelihood.
We have already described all the steps of this workflow in previous chapters. Therefore, we present most of the following screenshots without commentary and only highlight items specific to this example. To start, we open the corresponding data file.
The Cluster Cross-Validation was merely a review, and it did not change the factors we confirmed when we clicked the Validate Clustering button. Although we have defined these Factors in terms of Classes of Manifest variables, we still need to create the corresponding latent variables via Multiple Clustering. This process creates one discrete Factor for each Cluster of variables by performing Data Clustering on each subset of clustered manifest variables.
In traditional statistics, deriving such latent variables or factors is typically performed by means of Factor Analysis, e.g., Principal Components Analysis (PCA).
Before we run this automatically across all factors with the Multiple Clustering algorithm, we will demonstrate the process on a single cluster of nodes, namely the nodes associated with Factor_0: Active, Bold, Character, Fulfilled, and Trust. We simply delete all other nodes and arcs and save this subset of nodes as a new, separate XBL file.
The objective of BayesiaLab’s Data Clustering algorithm is to create a node that compactly represents the joint probability distribution defined by the variables of interest. We start Data Clustering via Main Menu > Learning > Clustering > Data Clustering. Unless we select a subset, Data Clustering will be applied to all nodes.
In the Data Clustering dialog box, we set the options as shown below. Among the settings, we need to point out that we leave the number of states of the to-be-induced factor open; we only set a range, e.g., 2–5. This means we let BayesiaLab determine the optimal number of states for representing the joint probability distribution.
Upon completing the clustering process, we obtain a graph with newly induced [Factor_0] being connected to all its associated manifest variables.
Furthermore, BayesiaLab produces a window that contains a comprehensive Clustering Report.
Given its size, we show it as two separate tables below.
Relative Significance
In the second part of the report, the variables are sorted by Relative Significance with respect to the Target Node, which is [Factor_0].
RS(Mi, F) = I(Mi, F) / maxj I(Mj, F)

where Mi represents the manifest variable, and F represents the factor variable. The function I(·,·) computes the Mutual Information.
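For illustration, Mutual Information and a Relative Significance normalized by the strongest variable (the max-normalization is our assumption) can be computed from a joint probability table as follows:

```python
import math

def mutual_information(joint):
    """I(X, Y) in bits from a joint probability table (rows: X, cols: Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

def relative_significance(mi_by_variable):
    """Each manifest variable's Mutual Information with the factor,
    divided by the maximum across variables, so the strongest
    variable scores 1."""
    top = max(mi_by_variable.values())
    return {name: mi / top for name, mi in mi_by_variable.items()}
```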
From the window that contains the report, we can also produce a Mapping of the Clusters.
This graph displays three properties of the identified Cluster States (Cluster 1–Cluster 5) within the new Factor node, [Factor_0]:
The saturation of the blue represents the purity of the Cluster States: the higher the purity, the higher the saturation of the color. Here, all purities are in the 90%+ range, which is why they are all deep blue.
The sizes represent the respective marginal probabilities of the Cluster States. We will see this distribution again once we open the Monitor of the new factor node.
The distance between any two clusters reflects how closely the clusters neighbor each other.
Quadrants
Clicking the Quadrants button in the report window brings up the options for graphically displaying the nodes' relative importance with regard to the induced [Factor_0].
For our example, we select Mutual Information. Furthermore, we do not need to normalize the means as all values of the Manifest nodes in [Factor_0] are recorded on the same scale.
This Quadrant Plot highlights two measures that are relevant for interpretation:
Mutual Information on the y-axis, i.e., the importance of each Manifest variable with regard to [Factor_0].
The mean value of each Manifest variable on the x-axis.
This plot shows that the most important variable is Trust with I(Trust,[Factor_0])=1.26. It is also the variable with the highest expected satisfaction level, i.e., E(Trust)=6.79.
When hovering with the cursor over the plot, the upper panel of the Quadrant Plot window returns the exact coordinates of the respective point, i.e., Mutual Information and Mean Value in this example.
We return to the Graph Panel after closing the Quadrant Plot and the report window. It shows the newly induced [Factor_0] directly connected to all its associated Manifest variables. Applying the Automatic Layout (Shortcut: P) generates a suitable view of the network produced by the Data Clustering process.
In the Monitor of [Factor_0], we see that the name of each Cluster State carries a value shown in parentheses, e.g., C1 (2.022). This value is the weighted average of the associated Manifest variables given the state C1, where the weight of each variable is its Relative Significance with respect to [Factor_0]. That means that given the state C1 of [Factor_0], the weighted mean value of Trust, Bold, Fulfilled, Active, and Character is 2.022. This becomes more apparent when we actually set [Factor_0] to state C1.
Given that all the associated Manifest variables share the same satisfaction level scale, the values of the created states can also be interpreted as satisfaction levels. Cluster State C1 summarizes the “low” ratings across the Manifest nodes. Conversely, C5 represents mostly the “high” ratings; the other states cover everything else in between.
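The weighted-average computation behind these state values can be sketched as follows; the variable names and numbers below are hypothetical, and the weights stand for the Relative Significances:

```python
def cluster_state_value(conditional_means, significance):
    """Value attached to a Cluster State: the average of the manifest
    variables' conditional means given that state, weighted by each
    variable's Relative Significance with respect to the factor."""
    total = sum(significance.values())
    return sum(conditional_means[m] * w for m, w in significance.items()) / total
```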
It is important to understand that each underlying record was assigned a specific state of [Factor_0]. In other words, the hidden variable is no longer hidden. It has been added to the dataset and imputed for all respondents. The imputation is done via Maximum Likelihood: given the satisfaction levels observed for each of the 5 Manifest variables, the state with the highest posterior probability is assigned to the respondent.
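The assignment itself is simply an argmax over the factor's posterior distribution for each respondent; a one-line illustration (state names hypothetical):

```python
def impute_state(posterior):
    """Assign the factor state with the highest posterior probability,
    given the manifest ratings observed for one respondent."""
    return max(posterior, key=posterior.get)
```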
We can easily verify this by scrolling through each record in the dataset. To do so, we first set [Factor_0] as Target Node by right-clicking on it and selecting Set as Target Node from the Contextual Menu. Note that the Monitor corresponding to the Target Node turns red.
Then, we select Main Menu > Inference > Interactive Inference.
While the performance indices shown in the Data Clustering Report have already included some measures of fit, we can further study this point by starting a more formal performance analysis via Main Menu > Analysis > Network Performance > Overall.
The resulting report provides measures of how well this network represents the underlying dataset.
Of particular interest is BayesiaLab’s Contingency Table Fit (CTF), which measures the quality of the JPD representation. It is defined as:

CTF(B, D) = 100 × (LL(B) − LL(Bu)) / (LL(Bf) − LL(Bu))

where:
LL(B) is the mean of the log-likelihood of the data given the network B currently under study,
LL(Bu) is the mean of the log-likelihood of the data given the fully unconnected network Bu, i.e., the “worst-case scenario,” and
LL(Bf) is the mean of the log-likelihood of the data given the fully connected network Bf, i.e., the “best-case scenario.” The fully connected network is the complete graph in which all nodes have direct links to all other nodes. Therefore, it is the exact representation of the chain rule without any conditional independence assumptions in the representation of the joint probability distribution.
Accordingly, we can interpret the following key values of the CTF:
CTF is equal to 0 if the network represents a joint probability distribution no different than the one produced by the fully unconnected network, in which all the variables are marginally independent.
CTF is equal to 100 if the network represents the joint probability distribution of the data without any approximation, i.e., it has the same log-likelihood as the fully connected network.
The main benefit of employing CTF as a quality measure is that it has normalized values ranging between 0% and 100%.
This measure can become negative if the parameters of the model are not estimated from the currently associated dataset.
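For illustration, the CTF computation amounts to placing the current network's mean log-likelihood on a scale between the two reference networks; the function name and example values below are ours:

```python
def contingency_table_fit(ll, ll_unconnected, ll_connected):
    """CTF: normalized position of the current network's mean
    log-likelihood between the fully unconnected network (0%)
    and the fully connected network (100%)."""
    return 100.0 * (ll - ll_unconnected) / (ll_connected - ll_unconnected)
```

A network whose parameters were estimated on a different dataset can score below the unconnected baseline, which is how negative CTF values arise.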
It must be emphasized that CTF measures only the quality of the network in terms of its data fit. As such, it represents the second term in the definition of the MDL Score:

MDL(B, D) = α × DL(B) + DL(D | B)

where DL(B) is the description length of the network structure and its parameters, DL(D | B) is the description length of the data given the network, and α is the Structural Coefficient. Even though the higher the CTF, the better the representation of the JPD, we are not aiming for CTF=100%. This would conflict with the objective of finding a compact representation of the JPD.
The Naive structure of the network used for Data Clustering implies that the entire JPD representation relies on the Factor node. Removing this node would produce a fully unconnected network with a CTF=0%. Therefore, BayesiaLab excludes—but does not remove—the Factor node when computing the CTF. This allows measuring the quality of the JPD representation with the induced clusters only.
It is not easy to recommend a threshold value below which the Factor should be “reworked,” as the CTF depends directly on the size of the JPD and the number of states of the Factor. For instance, given a Factor with 4 states and 2 binary Manifest variables, a CTF any lower than 100% would be a poor representation of the JPD, as the JPD only consists of 4 cells. On the other hand, given 10 manifest variables, with 5 states each, and a Factor also consisting of 5 states, a CTF of 50% would be a very compact representation of the JPD. This means that 5 states would represent a JPD of 5^10 cells with a quality of 50%.
Returning to the context of our PSEM workflow, we have the following 3 conditions:
A maximum number of 5 variables per cluster of variables;
Manifest variables with 5 states;
Factors with a maximum of 5 states.
In this situation, we recommend using 70% as an alert threshold. However, this threshold level would have to be reduced if conditions #1 and #2 increased in their values or if condition #3 decreased.
The previous section on Data Clustering dealt exclusively with the induction of [Factor_0]. In our perfume study, however, we have 15 clusters of Manifest variables, for which 15 Factors need to be induced. This means all steps applicable to Data Clustering must be repeated 15 times. BayesiaLab simplifies this task by offering the Multiple Clustering algorithm, which automates all necessary steps for all Factors.
We now return to the original network. On this basis, we can immediately start Multiple Clustering: Main Menu > Learning > Clustering > Multiple Clustering
.
Compared to the dialog box for Data Clustering, the options for Multiple Clustering are much expanded. Firstly, we need to specify an Output Directory for the to-be-learned networks. This will produce a separate network for each Factor, which we can subsequently examine. Furthermore, we want the new Factors to be connected to their Manifest variables, but we do not wish the Manifest variables to be connected among themselves. We have already learned the relationships between the manifest variables during Step 1. These relationships will ultimately be encoded via the connections between their associated Factors upon completion of Step 3. We consider these new Factor nodes to belong to the second layer of our hierarchical Bayesian network. This also means that, at this point, all structural learning involving the nodes of the first layer, i.e., the Manifest variables, is completed.
We set the above requirements via Connect Factors to their Manifest Variables and Forbid New Relations with Manifest Variables. Another helpful setting is Compute Manifests’ Contributions to their Factor, which helps to identify the dominant nodes within each Factor.
The Multiple Clustering process concludes with a report showing details regarding the generated clustering. Among the many available metrics, we can check the minimum value of the Contingency Table Fit, which is reported as 76.16%. Given the recommendations we provided earlier, this suggests that we did not lose too much information by inducing the latent variables.
We can save the report or proceed to the new network in the Graph Panel, which has all nodes arranged in a grid-like arrangement: Manifest variables are on the left; the new Factors are stacked up on the right.
Upon applying Automatic Layout (P), we can identify 15 Factors surrounded by their Manifest nodes, arranged almost like a field of flowers.
The Arc Comments, shown by default, display the Contribution of each Manifest variable towards its Factor. Once we turn off the Arc Comments and turn on the Node Comments, we see that the Node Comments contain the name of the “strongest” associated Manifest variable, along with the number of associated Manifest variables in parentheses. The following graph includes a subset of the nodes with their respective Node Comments.
Also, by going into our previously specified output directory, we can see that 15 new sub-networks—in BayesiaLab’s XBL format for networks—were generated. Any of these files would allow us to study the properties of the sub-network, as we did above for [Factor_0], the single Factor generated by Data Clustering.
Additionally, one more file was created in this directory, which is highlighted in the screenshot below. The file marked with the suffix “_Final” is the network that consists of both the original Manifest variables and the newly created Factors. As such, it is labeled as the “final” network in BayesiaLab parlance. It is also the network that is currently active.
In this context, BayesiaLab also created two new Classes:
Manifest, which contains all the manifest variables;
Factor, which contains all the latent variables.
Opening the Class Editor confirms their presence (highlighted below).
BayesiaLab’s Variable Clustering is a hierarchical agglomerative clustering algorithm that uses Arc Force (i.e., the Kullback-Leibler Divergence) for computing the distance between nodes.
At the start of Variable Clustering, each manifest variable is treated as a distinct cluster. The clustering algorithm proceeds iteratively by merging the “closest” clusters into a new cluster. Two criteria are used for determining the number of clusters:
Stop Threshold: a minimum Arc Force value below which clusters are not merged (a kind of significance threshold).
Maximum Cluster Size: the maximum number of variables per cluster.
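For illustration, the agglomerative procedure with these two stopping criteria can be sketched as follows. The linkage rule used here (taking the strongest pairwise Arc Force between clusters) is an assumption of this sketch, not necessarily BayesiaLab's exact criterion:

```python
def variable_clustering(variables, force, stop_threshold, max_size):
    """Agglomerative clustering driven by Arc Force. `force` maps
    frozenset({a, b}) to the Arc Force between variables a and b."""
    clusters = [frozenset([v]) for v in variables]

    def linkage(c1, c2):
        # strongest pairwise force between members of the two clusters
        return max(force.get(frozenset([a, b]), 0.0) for a in c1 for b in c2)

    while len(clusters) > 1:
        candidates = [(linkage(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters)
                      if i < j and len(c1) + len(c2) <= max_size]
        if not candidates:
            break                  # Maximum Cluster Size rules out all merges
        best, i, j = max(candidates)
        if best < stop_threshold:
            break                  # Stop Threshold: remaining links too weak
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The variable names in any usage would be the manifest variables; both criteria can terminate the merging, just as described above.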
These criteria can be set via Main Menu > Options > Settings > Learning > Variable Clustering.
We do not advise changing the Stop Threshold, but Maximum Cluster Size is more subjective. For building PSEMs, we recommend a value between 5 and 7, for reasons that will become clear when we show how latent variables are generated. If, however, the goal of Variable Clustering is dimensionality reduction, we suggest increasing the Maximum Cluster Size to a much higher value, thus effectively eliminating it as a constraint.
The Variable Clustering algorithm can be started via Main Menu > Learning > Clustering > Variable Clustering or by using the shortcut S.
In this example, BayesiaLab identified 15 clusters, and each node is now color-coded according to its cluster membership. The following image shows the standalone graph—outside the BayesiaLab window for better legibility.
BayesiaLab offers several tools for examining and editing the proposed cluster structure. They are accessible from an extended menu bar (highlighted in the screenshot below).
Also, the Dendrogram can be copied directly as a vector or bitmap graphic by right-clicking on it. Alternatively, it can be exported in various formats via the Save As... button. As such, it can be imported into documents and presentations. This ability to copy and paste graphics applies to most graphs, plots, and charts in BayesiaLab.
By hovering over any of the cluster “bubbles” with the cursor, BayesiaLab displays a list of all manifest nodes connected to that particular cluster. Each list of manifest variables is sorted according to the intra-cluster Node Force. This also explains the names displayed on the clusters. By default, each cluster takes on the name of the strongest manifest variable.
As explained earlier, BayesiaLab uses two criteria to determine the default number of clusters. We can change this number via the selector in the menu bar.
The Dendrogram and the Mapping view respond dynamically to any changes to the number of clusters.
The result of the Variable Clustering algorithm is purely descriptive. Once the question regarding the number of clusters is settled, we need to formally confirm our choice by clicking the Validate Clustering button in the toolbar. Only then can we trigger the creation of one Class per Cluster. At that time, all nodes become associated with unique Classes named “[Factor_i]”, with i representing the identifier of the factor. Additionally, we are prompted to confirm that we wish to keep the node colors generated during clustering.
The Clusters are now saved, and the color-coding is formally associated with the nodes. A Clustering Report provides a formal summary of the new Factors and their associated Manifest variables.
Note that we use the following terms interchangeably: “derived concept,” “unobserved latent variable,” “hidden cause,” and “extracted factor.”
We now examine the robustness of the identified factors, i.e., how these factors respond to changes in sampling. This is particularly important for studies that are regularly repeated with new data, e.g., annual customer satisfaction surveys. Inevitably, survey samples will differ from year to year. As a result, machine learning will probably discover a somewhat different structure each time and, consequently, identify different clusters of nodes. It is important, therefore, to distinguish between a sampling artifact and a substantive change in the joint probability distribution. In the context of our example, the latter would reveal a structural change in consumer behavior.
We start the validation process via Main Menu > Tools > Cross-Validation > Variable Clustering > Data Perturbation.
This brings up the dialogue box shown below.
These settings specify that BayesiaLab will learn 100 networks with EQ and perform Variable Clustering on each of them, all while maintaining the constraint of a maximum of 5 nodes per cluster without any attenuation of the perturbation. Upon completion, we obtain a report panel, from which we initially select Variable Clustering Report.
The Variable Clustering Report consists primarily of two large tables. The first table in the report shows the cluster membership of each node in each network (only the first 12 columns are shown). Here, thanks to the colors, we can easily detect whether nodes remain clustered together between iterations or whether they “break up.”
The second table shows how frequently individual nodes are clustered together.
Clicking the Clustering Frequency Graph provides yet another visualization of the clustering patterns. The thickness of the lines is proportional to the frequency of nodes being in the same Cluster. Equally important for interpretation is the absence of lines between nodes. For instance, the absence of a line between Flowery and Modern says that they have never been clustered together in any of the 100 samples. If they were to cluster together in a future iteration with new survey data, it would probably reflect a structural change in the market rather than a data sampling artifact.
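The co-clustering frequencies behind this graph can be computed directly from the list of clusterings. The node names in the usage example below are illustrative, not taken from the actual report:

```python
from collections import Counter
from itertools import combinations

def coclustering_frequency(runs):
    """Fraction of runs in which each pair of nodes shares a cluster.

    runs: a list of clusterings, each given as a list of sets of node names.
    """
    counts = Counter()
    for clustering in runs:
        for cluster in clustering:
            # Count every unordered pair inside each cluster once per run.
            for pair in combinations(sorted(cluster), 2):
                counts[pair] += 1
    return {pair: c / len(runs) for pair, c in counts.items()}
```

A pair that never appears in the result dictionary has a co-clustering frequency of zero, which corresponds to the absence of a line in the Clustering Frequency Graph.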
As a first step, we need to exclude the node Purchase Intent, which will later serve as our Target Node. We do not want this node to become part of the structure that we will subsequently use for discovering hidden concepts. Likewise, we need to exclude the node Product, as it does not contain consumer feedback to be evaluated.
The upcoming series of steps is crucial. We now need to prepare a robust network on which we can later perform the clustering process. Given the importance, we recommend going through the full range of Unsupervised Learning algorithms and comparing the performance of each resulting network structure to select the best structure.
The objective is to increase our chances of finding the optimal network for our purposes. Given that the number of possible networks grows super-exponentially with the number of nodes, this is a significant challenge.
It may not be immediately apparent how such an astronomical number of networks could be possible. The illustration below shows how 3 nodes can be combined in 25 ways to form a network.
Needless to say, generating all 9×10^376 networks—based on 47 nodes—and then selecting the best one is completely intractable (for reference, it is estimated that there are between 10^78 and 10^82 atoms in the known, observable universe). So, an exhaustive search would only be feasible for a small subset of nodes.
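Both counts, the 25 networks for 3 nodes and the astronomical figure for 47 nodes, follow from Robinson's recurrence for the number of labeled directed acyclic graphs, which can be sketched as:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def dag_count(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming arc.
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * dag_count(n - k)
               for k in range(1, n + 1))
```

Here `dag_count(3)` yields the 25 networks shown in the illustration, and `dag_count(47)` is a number with more than 370 digits.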
As a result, we have to use heuristic search algorithms to explore a small part of this huge space in order to find a local optimum. However, a heuristic search algorithm does not guarantee to find the global optimum. This is why BayesiaLab offers a range of distinct learning algorithms, which all use different search spaces and search strategies:
Bayesian networks for Maximum Weight Spanning Tree and Taboo.
Essential Graphs for EQ and SopLEQ, i.e., graphs with edges and arcs representing classes of equivalent networks.
Order of nodes for Taboo Order.
This diversity increases the probability of finding a solution close to the global optimum. Given adequate time and resources for learning, we recommend employing the algorithms in the following sequence to find the best solution:
Maximum Weight Spanning Tree + Taboo
Taboo (“from scratch”)
EQ (“from scratch”) + Taboo
SopLEQ + Taboo
Taboo Order + Taboo
However, to keep the presentation compact, we only illustrate the learning steps for the EQ algorithm: Main Menu > Learning > Unsupervised Structural Learning > EQ.
The network generated by the EQ algorithm is shown below.
We now need to assess the quality of this network. Each of BayesiaLab’s Unsupervised Learning algorithms uses the Minimum Description Length (MDL) Score—internally—as the measure to optimize while searching for the best possible network. However, we can also employ the MDL Score for explicitly rating the quality of a network.
The Minimum Description Length (MDL) Score is a two-component score, which has to be minimized to obtain the best solution. It has been used traditionally in the artificial intelligence community for estimating the number of bits required for representing (1) a “model” and (2) “data given this model.”
In our machine-learning application, the “model” is a Bayesian network consisting of a graph and probability tables. The second term is the log-likelihood of the data given the model, which is inversely proportional to the probability of the observations (data) given the Bayesian network (model). More formally, we write this as:

MDL(B, D) = α × DL(B) + DL(D|B)

where:

α represents BayesiaLab’s Structural Coefficient (the default value is 1), a parameter that permits changing the weight of the structural part of the MDL Score (the lower the value of α, the greater the complexity of the resulting networks),

DL(B) is the number of bits to represent the Bayesian network (graph and probabilities), and

DL(D|B) is the number of bits representing the dataset given the Bayesian network B (likelihood of the data given the Bayesian network).
The minimum value for the first term, DL(B), is obtained with the simplest structure, i.e., the fully unconnected network, in which all variables are stated as independent. The minimum value for the second term, DL(D|B), is obtained with the fully connected network, i.e., a network corresponding to the analytical form of the joint probability distribution, in which no structural independencies are stated.
Thus, minimizing this score consists of finding the best trade-off between both terms. For a learning algorithm that starts with an unconnected network, the objective is to add an arc representing a probabilistic relationship if, and only if, this relationship reduces DL(D|B), the number of bits encoding the data, by a large enough amount to compensate for the corresponding increase in the size of the network representation, α × DL(B).
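A toy Python illustration of this trade-off, using two binary variables and hand-picked probabilities; the structural term DL(B) is only referenced in the comments, not computed:

```python
import math

def description_length_bits(data, prob):
    """DL(D|B): bits needed to encode the data under the model's probabilities."""
    return -sum(math.log2(prob(case)) for case in data)

# Toy domain: two binary variables X and Y that are strongly dependent.
data = [(0, 0)] * 40 + [(1, 1)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10

def p_unconnected(case):
    # Fully unconnected network: P(X)P(Y), with both marginals at 0.5.
    return 0.5 * 0.5

def p_with_arc(case):
    # Network with arc X -> Y; P(Y|X) estimated from the data (0.8 / 0.2).
    x, y = case
    return 0.5 * (0.8 if x == y else 0.2)

bits_unconnected = description_length_bits(data, p_unconnected)  # 200 bits
bits_with_arc = description_length_bits(data, p_with_arc)        # ~172 bits
# The arc is worth adding only if the ~28-bit saving in DL(D|B) outweighs
# the alpha-weighted growth of DL(B) from the extra arc and CPT entries.
```

The dependent model encodes the same 100 observations in roughly 28 fewer bits, which is the kind of saving that must be weighed against the larger network representation.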
We now use the MDL Score to compare the results of all learning algorithms. We can look up the MDL Score of the current network by pressing W while hovering the cursor over the Graph Panel. This brings up an info box that reports a number of measures, including the MDL Score, which is displayed here as the “Final Score.”
The MDL score can only be compared for networks with precisely the same representation of all variables, i.e., with the same discretization thresholds and the same data.
Alternatively, we can open up the Console via Main Menu > Options > Console > Open Console:
The Console maintains a kind of “log” that keeps track of the learning progress by recording the MDL Score (or “Network Score”) at each step of the learning process. Here, the “Final Score” marks the MDL Score of the current network, which is what we need to select the network with the lowest value.
The EQ algorithm produces a network with an MDL Score of 98,606.572. As it turns out, this performance is on par with all the other algorithms we considered, although we skip presenting the details in this chapter. Given this result, we can proceed with the EQ-learned network to the next step.
As a further safeguard against utilizing a sub-optimal network, BayesiaLab offers Data Perturbation, which is an algorithm that adds random noise (from within the interval [-1,1]) to the weight of each observation in the dataset.
In the context of our learning task, Data Perturbation can help us escape from local minima, which we could have encountered during learning. We start this algorithm by selecting Learning > Data Perturbation.
For Data Perturbation, we need to set a number of parameters. The additive noise is always generated from a Normal distribution with a mean of 0, but we must set the Initial Standard Deviation. A Decay Factor defines the exponential attenuation of the standard deviation with each iteration.
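One perturbation step under these settings can be sketched as follows; the clipping of the noise to [-1, 1] comes from the interval mentioned above, while flooring the weights at zero is our assumption for illustration:

```python
import random

def perturb_weights(weights, initial_sd, decay_factor, iteration):
    """Add zero-mean Gaussian noise, clipped to [-1, 1], to observation weights.

    The standard deviation attenuates exponentially: sd = initial_sd * decay^t.
    Weights are floored at 0 so no observation receives a negative weight
    (an assumption for this sketch).
    """
    sd = initial_sd * decay_factor ** iteration
    noisy = (w + max(-1.0, min(1.0, random.gauss(0.0, sd))) for w in weights)
    return [max(0.0, w) for w in noisy]
```

With a Decay Factor below 1, later iterations perturb the weights less and less, so learning gradually settles back toward the unperturbed dataset.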
Upon completing Data Perturbation, we see the newly learned network in the Graph Panel. Once again, we can retrieve the score by pressing W while hovering with the cursor over the Graph Panel or by looking it up in the Console. The score remains unchanged at 98,606.572. We can now be reasonably confident that we have found the optimal network given the original choice of discretization, i.e., the most compact representation of the joint probability distribution defined by the 47 manifest variables.
Based on the “final” network, we can proceed to the next step in our network-building process. We now introduce Purchase Intent, which has been excluded up to this point. Clicking this node while holding X renders it “un-excluded.” This makes Purchase Intent available for learning. Additionally, we designate Purchase Intent as Target Node by double-clicking the node while holding T.
Upon confirming these constraints, we start Unsupervised Learning to generate a network that includes the Factors and the Target Node. In this particular situation, we need to utilize Taboo Learning. It is the only algorithm in BayesiaLab that can learn a new structure on top of an existing network structure and simultaneously guarantee to keep Fixed Arcs unchanged (EQ can also be used for structural learning on top of an existing network, but as it searches in the space of Essential Graphs, there is no guarantee that the Fixed Arcs remain unchanged). This is important as the arcs linking the Factors and their Manifest variables are such Fixed Arcs. To distinguish them visually, Fixed Arcs appear as dotted lines in the network instead of the solid lines of “regular” arcs.
We start Taboo Learning from the main menu by selecting Main Menu > Learning > Unsupervised Structural Learning > Taboo and check the option Keep Network Structure in the Taboo Learning dialog box.
Upon completing the learning process, we obtain the network shown below.
Now we see how the manifest variables are “laddering up” to the Factors and how the Factors are related to each other. Most importantly, we can observe where the Purchase Intent node was attached to the network during the learning process. The structure conveys that Purchase Intent is only connected to [Factor_2], which is labeled with the Node Comment “Pleasure_(4).”
In order to perform optimization for a particular product, we need to open the network for that specific product. Networks for all products were automatically generated and saved during the Multi-Quadrant Analysis, so we need to open the network for the product of interest. The suffix in the file name reflects the Product.
To demonstrate the optimization process, we open the file that corresponds to Product 1. Structurally, this network is identical to the network learned from the entire dataset. However, the parameters of this network were estimated only on the basis of the observations associated with Product 1.
Now we have all the elements that are necessary for optimizing the Purchase Intent of Product 1:
A network that is specific to Product 1;
A set of driver variables, selected by excluding the non-driver variables via Costs;
Realistic scenarios, as determined by the Variation Domains of each driver variable.
With the above, we are now able to search for node values that optimize Purchase Intent.
Before we proceed, we need to explain what we mean by optimization. As all observations in this study are consumer perceptions, it is clear that we cannot manipulate them directly. Rather, this optimization aims to identify in which order the perfume maker should address these perceptions. Some consumer perceptions may relate to specific odoriferous compounds that a perfumer can modify; other perceptions may be influenced by marketing and branding initiatives. However, the precise mechanism of influencing consumer perceptions is not the subject of our discussion. From our perspective, the causes that could influence the perception are hidden. Thus, we have here a prototypical application of Soft Evidence, i.e., we assume that the simulated changes in the distribution of consumer perceptions originate in hidden causes (see Numerical Evidence in Chapter 7).
We need to explain the many options that must be set for Target Dynamic Profile. These options will reflect our objective of pursuing realistic sets of evidence:
In Profile Search Criterion, we specify that we want to optimize the mean value of the Target Node as opposed to any particular state or the difference between states.
Next, we specify under Criterion Optimization that the mean value of the Target Node is to be maximized. Furthermore, we check Take Into Account the Joint Probability. This weights any potential improvement in the mean value of the Target Node by the joint probability corresponding to the set of simulated evidence that generated this improvement. The joint probability of a simulated evidence scenario will be high if its probability distribution is close to the original probability distribution observed in the consumer population: the higher the joint probability, the closer the simulated scenario to the status quo of customer perception.
In practice, checking this option means that we prefer smaller improvements with a high joint probability over larger ones with a low joint probability: 0.146 × 26.9% = 0.0393 > 0.174 × 21.33% = 0.0371.
If all simulated values were within the constraints set in the Variation Editor, it would be better to increase the driver variable Spiced to a simulated value of 7 rather than 7.5, even though Purchase Intent would be higher for the latter value of Spiced. In other words, the “support” for E(Spiced)=7 is greater than for E(Spiced)=7.5, as more respondents are already in agreement with such a scenario. Once again, this is about pursuing achievable improvements rather than proposing pie-in-the-sky scenarios.
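In code, the comparison from this example reads:

```python
def weighted_improvement(delta_mean, joint_probability):
    """Improvement of the Target mean, weighted by the scenario's joint probability."""
    return delta_mean * joint_probability

# The two candidate scenarios from the text:
modest = weighted_improvement(0.146, 0.269)      # E(Spiced) = 7
ambitious = weighted_improvement(0.174, 0.2133)  # E(Spiced) = 7.5
# modest > ambitious: the more realistic scenario is preferred.
```

Even though the ambitious scenario promises a larger raw gain, its lower joint probability makes its weighted improvement smaller, so the modest scenario wins.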
In this example, so far, we have only used Costs for selecting the subset of driver variables. Additionally, we can utilize Costs in the original sense of the word in the optimization process. For instance, if we had information on the typical cost of improving a specific rating by one unit, we could enter such a value as cost. This could be a dollar value, or we could set the costs to reflect the relative effort required for the same amount of change, e.g., one unit, in each driver variable. For example, a marketing manager may know that it requires twice as much effort to change the perception of Feminine compared to changing the perception of Sweet. If we want to quantify such efforts by using Costs, we must ensure that all variables' costs share the same scale. For instance, if some drivers are measured in dollars and others are measured in terms of time spent in hours, we will need to convert hours to dollars.
In our study, we leave all the included driver variables at a Cost of 1, i.e., we assume that it requires the same effort for the same amount of change in any driver variable. Hence, we can leave the Utilize Evidence Cost unchecked (Not Observable nodes still remain excluded as driver variables).
Compute Only Prior Variations needs to remain unchecked as well. This option would be useful if we were interested in only computing the marginal effect of drivers. We would not want any cumulative effects or conditional variations given other drivers for that purpose.
Associate Evidence Scenario will save the identified sets of evidence for subsequent evaluation.
In this context, we reintroduce the Variations we saved earlier. We reason that the best-rated product with regard to a particular attribute represents a plausible upper limit for what any product could strive for in terms of improvement. This also means that a driver variable that has already achieved the best level will not be optimized further in this framework.
Clicking on Variations brings up the Variation Editor. By default, it shows variations in the amount of ±100% of the current mean.
To load the Variations we generated earlier through Multi-Quadrant Analysis, we click Import and select Absolute Variations from the pop-up window.
Now, we can open the previously saved file.
The Variation Editor now reflects the constraints. Any available expert knowledge can be applied here by entering new values for the Minimum Mean or Maximum Mean or by entering percent values for Positive Variations and Negative Variations.
Depending on the setting, the percentages are relative to (a) the Mean, (b) the Domain, or (c) the Progression Margin.
Selecting the Progression Margin is particularly useful as it automatically constrains the positive and negative variations in proportion to the gap from the Current Mean to the Maximum Mean and Minimum Mean values, respectively. In other words, it limits the improvement potential of a driver variable as its value approaches the maximum. It is a practical—albeit arbitrary—approach to prevent overly optimistic optimizations.
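The Progression Margin idea can be sketched as bounds proportional to the remaining gap toward each extreme; the exact formula BayesiaLab uses is an assumption here:

```python
def progression_margin_bounds(current, minimum, maximum, positive_pct, negative_pct):
    """Variation bounds proportional to the remaining gap toward each extreme."""
    upper = current + positive_pct * (maximum - current)
    lower = current - negative_pct * (current - minimum)
    return lower, upper

# A driver already close to its ceiling has little room left to improve:
near_max = progression_margin_bounds(9.5, 1.0, 10.0, 0.5, 0.5)   # upper bound 9.75
mid_range = progression_margin_bounds(5.0, 1.0, 10.0, 0.5, 0.5)  # upper bound 7.5
```

The driver sitting at 9.5 can only move up by 0.25, while the mid-range driver can move up by 2.5, which captures the intended shrinking of improvement potential near the maximum.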
Next, we select MinXEnt in the Search Method panel as the method for generating Soft Evidence. In terms of Intermediate Points, we set a value of 20. This means that BayesiaLab will simulate 22 values for each node, i.e., the minimum and maximum plus 20 intermediate values, all within the constraints set by the variations. This is particularly useful in the presence of non-linear effects.
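The resulting grid of simulated target means per node can be sketched as:

```python
def evidence_grid(minimum, maximum, intermediate_points=20):
    """Minimum and maximum plus the requested number of intermediate values.

    With 20 intermediate points, this yields the 22 simulated values per node
    mentioned in the text.
    """
    steps = intermediate_points + 1
    return [minimum + (maximum - minimum) * i / steps for i in range(steps + 1)]
```

Each of these 22 candidate means would then be turned into a full Soft Evidence distribution (e.g., via MinXEnt), which is what allows non-linear responses to be detected along the way.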
Within the Search Stop Criteria panel, the Maximum Size of Evidence specifies the maximum number of driver variables recommended as part of the optimization policy. Real-world considerations once again drive this setting. Although one could wish to bring all variables to their ideal level, a decision-maker may recognize that it is not plausible to pursue anything beyond the top 4 variables.
Alternatively, we can choose to terminate the optimization process once the joint probability of the simulated evidence drops below the specified Minimum Joint Probability.
The final option, Use the Automatic Stop Criterion, leaves it up to BayesiaLab to determine whether adding further evidence provides a worthwhile improvement for the Target Node.
Optimization Results
Once the optimization process concludes, we obtain a report window that contains a list of priorities: Personality, Fruity, Flowery, and Tenacious.
To explain the items in the report, we present a simplified and annotated version of the report below. Note that this report can be saved in HTML format for subsequent editing as a spreadsheet.
Most importantly, the Value/Mean column shows the successive improvement upon implementing each policy. From initially 3.58, the Purchase Intent improves to 3.86, which may seem like a fairly small step. However, the importance lies in the fact that this improvement is not based on Utopian thinking but rather on modest changes in consumer perception, well within the range of competitive performance.
As an alternative to interpreting the static report, we can examine each element in the list of priorities. To do so, we bring up all the Monitors of the nodes identified for optimization.
Selecting the first row in the table (Index=0) sets the evidence corresponding to the first priority, i.e., Personality. We can now see that the evidence we have set is a distribution rather than a single value. The small gray arrows indicate how the distribution of the evidence and the distributions of Purchase Intent, Fruity, Flowery, and Tenacious are all different from their prior marginal distributions. The changes to Fruity, Flowery, and Tenacious correspond to what is shown in the report in the column Value/Mean at T.
By selecting Index=1 we introduce a second set of evidence, i.e., the optimized distribution for Fruity.
Continuing with Index 2 and 3, we see that the improvements to Purchase Intent become smaller.
Bringing up all the remaining nodes would reveal any “collateral” changes due to setting multiple pieces of evidence.
The results tell us that for Product 1, a higher consumer rating of Personality would be associated with a higher Purchase Intent. Improving the perception of Personality might be a task for the marketing and advertising team. Similarly, a better consumer rating of Fruity would also be associated with greater Purchase Intent. A product manager could then interpret this and request a change to some ingredients. Our model tells us that if such changes in consumer ratings were brought about in the proposed order, a higher Purchase Intent would potentially be observed.
While we have only presented the results for Product 1, we want to highlight that the priorities are different for each product, even though they all share the same underlying PSEM structure. The recommendations from the Target Dynamic Profile of Product 11 are shown below.
This is an interesting example as it identifies that the JAR-type variable Intensity needs to be lowered to optimize Purchase Intent for Product 11.
It is important to reiterate that the sets of evidence we apply are not direct interventions in this domain. Hence, we are not performing causal inference. Instead, the sets of evidence we found help us prioritize our course of action for product and marketing initiatives.
We presented a complete workflow that generates a Probabilistic Structural Equation Model for key driver analysis and product optimization. The Bayesian networks paradigm turned out to be a practical platform for developing the model and its subsequent analysis all the way through optimization. With all the steps contained in BayesiaLab, we have a single, continuous line of reasoning from raw survey data to a final order of priorities for action.
For our purposes, we want to un-check All and then only check the class Factor.
In the resulting view, all the Manifest Nodes are transparent, so the relationship between the Factors becomes visually more prominent. By de-selecting the Manifest Nodes in this way, we also exclude them from the following visual analysis.
In line with our objective of learning about the key drivers in this domain, we proceed to analyze the association of the newly created Factors with Purchase Intent.
We initiate the visual analysis by selecting Main Menu > Analysis > Visual > Target Mean Analysis > Standard:
This brings up a dialog box with the options shown below. Given the context, selecting Mean for both the Target Node and the Nodes is appropriate.
Upon clicking Display Sensitivity Chart, the resulting plot shows the response curves of the Target Node as a function of the values of the Factors. This allows an immediate interpretation of the strength of the associations.
As an alternative to the visual analysis, we now run the Target Analysis Report: Main Menu > Analysis > Report > Target Analysis > Total Effects on Target.
Although “effect” carries a causal connotation, we must emphasize that we strictly examine associations. This means that we perform observational inference as we generate this report.
A new window opens up to present the report. Under Main Menu > Options > Settings > Reporting, we can check Display the Node Comments in Tables so that Node Comments appear in addition to the Node Names in all reports.
The Total Effect (TE) is estimated as the derivative of the Target Node with respect to the driver node under study.
This way of measuring the effect of the Factors on the Target Node assumes the relationships to be locally linear. Even though this is not always a correct assumption, it can be reasonable to simulate small changes in satisfaction levels.
As per the report, [Factor_2] provides the strongest Total Effect with a value of 0.399. This means that observing an increase of one unit in the level of the concept represented by [Factor_2] predicts a posterior probability distribution of Purchase Intent that has an expected value that is 0.399 higher compared to the marginal value.
The Standardized Total Effect (STE) is also displayed. It represents the Total Effect multiplied by the ratio of the standard deviation of the driver node and the standard deviation of the Target Node.
This means that Standardized Total Effect takes into account the “potential” of the driver under study.
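Expressed as formulas in code; the standard deviations in the usage lines are hypothetical, and only the Total Effect value of 0.399 comes from the report:

```python
def total_effect(delta_target_mean, delta_driver_mean):
    """TE: finite-difference estimate of dE[Target] / dE[Driver]."""
    return delta_target_mean / delta_driver_mean

def standardized_total_effect(te, sd_driver, sd_target):
    """STE = TE * (sigma_driver / sigma_target)."""
    return te * sd_driver / sd_target

# A one-unit driver shift raising the Target mean by 0.399 gives TE = 0.399;
# the standard deviations below are hypothetical, for illustration only.
te = total_effect(0.399, 1.0)
ste = standardized_total_effect(te, 1.2, 1.5)
```

A driver with a large standard deviation relative to the Target's thus gets its Total Effect scaled up, reflecting its greater room to move.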
In the report, the results are sorted by the Standardized Total Effect in descending order. This immediately highlights the order of importance of the Factors relative to the Target Node, Purchase Intent.
In the columns further to the right in the report, the results of independence tests between the nodes are reported:
Degree of Freedom: Indicates the degree of freedom between each driver node and the Target Node in the network.
p-value: the p-value is the probability of observing a value as extreme as the test statistic by chance.
If a dataset is associated with the network, as is the case here, the independence test, the degrees of freedom, and the p-value are also computed directly from the underlying data.
We repeat both the Target Mean Analysis and the Total Effects on Target report.
Not surprisingly, the Manifest Nodes show a similar pattern of association to that of the Factors. However, there is one important exception: the Manifest Node Intensity shows a nonlinear relationship with Purchase Intent. The curve for Intensity is shown with a gray line. Note that by hovering over a curve or a node name, BayesiaLab highlights the corresponding item in the legend or the plot.
Also, we can see that Intensity was recorded on a 1−5 scale rather than the 1−10 scale that applies to the other nodes. Intensity is a so-called “JAR” variable, i.e., a variable that has a “just-about-right” value. In the context of perfumes, this characteristic is obvious. A fragrance that is either too strong or too light is undesirable. Instead, there is a value somewhere in between that would make a fragrance most attractive. The JAR characteristic is prototypical for variables representing sensory dimensions, e.g., saltiness or sweetness.
This emphasizes the importance of visual analysis, as the nonlinearity goes unnoticed in the Total Effects on Target report. In fact, it drops almost to the bottom of the list in the report.
It turns out to be rather difficult to optimize a JAR-type variable at a population level. For example, increasing Intensity would reduce the number of consumers who find the fragrance too subtle. On the other hand, an increase in Intensity would presumably dismay some consumers who believed the original Intensity level to be appropriate.
Constraints via Costs
As this drivers' analysis model is intended for product optimization, we need to consider any possible real-world constraints that may limit our ability to optimize any of the drivers in this domain. For instance, a perfumer may know how to change the intensity of a perfume but may not know how to directly affect the perception of “pleasure.” In the original study, a number of such constraints were given.
In BayesiaLab, we can conveniently encode constraints via Costs, which is a Node Property. More specifically, we can declare any node as Not Observable, which — in this context — means that they cannot be considered with regard to optimization. Costs can be set by right-clicking on an individual node and then selecting Properties > Cost.
This brings up the Cost Editor for an individual node. By default, all nodes have a cost of 1.
Unchecking the box Cost or setting a value ≤0 results in the node becoming Not Observable.
Alternatively, we can bring up the Cost Editor for all nodes by right-clicking on the Graph Panel and then selecting Edit Costs from the contextual menu.
The Cost Editor presents the default values for all nodes.
Again, setting values to zero will make nodes Not Observable. Instead of applying this setting node by node, we can import a Cost Dictionary that defines the values for each node. An excerpt from the text file is shown below. The syntax is straightforward: Not Observable is represented by 0.
From within the Cost Editor, we can use the Import button to associate a Cost Dictionary. Alternatively, we can select Main Menu > Data > Associate Dictionary > Node > Costs from the main menu.
Furthermore, upon defining Costs, we can see that all Not Observable nodes are marked with a light purple background.
It is important to note that all Factors are also set to Not Observable in our example. In fact, we do have two options here:
The optimization can be done at the first level of the hierarchical model, i.e., using the Manifest variables;
The optimization can be performed at the second level of the model, i.e., using the Factors.
Most importantly, these two approaches cannot be combined as setting evidence on Factors will block information coming from Manifest variables. Formally declaring the Factors as Not Observable tells BayesiaLab to proceed with option #1. Indeed, we plan to perform optimization using the Manifest variables only.
The network we have analyzed thus far modeled Purchase Intent as a function of perceived perfume characteristics. It is important to point out that this model represents the entire domain of all 11 tested perfumes. However, it is reasonable to speculate that different perfumes have different drivers of Purchase Intent. Furthermore, for purposes of product optimization, we certainly need to look at the dynamics of each product individually.
BayesiaLab assists us in this task by means of Multi-Quadrant Analysis. This function can generate new networks as a function of a Breakout Node in an existing network. This is where the node Product comes into play, which has been excluded all this time. Our objective is to generate a set of networks that model the drivers of Purchase Intent for each perfume individually, as identified by the Product breakout variable.
We start the Multi-Quadrant Analysis by selecting Main Menu > Tools > Multi-Quadrant Analysis.
This brings up the dialog box, in which we need to set a number of options:
Firstly, the Breakout Variable must be set to Product to indicate that we want to generate a network for each state of Product. For Analysis, we have several options: We choose Total Effects to be consistent with the earlier analysis. Regarding the Learning Algorithm, we select Parameter Estimation. This choice becomes evident once the dataset representing the “overall market” is split into 11 product-specific subsets. Now, the number of available observations per product drops to only 120. Given that most of our variables have 5 states, learning a structure with a dataset that small would be challenging.
This also explains why we used the entire dataset to learn the PSEM structure, which all the products will share. However, using Parameter Estimation will ensure that the parameters, i.e., the probability tables of each network, will be estimated based on the subsets of records associated with each state of Product.
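The idea of re-estimating parameters per breakout state can be sketched in plain Python. This toy example is not BayesiaLab's implementation: it assumes a shared structure Fresh → Purchase Intent and simply re-estimates the conditional probability table from each product's subset of records.

```python
from collections import Counter, defaultdict

# Toy survey records: (product, fresh_rating, purchase_intent).
records = [
    ("P1", "high", "yes"), ("P1", "high", "yes"), ("P1", "low", "no"),
    ("P2", "low", "no"),   ("P2", "low", "yes"), ("P2", "high", "yes"),
]

def estimate_cpt(subset):
    """Estimate P(purchase_intent | fresh_rating) from one product's rows."""
    counts = defaultdict(Counter)
    for _, fresh, intent in subset:
        counts[fresh][intent] += 1
    return {f: {i: c / sum(ctr.values()) for i, c in ctr.items()}
            for f, ctr in counts.items()}

# The structure is shared; only the parameters vary per breakout state.
cpts = {p: estimate_cpt([r for r in records if r[0] == p])
        for p in {r[0] for r in records}}

print(cpts["P1"]["high"]["yes"])  # 1.0: both P1 "high" respondents said yes
```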
Among the Options, we check Regenerate Values. This recomputes, for each new network, the values associated with each state of the discretized nodes based on the respective subset of data.
There is no need to check Rediscretize Continuous Nodes because all discretized nodes share the same variation domain, and we required equal distance binning during the data import. However, we do recommend using this option if the variation domains are different between subsets in a study, e.g., sales volume in California versus Vermont. Without using the Rediscretize Continuous Nodes option, it could happen that all data points for sales in Vermont end up in the first bin, effectively transforming the variable into a constant.
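The Vermont scenario is easy to reproduce with a simplified stand-in for equal distance binning (all figures hypothetical):

```python
# Hypothetical sales volumes: California dwarfs Vermont.
california = [900, 1500, 2100, 2700, 3300]
vermont = [12, 25, 31, 44, 58]

def equal_width_cuts(values, k=5):
    """Inner cut points of k equal-width bins over the values' domain."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def bin_index(x, cuts):
    return sum(x > c for c in cuts)

# Bins derived from the pooled variation domain:
pooled_cuts = equal_width_cuts(california + vermont)
vt_pooled = {bin_index(x, pooled_cuts) for x in vermont}
print(vt_pooled)  # {0}: every Vermont record lands in the first bin

# Rediscretizing on the subset's own domain restores resolution:
vt_cuts = equal_width_cuts(vermont)
vt_own = {bin_index(x, vt_cuts) for x in vermont}
print(vt_own)  # {0, 1, 2, 3, 4}
```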
Furthermore, we do not check the option for Linearize Nodes’ Values either. This function reorders a node’s states so that its states’ values have a monotonically positive relationship with the values of the Target Node. Applying this transformation to the node Intensity would artificially increase its impact. It would incorrectly imply that it is possible to change a perfume in a way that simultaneously satisfies consumers who rated it too subtle and those who rated it too strong. Needless to say, this is impossible.
Finally, computing all Contributions will be helpful for interpreting each product-specific network.
Upon clicking OK, 11 networks are created and saved to the Output Directory defined in the dialog box. Each network is then analyzed with the specified Analysis method to produce the Multi-Quadrant Plot.
The x-value of each point indicates the mean value of the corresponding Manifest Node, as rated by those respondents who have evaluated Product 1; the position on the y-axis reflects the computed Total Effect.
From the contextual menu, we can choose Display Horizontal Scales and Display Vertical Scales, which provide the range of positions of the other products.
Using Horizontal Scales provides a quick overview of how the product under study is rated vis-à-vis other products. The Vertical Scales compare the importance of each dimension with respect to Purchase Intent. Finally, we can select the individual product to be displayed in the Multi-Quadrant Analysis window via the Contextual Menu.
Drawing a rectangle with the cursor zooms in on the specified area of the plot.
The meaning of the Horizontal Scales and Vertical Scales becomes apparent when hovering over any dot, as this brings up the position of the other (competitive) products with regard to the currently highlighted attribute.
This means, for instance, that Product 2 and Product 7 are rated lowest and highest, respectively, on the x-scale with regard to the variable Fresh. Regarding the Total Effect on Purchase Intent, Product 12 and Product 2 mark the bottom and top end, respectively (y-scale).
From a product management perspective, this suggests that for Product 1, with regard to the attribute Fresh, there is a large gap to the level of the best product, i.e., Product 7. So, one could interpret the variation from the status quo to the best level as “room for improvement” for Product 1.
On the other hand, as we can see below, the variables Personality, Original, and Feminine have a greater Total Effect on Purchase Intent. These relative positions will soon become relevant, as we must simultaneously consider the potential for improvement and the importance for optimizing Purchase Intent.
BayesiaLab’s Export Variations function allows us to save the variation domain for each driver, i.e., the minimum and maximum mean values observed across all products in the study.
Knowing these variations will be useful for generating realistic scenarios for subsequent optimization. However, what do we mean by “realistic”? Ultimately, only a subject matter expert can judge how realistic a scenario is. However, a good heuristic is whether or not a certain level is achieved by any product in the market. One could argue that the existence of a certain satisfaction level for some product means that such a level is not impossible to achieve and is, therefore, “realistic.”
Clicking the Export Variations button saves the Absolute Variations to a text file for subsequent use in optimization.
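As a sketch of what Export Variations records, consider hypothetical per-product mean ratings for the driver Fresh; the variation domain is simply the range of those means:

```python
# Hypothetical mean ratings of the attribute Fresh, one per product.
mean_fresh = {"P1": 3.1, "P2": 2.4, "P7": 4.2, "P12": 3.6}

# The variation domain of a driver is the range of the means
# observed across all products in the study.
variation = (min(mean_fresh.values()), max(mean_fresh.values()))
print(variation)  # (2.4, 4.2): P2 lowest, P7 highest
```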
After switching to the Validation Mode, we open the Monitors for all nodes. We can see five Cluster States for [Factor_0], labeled C1 through C5, as well as their marginal distribution. This distribution was previously represented as the “bubble size” of Clusters 1–5.
Using the Record Selector in the extended toolbar, we can now scroll through each record in the associated dataset. The Monitors of the Manifest nodes show the actual survey observations, while the Monitor of [Factor_0] shows the posterior probability distribution of the states given these observations. The state highlighted in light blue marks the modal value, i.e., the “winning” state, which is the imputed state now recorded in the dataset. Clicking the Stop icon closes this function.
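The imputation of the modal state can be illustrated with a hypothetical posterior distribution over the five Cluster States:

```python
# Hypothetical posterior over the Factor's five Cluster States
# given one respondent's Manifest observations.
posterior = {"C1": 0.08, "C2": 0.61, "C3": 0.17, "C4": 0.09, "C5": 0.05}

# The "winning" (modal) state is what gets written into the dataset.
imputed = max(posterior, key=posterior.get)
print(imputed)  # C2
```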
The Dendrogram allows us to review the linkage of nodes within variable clusters. It can be activated via the corresponding icon in the extended menu. The lengths of the branches in the Dendrogram are proportional to the Arc Force between clusters.
Mapping offers an intuitive alternative to the Dendrogram for examining the just-discovered cluster structure. It can be activated via the Mapping button in the menu bar.
The Classes icon in the lower right-hand corner of the window confirms that Classes have been created corresponding to the Factors. This concludes Step 2, and we formally close Variable Clustering via the stop icon on the extended toolbar.
Beyond our choice with regard to the number of Clusters, we also have the option of using our domain knowledge to modify which Manifest Nodes belong to specific Factors. This can be done by right-clicking on the Graph Panel and selecting Edit Classes, and then Modify from the Contextual Menu. Alternatively, we can click the Classes icon . In our example, however, we show the Class Editor just for reference, as we keep all the original variable assignments in place.
We can exclude nodes by selecting the Node Exclusion Mode and then clicking on the to-be-excluded node.
Pressing P and then clicking the Best-Fit icon provides an interpretable view of the network. Additionally, rotating the network graph with the Rotate Left and Rotate Right buttons can help set a suitable view.
On this basis, we now switch to the Validation Mode. Instead of examining individual nodes, however, we proceed directly to Variable Clustering.
Looking for an SEM-type network structure stipulates that Manifest variables be connected exclusively to the Factors and that all the connections with Purchase Intent must go through the factors. We have already imposed this constraint by setting the option Forbid New Relations with Manifest Variables in the Multiple Clustering dialog box. This created so-called Forbidden Arcs, which prevent learning algorithms from creating new arcs between the specified nodes. BayesiaLab indicates the presence of Forbidden Arcs with an icon in the lower right-hand corner of the Graph Panel window. Clicking on the icon brings up the Forbidden Arc Editor, which allows us to review the currently set constraints. We see that the nodes belonging to the Class Manifest must not have any links to any other nodes, i.e., both directions are “forbidden.”
As before, we also try to improve the quality of this network by using the Data Perturbation algorithm.
As it turns out, this algorithm allowed us to escape from a local optimum and returned a final network with a lower MDL score. Using Automatic Layout (P) and turning on Node Comments, we can quickly transform this network into a more interpretable format.
While BayesiaLab offers a number of optimization methods, Target Dynamic Profile is appropriate here. We start it by selecting Main Menu > Analysis > Report > Target Analysis > Target Dynamic Profile.
The setting of Search Methods is critically important for the optimization task. We need to define how to search for sets of evidence. Using Hard Evidence means that we would exclusively try out sets of evidence consisting of nodes with one state set to 100%. This would imply that we simulate a condition in which all consumers perfectly agree with regard to some ratings. Needless to say, this would be utterly unrealistic. Instead, we will explore sets of evidence consisting of distributions for each node by modifying their mean values as Soft Evidence. More precisely, we use the MinXEnt method to generate such evidence.
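The principle behind MinXEnt evidence can be sketched as follows: among all distributions that achieve a requested mean, select the one minimizing the cross-entropy to the prior, which amounts to exponentially tilting the prior and solving for the tilt parameter. This is a conceptual sketch, not BayesiaLab's implementation; the 5-point rating scale and the prior are hypothetical.

```python
import math

def minxent_evidence(prior, values, target_mean, tol=1e-9):
    """Tilt `prior` to q_i ~ p_i * exp(lam * x_i) so that sum(q_i * x_i)
    equals target_mean. Among all distributions with that mean, q minimizes
    the KL divergence to the prior (the minimum cross-entropy principle)."""
    def mean_for(lam):
        w = [p * math.exp(lam * x) for p, x in zip(prior, values)]
        z = sum(w)
        return sum(wi * x for wi, x in zip(w, values)) / z

    lo, hi = -50.0, 50.0  # mean_for is monotonically increasing in lam
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [p * math.exp(lam * x) for p, x in zip(prior, values)]
    z = sum(w)
    return [wi / z for wi in w]

# Shift a 5-point rating scale's mean from 3.0 up to 3.5 as soft evidence.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]                 # mean 3.0
q = minxent_evidence(prior, [1, 2, 3, 4, 5], 3.5)
new_mean = sum(qi * x for qi, x in zip(q, [1, 2, 3, 4, 5]))
print(round(new_mean, 6))  # 3.5
```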
Then, we retrieve the individual steps by right-clicking on the Evidence Scenario icon in the lower right-hand corner of the main window.
Our Probabilistic Structural Equation Model is now complete, and we can use it to perform the analysis part of this exercise, namely to find out what “drives” Purchase Intent. We return to the Validation Mode.
To understand the relationship between the factors and Purchase Intent, we want to “tune out” all Manifest variables for now. We can do so by right-clicking the Classes icon in the bottom right corner of the Graph Panel window. This brings up a list of all Classes. By default, all are checked and thus visible.
We return to the Validation Mode, in which we can use two approaches to learn about the relationships between Factors and the Target Node: we first perform a visual analysis and then generate a report in table format.
The Total Effect of a driver node X on the Target Node T represents the change in the mean of T associated with—and not necessarily caused by—a small modification of the mean of X. The Total Effect is the ratio of these two changes: TE = δE(T) / δE(X).
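With hypothetical numbers, the computation reduces to a simple ratio:

```python
# Hypothetical means before and after a small soft-evidence shift of a driver.
baseline_driver, baseline_target = 3.00, 4.10
shifted_driver, shifted_target = 3.10, 4.16

# Total Effect: change in the Target's mean per unit change in the driver's mean.
total_effect = (shifted_target - baseline_target) / (shifted_driver - baseline_driver)
print(round(total_effect, 3))  # 0.6
```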
Chi-Square (χ²) test or G-test: The independence test is computed on the basis of the network between each driver node and the target variable. It is possible to change the type of independence test from the Chi-Square (χ²) test to the G-test via Main Menu > Options > Settings > Statistical Tools.
For overall interpretation purposes, looking at Factor-level drivers can be illuminating. Often, it provides a useful big-picture view of the domain. However, we need to consider the Manifest-level drivers to identify specific product actions. As pointed out earlier, the Factor-level drivers only exist as theoretical constructs, which cannot be directly observed in data. As a result, changing the Factor nodes requires manipulating the underlying Manifest nodes. For this reason, we now switch back our view of the network in order to only consider the Manifest nodes in the analysis. We do that by right-clicking the Classes icon in the bottom right corner of the Graph Panel window. This brings up the list of all Classes, of which we only check the Class Manifest. Now, all Factors are translucent and excluded from the analysis.
Upon import, the Node Editor reflects the new values, and the presence of non-default values for costs is indicated by the Cost icon in the lower right-hand corner of the Graph Panel.
| Number of Nodes | Number of Possible Networks |
|---|---|
| 1 | 1 |
| 2 | 3 |
| 3 | 25 |
| 4 | 543 |
| 5 | 29,281 |
| 6 | 3.7815×10^6 |
| 7 | 1.13878×10^9 |
| 8 | 7.83702×10^11 |
| 9 | 1.21344×10^15 |
| 10 | 4.1751×10^18 |
| ... | ... |
| 47 | 8.98454×10^376 |
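These counts follow Robinson's recursion for the number of labeled directed acyclic graphs. As a sketch (not a BayesiaLab feature), the first few entries can be verified in Python:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def dag_count(n):
    """Robinson's recursion for the number of labeled DAGs on n nodes."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * dag_count(n - k)
               for k in range(1, n + 1))

print([dag_count(n) for n in range(1, 7)])
# [1, 3, 25, 543, 29281, 3781503]
```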
Secondly, data can be Missing at Random (MAR). Here, the missingness of data depends on observed variables. A brief narrative shall provide some intuition for the MAR condition: in a national survey of small business owners about the business climate, there is a question about the local cost of energy. Chances are that the owner of a business that uses little electricity, e.g., a yoga studio, may not know the current cost of 1 kWh of electric energy and could not answer that question, thus producing a missing value in the questionnaire. On the other hand, the owner of an energy-intensive business, e.g., an electroplating shop, would presumably be keenly aware of the electricity price and able to respond accordingly. In this story, the probability of non-response is presumably inversely proportional to the energy consumption of the business.
In the subnetwork shown below, X3_obs is the observed variable that causes the missingness, e.g., the energy consumption in our story. X2_obs would be the stated price of energy if known. X2 would represent the actual price of energy in our narrative. Indeed, from the researcher’s point of view, the actual cost of energy in each local market and for each electricity customer is hidden.
To simulate this network, we need to define its parameters, i.e., the quantitative part of the network structure:
X2 is a continuous variable with values between 0 and 1. Here, too, we have arbitrarily defined a Normal distribution for modeling the DGP.
MAR_X2 is a boolean variable with one parent, which specifies that the missingness probability depends directly on the fully observed variable X3_obs. The exact values are not important here, as we only need to know that the probabilities of missingness are inversely proportional to the values of X3_obs:
X2_obs has two parents, i.e., the data-generating variable X2 and the missingness mechanism MAR_X2. The conditional probability distribution of X2_obs can be described by the following deterministic rule: IF MAR_X2 THEN X2_obs=? ELSE X2_obs=X2
Given the fully specified network, we can now simulate the impact of the missingness mechanism on the observable variable X2_obs.
As the above screenshot shows, the mean and standard deviation in the Monitor of X2_obs indicate that the distribution of the observed values of X2 differs significantly from the original distribution, leading to an overestimation of X2 in this example. We can simulate the deletion of incomplete observations by setting negative evidence on “?” in the Monitor of X2_obs (green arrow labeled “Delete”). The simulated distribution of X2_obs (right) clearly differs from the one of X2 (left).
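The MAR story can be simulated directly. In this sketch, the distributions are hypothetical rather than the reference network's exact parameters: the price X2 is correlated with the fully observed X3, and the probability of missingness is inversely proportional to X3. Listwise deletion then biases the estimated mean upward, consistent with the overestimation described above.

```python
import random

random.seed(0)

n = 200_000
samples = []
for _ in range(n):
    x3 = random.random()                      # fully observed cause (energy use)
    x2 = 0.5 * x3 + random.gauss(0.25, 0.05)  # price, correlated with x3
    missing = random.random() < (1 - x3)      # inversely proportional to x3
    samples.append((x2, missing))

true_mean = sum(x for x, _ in samples) / n
observed = [x for x, m in samples if not m]
obs_mean = sum(observed) / len(observed)
print(round(true_mean, 2), round(obs_mean, 2))
```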
Missing Completely at Random (MCAR) means that the missingness mechanism is entirely independent of all other variables. In our causal Bayesian network, we encode this independent mechanism with a boolean variable named MCAR_X1.
Furthermore, we assume that there is a variable X1 that represents the original data-generating process. This variable, however, is hidden, so we cannot observe it directly. Rather, we can only observe X1 via the variable X1_obs, which is a “clone” of X1 but with one additional state, “?”, which indicates that the value of X1 is not observed.
The Bayesian network shown below is a subnetwork of the complete network presented above. The behavior of the three variables we just described is encoded in this subnetwork.
In addition to this qualitative structure, we need to describe the quantitative part, i.e., the parameters of this subnetwork, including the missingness mechanism and the observable variable:
X1 is a continuous variable with values between 0 and 1. We have arbitrarily defined a Normal distribution for modeling the DGP.
MCAR_X1 is a boolean variable without any parent nodes. This means that MCAR_X1 is independent of all variables, whether hidden or not. Its probability of being true is 10%.
X1_obs has two parents: the data-generating variable X1 and the missingness mechanism MCAR_X1. The following deterministic rule defines the conditional probability distribution of X1_obs: IF MCAR_X1 THEN X1_obs=? ELSE X1_obs=X1
Now that our causal Bayesian network is fully specified, we can evaluate the impact of the missingness mechanism on the observable variable X1_obs. Given that we have created a complete model of this small domain, we automatically have perfect knowledge of the distribution of X1. Thus, we can directly compare X1 and X1_obs via the Monitors.
We see that X1 (left) and X1_obs (center) have the same mean and the same standard deviation. This suggests that the remaining observations in X1_obs (center) are not different from the non-missing cases in X1 (left). The only difference is that X1_obs (center) has one additional state (“?”) for missing values, representing 10% of the observations. Thus, deleting the missing observations of an MCAR variable should not bias the estimation of its distribution. In BayesiaLab, we can simulate this assumption by setting negative evidence on “?” (green arrow labeled “Delete”). As we can see, the distribution of X1_obs (right) is now exactly the same as the one of X1 (left).
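The benign nature of MCAR deletion can be verified with a quick simulation (hypothetical Normal DGP; 10% missingness as in the text):

```python
import random

random.seed(1)

n = 200_000
x1 = [random.gauss(0.5, 0.15) for _ in range(n)]

# MCAR: each value is dropped with 10% probability, independently of everything.
observed = [x for x in x1 if random.random() >= 0.10]

full_mean = sum(x1) / n
obs_mean = sum(observed) / len(observed)

# Deletion leaves the distribution essentially unchanged:
print(abs(full_mean - obs_mean) < 0.005)  # True
```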
Under real-world conditions, however, we typically do not know whether the missing values in our dataset were generated completely at random (MCAR). This would be a strong assumption to make, and it is generally not testable. As a result, we can rarely rely on this fairly benign condition of missingness and, thus, should never be too confident in deleting missing observations.
Missing values are encountered in virtually all real-world data collection processes. Missing values can result from non-responses in surveys, poor record-keeping, server outages, attrition in longitudinal surveys, faulty sensors of a measuring device, etc. Despite the intuitive nature of this problem and the fact that almost all quantitative studies are affected by it, applied researchers have given it remarkably little attention in practice. Burton and Altman (2004) state this predicament very forcefully in the context of cancer research: “We are concerned that very few authors have considered the impact of missing covariate data; it seems that missing data is generally either not recognized as an issue or considered a nuisance that it is best hidden.”
Given the abundance of “big data” in the field of analytics, missing values processing may not be a particularly fashionable topic. After all, who cares about a few missing data points if there are many more terabytes of observations waiting to be processed? One could be tempted to analyze complete data only and remove all incomplete observations. Regardless of how many more complete observations might be available, this naive approach would almost certainly lead to misleading interpretations or create a false sense of confidence in one’s findings.
Koller and Friedman (2009) provide an example of a hypothetical medical trial that evaluates the efficacy of a drug. In this trial, patients can drop out, in which case their results are not recorded. If patients withdraw at random, there is no problem ignoring the corresponding observations. On the other hand, if patients prematurely quit the trial because the drug does not seem to help them, discarding these observations introduces a strong bias in the efficacy evaluation. As this example illustrates, it is important to understand the mechanism that produces the missingness, i.e. the conditions under which some values are not observed.
Missing values processing, beyond the naive ad hoc approaches, can be a demanding task, both methodologically and computationally. Traditionally, the process of specifying an imputation model has been a scientific modeling effort on its own, and few non-statisticians dared to venture into this specialized field (van Buuren, 2007).
With Bayesian networks and BayesiaLab, handling missing values properly now becomes feasible for researchers who might otherwise not attempt to deal with missing values beyond the ad hoc approaches. Responding to Burton and Altman’s serious concern, we believe that the presented methods can help missing values processing become an integral part of more research projects in the future.
We have already mentioned missing values processing several times in earlier chapters, as it is one of the steps in the Data Import Wizard. However, we have delayed a formal discussion of the topic until now because the recommended missing values processing methods are tightly integrated with BayesiaLab’s learning algorithms. Indeed, all of BayesiaLab’s core functions for learning and inference are prerequisites for successfully applying missing values processing. With all the building blocks in place, we can now explore this subject in detail.
Four principal types of missing values are typically encountered in research:
Missing Not at Random (MNAR) or Not Missing at Random (NMAR). Both of these equivalent expressions, MNAR and NMAR, appear equally frequently in the literature. We use MNAR throughout this chapter.
We can explain each of these conditions with the following causal Bayesian network. It illustrates:
the data-generating process (DGP);
the mechanism that causes the missingness;
the observable variables that contain the missing values.
We use this reference network to simulate all missingness conditions and generate a test dataset from it for subsequent evaluation:
With such a test dataset, the problems associated with missingness become very obvious. Given that we have specifically encoded all types of missingness mechanisms in the reference network, the resulting test dataset is a kind of worst-case scenario, which is ideal for testing purposes.
However, before we can apply any missing values processing methods, we need to bring the test dataset into BayesiaLab. While the Data Import Wizard is explained in detail in the BayesiaLab User Guide (see Open Data Source), we quickly summarize the steps in the following sub-topics:
Once imported into BayesiaLab, we attempt to recover the reference network's original distributions from the test dataset.
In a typical data analysis workflow in BayesiaLab, a researcher encounters Missing Values Processing in Step 3 of the Data Import Wizard, i.e., when importing a dataset. So, we evaluate each available Missing Values Processing method in the context of a prototypical workflow.
Each of the above methods yields an imputed dataset. Now we can examine how well the imputed datasets match the distributions from the reference model. We explore the advantages and disadvantages of each method on this basis.
Ultimately, this assessment of approaches is meant as a guide for choosing a Missing Values Processing method as a function of what we know about the data-generating process and the missingness mechanism in particular.
Before we even present results, we need to warn you that some of the to-be-evaluated methods, such as Filter (Listwise/Casewise Deletion) and Replace By (Mean/Modal Imputation), are not recommended for default use. We still include them for two reasons: first, they are almost universally used in statistical analysis, and second, under certain circumstances, they can be safe to use. Regardless of their suitability for research, they can help us understand the challenges of missing values processing.
Our test dataset consisting of 10,000 records was saved as a CSV file, so we start the import process via Main Menu > Data > Open Data Source > Text File.
Note the missing values in columns X1_obs, X2_obs, and X4_obs in the Data Panel. Column X5_obs features Filtered Values, which are marked with an asterisk (*).
The next screen brings us to the core task of selecting the Missing Values Processing method. In the screenshot, the default option Structural EM is pre-selected, but we will explore all options systematically from the top. The default method can be specified under Main Menu > Window > Preferences > Data > Import & Associate > Missing & Filtered Values.
We explain and evaluate each Missing Values Processing method separately. Please select the topic below or open it in the navigation bar.
To begin this exercise, we use BayesiaLab to produce the test data that we will later use for evaluating the Missing Values Processing methods.
We can directly generate data according to the joint probability distribution encoded by the Reference Network: Main Menu > Data > Generate Data.
Next, we must specify whether to generate this data internally or externally. For now, we generate the data internally, which means that we associate data points with all nodes. This includes missing values and Filtered Values according to the reference network.
For the Number of Examples (i.e., cases or records), we set 10,000.
Now that this data exists inside BayesiaLab, we need to export it, so we can truly start “from scratch” with the test dataset. Also, for the sake of realism, we only want to make the observable variables available, rather than all nodes. We first select the nodes X1_obs through X5_obs and then select Main Menu > Data > Save Data.
Next, we confirm that we only want to save the Selected Nodes, i.e., the observable variables.
Upon specifying a file name and saving the file, the export task is complete.
A quick look at the CSV file confirms that the newly generated data contain missing and Filtered Values, as indicated with question marks (?) and asterisks (*), respectively.
Now that we have produced a test dataset with all types of missingness, we set aside our reference model and start “from scratch.” We approach this dataset as if we were seeing it for the first time, without any assumptions and without any background knowledge. This provides us with a suitable test case for BayesiaLab’s range of missing values processing methods.
For instance, in a hotel guest survey, a question about one’s satisfaction with the hotel swimming pool cannot be answered if the hotel property does not have a swimming pool. This question is not applicable. The absence of a swimming pool rating for this hotel would not be a missing value. On the other hand, for a hotel with a swimming pool, the absence of an observation would be a missing value.
In BayesiaLab, an additional state, marked with a chequered icon, is added to this type of variable in order to denote Filtered Values (BayesiaLab’s learning algorithms implement a kind of local selection for excluding the observations with Filtered Values while estimating the probabilistic relationships). The following illustration shows an example of a network including Filtered Values.
Once again, we must describe the parameters of the subnetwork, including the Filtered Values mechanism and the observable variable:
Filter_X5 is a boolean variable with one parent, which specifies that it depends on the hidden variable X4. Here, X5 becomes a Filtered Value if X4 is greater than 0.7.
X5 is a continuous variable with values between 0 and 1. It has two parents, X4 and the Filtered Values mechanism: IF Filter_X5 THEN X5=Filtered Value ELSE X5=f(X4)
X5_obs is a pure clone of X5, i.e., X5 is fully observed: X5_obs=X5
For the sake of completeness, we present the Monitors of X5 (left) and X5_obs (right):
Missing Not at Random (MNAR) refers to situations in which the missingness of a variable depends on hidden causes (unobserved variables), such as the data-generating variable itself. This condition is depicted in the subnetwork below.
An example of the MNAR condition would be a hypothetical telephone survey about alcohol consumption. Heavy drinkers might decline to provide an answer out of fear of embarrassment. On the other hand, survey participants who drink very little or nothing at all might readily report their actual drinking habits. As a result, the missingness is a function of the very variable in which we are interested.
In order to proceed to simulation, we need to specify the parameters of the missingness mechanism and the observable variable:
X4 is a continuous variable with values between 0 and 1, and a Normal distribution models the DGP.
MNAR_X4 is a boolean variable with one parent, which specifies that the missingness probability depends directly on the hidden variable X4. The exact values are not important here, as we only need to know that the probabilities of missingness are proportional to the values of X4:
X4_obs has two parents, i.e., the data-generating variable X4 and the missingness mechanism MNAR_X4. The following deterministic rule defines the conditional probability distribution: IF MNAR_X4 THEN X4_obs=? ELSE X4_obs=X4
The impact of the mechanism of the missing value becomes apparent as we compare the Monitors of the network side by side.
As the above screenshot shows, the mean and standard deviation in the Monitor of X4_obs (center column) indicate that the distribution of the observed values of X4 differs significantly from the original distribution (left column), leading to an underestimation of X4 in this example. We can simulate the deletion of incomplete observations by setting negative evidence on “?” (green arrow labeled “Delete”). The simulated distribution of X4_obs (right column) indeed differs from the one of X4 (left column).
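A small simulation reproduces this underestimation (hypothetical uniform DGP; missingness probability proportional to X4 itself):

```python
import random

random.seed(2)

n = 200_000
pairs = []
for _ in range(n):
    x4 = random.random()             # the hidden data-generating variable
    missing = random.random() < x4   # missingness proportional to x4 itself
    pairs.append((x4, missing))

true_mean = sum(x for x, _ in pairs) / n
observed = [x for x, m in pairs if not m]
obs_mean = sum(observed) / len(observed)
print(obs_mean < true_mean)  # True: listwise deletion underestimates the mean
```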
We show the first two steps of the Data Import Wizard only for reference, as their options have already been discussed in previous chapters.
The next step of the Data Import Wizard requires no further input, but we can review the statistics provided in the Information Panel: we have 5,547 missing values (=11.09% of all cells in the Data Panel) and 1,364 Filtered Values (=2.73%).
The Database icon signals that a dataset is now associated with the network. Additionally, we can see the number of cases in the database at the top of the Monitor Panel.
There is a fourth type of missingness, which is less often mentioned in the literature. In BayesiaLab, we refer to missing data of this kind as Filtered Values. In fact, Filtered Values are technically not missing at all. Rather, Filtered Values are values that do not exist in the first place. Clearly, something nonexistent cannot become missing due to a missingness mechanism.
Conceptually, Filtered Values are quite similar to missing values, as Filtered Values usually depend on other variables in the dataset, too, which may or may not be fully observed. However, Filtered Values should never be processed as missing values. In our example, it is certainly not reasonable to impute a value for the swimming pool rating if there is no swimming pool. Rather, a Filtered Value should be considered a special type of observation.
Dynamic Imputation is the first of a range of methods that take advantage of the structural learning algorithms available in BayesiaLab.
Like Infer — Static Imputation, Dynamic Imputation is probabilistic; imputed values are drawn from distributions. However, unlike Infer — Static Imputation, Dynamic Imputation does not only perform imputation once but rather whenever the current model is modified, i.e., after each arc addition, deletion, and reversal during structural learning. This way, Dynamic Imputation always uses the latest network structure for updating the distributions from which the imputed values are drawn.
Upon completion of the data import, the resulting unconnected network initially has exactly the same distributions as the ones we would have obtained with Static Imputation. In both cases, imputation is only based on marginal distributions. With Dynamic Imputation, however, the imputation quality gradually improves during learning as the structure becomes more representative of the data-generating process. For example, a correct estimation of the MAR variables is possible once the network contains the relationships that explain the missingness mechanisms.
Dynamic Imputation might also improve the estimation of MNAR variables if structural learning finds relationships with proxies of hidden variables that are part of the missingness mechanisms.
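The intuition behind this improvement can be sketched numerically outside of BayesiaLab. In the following toy example (all names and parameters are assumptions for illustration; a simple linear regression stands in for the learned network), X2 is MAR because its missingness is driven by the observed X1. Marginal draws, as in Static Imputation, inherit the bias of the observed values, whereas draws from the conditional distribution of X2 given X1 recover the correct mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic MAR example: X2 depends on X1, and X2 is missing whenever X1 is
# large, so the missingness mechanism is fully explained by the observed X1.
n = 20000
x1 = rng.normal(0.0, 1.0, n)
x2 = x1 + rng.normal(0.0, 1.0, n)
x2_obs = x2.copy()
x2_obs[x1 > 0.5] = np.nan

miss = np.isnan(x2_obs)

# Pass 1 -- unconnected network: draw imputations from the marginal
# distribution of the observed X2 values (as Static Imputation would).
x2_static = x2_obs.copy()
x2_static[miss] = rng.choice(x2_obs[~miss], miss.sum())

# Pass 2 -- the model now contains the X1 -> X2 relationship: draw the
# imputations from the conditional distribution of X2 given X1.
slope, icpt = np.polyfit(x1[~miss], x2_obs[~miss], 1)
resid_sd = np.std(x2_obs[~miss] - (slope * x1[~miss] + icpt))
x2_dynamic = x2_obs.copy()
x2_dynamic[miss] = slope * x1[miss] + icpt + rng.normal(0, resid_sd, miss.sum())

# True mean vs. marginal-draw mean vs. conditional-draw mean.
print(round(x2.mean(), 2), round(x2_static.mean(), 2), round(x2_dynamic.mean(), 2))
```

The marginal draws underestimate the mean of X2, while the conditional draws, which exploit the X1 → X2 relationship, do not.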
The question marks associated with X1_obs, X2_obs, and X4_obs confirm that the missingness is still present, even though the observations have been internally imputed.
On the basis of this unconnected network, we can perform structural learning. We select Main Menu > Learning > Unsupervised Structural Learning > Taboo.
While the network only takes a few moments to learn, we notice that it is somewhat slower compared to what we would have observed using a non-dynamic missing values processing method, e.g., Filter (Listwise/Casewise Deletion), Replace By (Mean/Modal Imputation), or Infer — Static Imputation. For our small example, the additional computation time requirement is immaterial. However, the computational cost increases with the number of variables in the network, the number of missing values, and, most importantly, the complexity of the network. As a result, Dynamic Imputation can slow down the learning process significantly.
The following screenshot reports the performance of Dynamic Imputation. The distributions show a substantial improvement compared to all the other methods we have discussed so far. As expected, X2_obs is now correctly estimated, and even the distribution estimate of the difficult MNAR variable X4_obs improves: its mean value is now underestimated far less.
Static Imputation resembles the Replace By (Mean/Modal Imputation) method but differs in three important aspects:
The buttons under Infer are available whenever a variable with missing values is selected in the Data Panel.
While Replace By (Mean/Modal Imputation) is deterministic, Static Imputation performs random draws from the marginal distributions of the observed data and saves these randomly drawn values as “placeholder values.”
The imputation is only performed internally, and BayesiaLab still “remembers” exactly which observations are missing.
Whereas Replace By (Mean/Modal Imputation) can be applied to individual variables, any of the options under Infer apply to all variables with missing values, with the exception of those that have already been processed by Filter (Listwise/Casewise Deletion) or Replace By (Mean/Modal Imputation).
Although this probabilistic imputation method is not optimal at the observation/individual level (it is not the rational decision for minimizing the prediction error), it is optimal at the dataset/population level.
As illustrated below, drawing the imputed values from the current distribution keeps the pre- and post-processing distributions of the variables the same. As a result, Static Imputation returns distributions that match the ones produced by Filter (Listwise/Casewise Deletion) but without deleting any observations. As no records are discarded, Static Imputation does not introduce any additional biases. However, the distributions of X2 (MAR) and X4 (MNAR) remain strongly biased.
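The distribution-preserving property of this kind of probabilistic imputation is easy to verify numerically. The following sketch (an illustration of the principle, not BayesiaLab's internal implementation) fills MCAR gaps with random draws from the observed marginal and compares the moments before and after:

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 observations with ~10% MCAR missingness.
x = rng.normal(10.0, 2.0, 10000)
x_obs = x.copy()
x_obs[rng.random(10000) < 0.1] = np.nan
miss = np.isnan(x_obs)

# Static Imputation (sketch): fill each gap with a random draw from the
# marginal distribution of the observed values ("placeholder values").
x_imp = x_obs.copy()
x_imp[miss] = rng.choice(x_obs[~miss], miss.sum())

# Both the mean and the standard deviation survive the processing.
print(round(np.nanmean(x_obs), 2), round(x_imp.mean(), 2))
print(round(np.nanstd(x_obs), 2), round(x_imp.std(), 2))
```

Unlike mean imputation, the random draws leave the spread of the distribution intact, which is precisely why Static Imputation does not distort the dataset-level picture.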
BayesiaLab’s Filter method is generally known as “listwise deletion” or “casewise deletion” in the field of statistics. It is the first option listed in Step 3 of the Data Import Wizard. It represents the simplest approach to dealing with missing values, and it is presumably the most commonly used one, too. This method deletes any record that contains a missing value in the specified variables.
The Filter method is not to be confused with Filtered Values.
The screenshot below shows Filter applied to X1_obs only. Given this selection, the Number of Rows, i.e., the number of cases or records in the dataset, drops from the original 10,000 to 8,950. Note that Filter can be applied variable by variable. Thus, it is possible to apply Filter to a subset of variables only and use other methods for the remaining variables.
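The same operation can be mimicked with a small pandas sketch (the column names and the ~10.5% missingness rate are assumptions chosen to mirror the drop from 10,000 to roughly 8,950 rows; exact counts vary with the random seed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"X1_obs": rng.normal(size=10000),
                   "X2_obs": rng.normal(size=10000)})
df.loc[rng.random(10000) < 0.105, "X1_obs"] = np.nan  # ~10.5% missing

# Listwise deletion on X1_obs only: rows missing in other columns survive.
filtered = df.dropna(subset=["X1_obs"])
print(len(df), len(filtered))
```

The `subset` argument of `dropna` corresponds to applying Filter variable by variable rather than across the whole dataset.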
Before we can evaluate the effect of the Filter, we need to complete the Data Import Wizard. However, given the number of times we have already presented the entire import process, we omit a detailed presentation of these steps. Instead, we fast forward to review the Monitors of the processed variables in BayesiaLab.
In the Graph Panel, the absence of the question mark icon on X1_obs signals that it no longer contains any missing values.
The Monitors now show the processed distributions. However, for a formal review of the processing effects, we must compare the distributions of the newly processed variables with their unprocessed counterparts.
In the overview below, we compare the original distributions (left column), followed by the distributions corresponding to the 10,000 generated samples (center column), and the distributions produced by the application of Missing Values Processing (right column). This is the format we will employ to evaluate all missing values processing methods.
Recalling the section on MCAR data, we know that applying Filter to an MCAR variable should not affect its distribution. Indeed, comparing X1_obs (top right) with X1 (top left), the difference between the distributions is insignificant and due only to the finite sample size; sampling an infinitely large dataset would yield exactly the same distribution.
Now we turn to test the application of Filter to all variables with missing values, i.e., X1_obs, X2_obs, and X4_obs.
Even before evaluating the resulting distributions, we see in the Information Panel that over half of the rows of data are being deleted due to applying Filter. It is easy to see that in a dataset with more variables, this could quickly reduce the number of remaining records—potentially down to zero. In a dataset in which not a single record is completely observed, Filter is obviously not applicable at all.
The following illustration presents the final distributions (right column), which are all substantially biased compared to the originals (left column). Whereas filtering on X1_obs alone, an MCAR variable, was at least “safe” for X1_obs itself, filtering on X1_obs, X2_obs, and X4_obs adversely affects all variables, including X1_obs and even X3_obs, which does not contain any missing values.
As a result, we must strongly advise against using this method within BayesiaLab or in any statistical analysis unless it is certain that all to-be-deleted incomplete observations correspond to missing values that have been generated completely at random (MCAR). Another exception would be if the to-be-deleted observations only represented a very small fraction of all observations. Unfortunately, these caveats are rarely observed, and the Filter method, i.e., listwise or casewise deletion, remains one of the most commonly used methods of dealing with missing values (Peugh and Enders, 2004).
As opposed to deletion-type methods, such as Filter (Listwise/Casewise Deletion), we now consider the “opposite” approach, i.e., filling in the missing values with imputed values. Here, imputing means replacing the non-observed values with estimates in order to facilitate the analysis of the whole dataset.
In BayesiaLab, this function is available via the Replace By option. We can specify any arbitrary value to impute, e.g., based on expert knowledge, or use an automatically generated value. For a Continuous variable, BayesiaLab offers a default replacement of the missing values with the mean value of the variable. For a Discrete variable, the default is the modal value, i.e., the most frequently observed value of the variable. In our example, X1_obs has a mean value of 0.40878022. This is the value to be imputed for all missing values of X1_obs.
Note that Replace By can be applied variable by variable. Thus, it is possible to apply Replace By to a subset of variables only and use other methods for the remaining variables.
For the purposes of our example, we use Replace By for X1_obs, X2_obs, and X4_obs. As soon as this is specified, the number of remaining missing values is updated in the Information Panel. With the selected method, no missing values remain.
In the same way we studied the performance of Filter, we now review the results of the Replace By method. Whereas this imputation method is optimal at the individual/observation level (it is the rational decision for minimizing the prediction error), it is not optimal at the population/dataset level. The right column in the following screenshot shows that imputing all missing values with the same value has a strong impact on the shape of the distributions. Even though the mean values of the processed variables (right column) remain unchanged compared to the observed values (center column), the standard deviation is much reduced.
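The variance reduction is easy to quantify in a small sketch (synthetic MCAR data, chosen only to illustrate the effect): with a 30% missingness share, mean imputation shrinks the standard deviation by roughly the square root of the observed share, i.e., about 16% here.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.4, 1.0, 10000)
x_obs = x.copy()
x_obs[rng.random(10000) < 0.3] = np.nan            # 30% missing (MCAR)
miss = np.isnan(x_obs)

# Replace By (mean imputation): every gap receives the observed mean.
x_imp = np.where(miss, np.nanmean(x_obs), x_obs)

# The mean is untouched, but the standard deviation shrinks by roughly
# sqrt(1 - missing_share) relative to the observed values.
print(round(np.nanmean(x_obs), 3), round(x_imp.mean(), 3))
print(round(np.nanstd(x_obs), 3), round(x_imp.std(), 3))
```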
Similar to our verdict on Filter (Listwise/Casewise Deletion), Replace By cannot be recommended either for general use. However, its application could be justified if expert knowledge were available for setting a specific replacement value or if the number of affected records were negligible compared to the overall size of the dataset.
Structural Expectation Maximization (or Structural EM for short) is the next available option under Infer. This method is very similar to Dynamic Imputation, but instead of imputing values after each structural modification of the model, the set of observations is supplemented with one weighted observation per combination of the states of the jointly unobserved variables. Each weight equals the posterior joint probability of the corresponding state combination.
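The completion step can be sketched on a hypothetical toy model (a binary chain A → B → C with assumed parameters; all names and numbers are inventions for illustration). A row in which A and B are jointly unobserved is replaced by four weighted pseudo-observations, one per state combination, weighted by the posterior joint probability given the observed C:

```python
import itertools
import numpy as np

# Assumed toy model: binary chain A -> B -> C.
p_a = np.array([0.6, 0.4])                    # P(A)
p_b_a = np.array([[0.8, 0.2],
                  [0.3, 0.7]])                # P(B | A)
p_c_b = np.array([[0.9, 0.1],
                  [0.2, 0.8]])                # P(C | B)

def weighted_completions(c_obs):
    """One weighted pseudo-observation per state combination of the jointly
    unobserved variables (A, B), weighted by the posterior joint probability
    of that combination given the observed value of C."""
    joint = {(a, b): p_a[a] * p_b_a[a, b] * p_c_b[b, c_obs]
             for a, b in itertools.product([0, 1], repeat=2)}
    z = sum(joint.values())                   # normalize to the posterior
    return {k: v / z for k, v in joint.items()}

rows = weighted_completions(c_obs=1)
for (a, b), w in sorted(rows.items()):
    print(f"A={a} B={b} weight={w:.3f}")
```

The weights sum to one, so the completed row contributes exactly one observation's worth of evidence to the subsequent parameter estimation.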
Upon completion of the data import process, we perform structural learning again, analogously to what we did in the context of Dynamic Imputation. As it turns out, the discovered structure is equivalent to the one previously learned. Hence, we can immediately proceed to evaluate the performance.
The distributions produced by Structural EM are quite similar to those obtained with Dynamic Imputation. In theory, at least, Structural EM should perform slightly better. However, the computational cost can be even higher than that of Dynamic Imputation because the cost of Structural EM also depends on the number of state combinations of the jointly unobserved variables.
Whereas the standard (non-entropy-based) approaches randomly choose the sequence in which missing values are imputed within a row of data, the entropy-based methods select the order based on the conditional uncertainty associated with the unobserved variable. More specifically, missing values are imputed first for those variables that meet the following conditions:
Variables that have a fully-observed/imputed Markov Blanket;
Variables that have the lowest conditional entropy, given the observations and imputed values.
The advantages of the entropy-based methods are (a) the speed improvement over their corresponding standard methods and (b) their improved ability to handle datasets with large proportions of missing values.
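The ordering criterion itself is simple to illustrate. In the following sketch (the posterior distributions are hypothetical placeholders, not outputs of any real model), the unobserved variables within a row are ranked by the conditional entropy of their current posterior, and the least uncertain one is imputed first:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical posterior distributions of three unobserved variables, given
# the evidence and the values already imputed in the current row.
posteriors = {
    "X2_obs": [0.95, 0.05],   # nearly certain
    "X1_obs": [0.70, 0.30],
    "X4_obs": [0.50, 0.50],   # maximally uncertain
}

# Entropy-based ordering: impute the least uncertain variable first.
order = sorted(posteriors, key=lambda v: entropy(posteriors[v]))
print(order)
```

Imputing low-entropy variables first means that each subsequent imputation is conditioned on the most reliable values available, which is what improves robustness when large portions of a row are missing.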
As a best-practice recommendation, we propose the following sequence of steps:
We run the Maximum Weight Spanning Tree algorithm to learn the first network structure.
Given the now-improved imputation quality, we start another structural learning algorithm, such as EQ, which may produce a more complex network.
Under Infer, we have two additional options, namely Entropy-Based Static Imputation and Entropy-Based Dynamic Imputation. As their names imply, they are based on Static Imputation and Dynamic Imputation, respectively.
As stated earlier, any substantial improvement in the performance of missing values processing comes at a high computational cost. Thus, we recommend an alternative workflow for networks with a large number of nodes and many missing values. The proposed approach combines the efficiency of Static Imputation with the imputation quality of Dynamic Imputation.
Static Imputation is efficient for learning because it does not impose any additional computational cost on the learning algorithm. With Static Imputation, missing values are imputed in memory, which makes the imputed dataset equivalent to a fully observed dataset.
Even though, by default, Static Imputation runs only once at the time of data import, it can be triggered to run again at any time by selecting Main Menu > Learning > Parameter Estimation. Whenever Parameter Estimation is run, BayesiaLab computes the probability distributions on the basis of the current model. The missing values are then imputed by drawing from these distributions. If we now alternate structural learning and Static Imputation repeatedly, we can approximate the behavior of the Dynamic Imputation method. The speed advantage comes from the fact that values are only imputed (on demand) at the completion of each full learning cycle instead of at every single step of the structural learning algorithm.
In Step 3 of the Data Import Wizard, we choose Static Imputation (standard or entropy-based). This produces an initial imputation with the fully unconnected network, in which all the variables are independent.
Upon completion, we prompt another Static Imputation by running Parameter Estimation. Given the tree structure of the network, pairwise variable relationships now provide the distributions used by the imputation process.
The latest, more complex network then serves as the basis for yet another Static Imputation. We repeat these learning and re-imputation steps until we see the network converge toward a stable structure.
With a stable network structure in place, we change the imputation method from Static Imputation to Structural EM via Main Menu > Learning > Missing Values Processing > Structural EM.
While this Approximate Dynamic Imputation workflow requires more input and supervision by the researcher, for learning large networks, it can save a substantial amount of time compared to using the all-automatic Dynamic Imputation or Structural EM. Here, “substantial” can mean the difference between minutes and days of learning time.
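The alternating structure of this workflow can be sketched in a few lines (a deliberately simplified stand-in: a linear fit plays the role of a structural learning run, and the data and missingness pattern are synthetic assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data: X2 depends on X1; 30% of X2 is missing.
n = 5000
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.8 * x1 + rng.normal(0.0, 0.6, n)
x2_obs = np.where(rng.random(n) < 0.3, np.nan, x2)
miss = np.isnan(x2_obs)

# Initial Static Imputation: draws from the marginal of the observed values.
x2_imp = np.where(miss, rng.choice(x2_obs[~miss], n), x2_obs)

# Alternate "learning" on the currently imputed data with a fresh round of
# probabilistic imputation -- the Approximate Dynamic Imputation loop.
for _ in range(10):
    slope, icpt = np.polyfit(x1, x2_imp, 1)        # model fit (stand-in)
    resid_sd = np.std(x2_imp - (slope * x1 + icpt))
    x2_imp = np.where(miss,                        # re-impute on demand
                      slope * x1 + icpt + rng.normal(0, resid_sd, n),
                      x2_obs)

# The imputed dataset recovers the X1-X2 relationship.
print(round(float(np.corrcoef(x1, x2_imp)[0, 1]), 2))
```

Values are re-imputed only once per cycle, after each full model fit, rather than after every incremental model change, which is exactly where the speed advantage over fully dynamic imputation comes from.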
We will now explore formal methodologies that can help us derive causal effects from observational data. These methodologies will ultimately allow us to answer the question raised by Simpson’s Paradox. The process of determining the size of a causal effect from observational data can be divided into two steps, namely identification and estimation.
Identification analysis is about determining whether or not a causal effect can be established from the observed data. This requires a formal causal model or at least partial knowledge of how the data was generated. In this chapter, all causal assumptions for identification are expressed explicitly in the form of a Directed Acyclic Graph (DAG) (Pearl 1995, 2009). They represent our complete causal understanding of the DGP for the system we are studying.
Where do we get such causal assumptions? We would like to say that advanced algorithms can generate causal assumptions from data. That is not the case, unfortunately. Causal assumptions do still require human expert knowledge or, more generally, theory. In practice, this means that we need to build (or draw) a causal graph of our domain. Then, we can examine this graph against formal criteria, which determine whether the effect is identifiable or not.
It is important to realize that the absence of causal assumptions cannot be compensated for by clever statistical techniques or by providing more data. So, recognizing that a causal effect is not identifiable brings the effect analysis to an abrupt halt.
But if the causal effect is identifiable, we can proceed to estimate the effect size. The same criteria that determine identifiability also tell us how to perform the effect estimation. With that, we can utilize the available observational data and estimate the causal effect. Depending on the complexity of the domain, the effect estimation can bring a new set of challenges. However, in the context of Simpson’s Paradox, the effect estimation will be very straightforward.
Randomized experiments have always been the gold standard for establishing causal effects. For instance, in the drug approval process, controlled experiments are mandatory. Without first having established and quantified the treatment effect, along with any associated side effects, no new drug could win approval from the U.S. Food and Drug Administration.
However, in many other domains, experiments are not feasible, be it for ethical, economic, or practical reasons. For example, it is clear that a government could not create two different tax regimes to evaluate their respective impact on economic growth. Neither would it be possible to experiment with two different levels of carbon emissions to measure a warming effect on the global climate.
“So, what does our existing data say?” would be an obvious question from policymakers, especially given today’s high expectations concerning Big Data. Indeed, in lieu of experiments, we can attempt to find instances in which the proposed policy already applies (by some assignment mechanism) and compare those to other instances in which the policy does not apply.
However, as we will see in this chapter, performing causal inference on the basis of observational data requires an extensive range of assumptions, which can only come from theory, i.e., domain-specific knowledge. Despite all the wonderful advances in analytics in recent years, data alone, even Big Data, cannot prove the existence of causal effects.
Today, we can openly discuss how to perform causal inference from observational data. For the better part of the 20th century, however, the prevailing opinion had been that speaking of causality without experiments is unscientific. Only towards the end of the century did this opposition slowly erode (Rubin 1974, Holland 1986), which subsequently led to numerous research efforts spanning philosophy, statistics, computer science, information theory, etc. The Potential Outcomes Framework has played an important role in this evolution of thought.
Although there is no question about the common-sense meaning of “cause and effect,” for formal analysis, we require a precise mathematical definition. In the fields of social science and biostatistics, the potential outcomes framework is a widely accepted formalism for studying causal effects (the potential outcomes framework is also known as the counterfactual model, the Rubin model, or the Neyman-Rubin model). Rubin (1974) defines "causal effect" as follows:
“Intuitively, the causal effect of one treatment, T = 1, over another, T = 0, for a particular unit and an interval of time from t1 to t2 is the difference between what would have happened at time t2 if the unit had been exposed to T = 1 initiated at t1 and what would have happened at t2 if the unit had been exposed to T = 0 initiated at t1: ‘If an hour ago I had taken two aspirins instead of just a glass of water, my headache would now be gone,’ or ‘Because an hour ago I took two aspirins instead of just a glass of water, my headache is now gone.’ Our definition of the causal effect of the T = 1 versus T = 0 treatment will reflect this intuitive meaning.”
In this quote, we altered the original treatment labels E and C to T = 1 and T = 0 in order to be consistent with the nomenclature in the remainder of this chapter. T is commonly used in the literature to denote the treatment condition.
Y1(i): Potential outcome of individual i given treatment T = 1 (e.g., taking two aspirins)
Y0(i): Potential outcome of individual i given treatment T = 0 (e.g., drinking a glass of water)
The individual-level causal effect (ICE) is defined as the difference between the individual’s two potential outcomes, i.e., ICE(i) = Y1(i) − Y0(i).
Given that we cannot rule out differences between individuals (effect heterogeneity), we define the average causal effect as the unweighted arithmetic mean of the individual-level causal effects: ACE = E[Y1 − Y0]
E denotes the expected value, i.e., the unweighted arithmetic mean.
The challenge is that Y1 (under treatment) and Y0 (under non-treatment) can never both be observed for the same individual at the same time. We can only observe treatment or non-treatment, but not both.
So, where does this leave us? What we can produce easily is the “naive” estimator of association between the “treated” and the “untreated” sub-populations: S = E[Y | T = 1] − E[Y | T = 0]
For notational convenience, we omit the index i because we are now referring to sub-populations and not to an individual.
Because the sub-populations of the treated and the untreated contain different individuals, S is not necessarily a measure of causation, in contrast to ACE.
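A tiny simulation makes the gap between S and ACE concrete. In the following sketch (an invented toy population, chosen so that the true causal effect is exactly zero), a confounder Z drives both treatment uptake and the outcome, and the naive estimator nevertheless reports a large positive "effect":

```python
import numpy as np

rng = np.random.default_rng(5)

# Confounded toy population: Z drives both treatment uptake T and outcome Y,
# while T has, by construction, no causal effect on Y at all.
n = 100_000
z = rng.binomial(1, 0.5, n)
t = rng.binomial(1, np.where(z == 1, 0.8, 0.2))   # Z drives assignment
y = rng.binomial(1, np.where(z == 1, 0.7, 0.3))   # Z drives outcome; T does not

# Naive estimator S: difference in outcome means between sub-populations.
s = y[t == 1].mean() - y[t == 0].mean()
print(round(s, 2))   # far from the true causal effect of zero
```

Within each stratum of Z, the treated and untreated outcome means are equal, which is exactly what the naive comparison of the mixed sub-populations conceals.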
The question is, how can we move from what we can measure, i.e., the naive association, to the quantity of interest, i.e., the causal effect? Determining whether we can measure causation from association is known as identification analysis.
We must check whether there are any conditions under which the measure of association, S, equals the measure of causation, ACE. As a matter of fact, this would be the case if the sub-populations were comparable with respect to all confounders, i.e., the factors that could also influence the outcome.
Remarkably, the conditions under which we can measure causal effects from observational data are very similar to those that justify causal inference in randomized experiments. A pure random selection of treated and untreated individuals does indeed remove any potential selection bias and leaves the confounding factor distributions identical in the sub-populations, thus allowing the estimation of the effect of the treatment alone. This condition is known as “ignorability,” which can be formally written as: (Y0, Y1) ⊥ A
This means that the potential outcomes, Y0 and Y1, must jointly be independent of the treatment assignment, A. This condition of ignorability holds in an ideal experiment. Unfortunately, it is very rarely met in observational studies. However, conditional ignorability may hold, which refers to ignorability within subgroups of the domain defined by the values of X (note that X can be a vector): (Y0, Y1) ⊥ A | X
In words, conditional on variables X, Y0 and Y1 are jointly independent of A, the assignment mechanism. If conditional ignorability holds, we can utilize the estimator, S|X, to estimate the average causal effect, ACE|X:
\begin{aligned}
ACE|X &= E[Y_1|X] - E[Y_0|X] \\
      &= E[Y_1|A,X] - E[Y_0|A,X] \\
      &= E[Y|T=1,X] - E[Y|T=0,X] \\
      &= S|X
\end{aligned}
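A small numerical sketch shows what this buys us in practice. Using hypothetical values for E[Y | T, X] and a binary covariate X (all numbers are assumptions for illustration), the stratum-specific effects S|X can be combined by averaging over the distribution of X; this averaging step is our addition, since the identity above is stated per stratum:

```python
# Adjustment sketch with assumed (hypothetical) values:
#   ACE = sum_x ( E[Y|T=1, X=x] - E[Y|T=0, X=x] ) * P(X=x)
p_x = {0: 0.5, 1: 0.5}                      # P(X = x)
e_y = {(1, 0): 0.2, (0, 0): 0.3,            # E[Y | T = t, X = x]
       (1, 1): 0.6, (0, 1): 0.7}

ace = sum((e_y[1, x] - e_y[0, x]) * p_x[x] for x in p_x)
print(round(ace, 3))
```

With these numbers, both strata show the same negative effect, so the adjusted average is negative as well, regardless of how the treatment was distributed across the strata.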
How can we select the correct set of variables among all variables in a system? How do we know that such variables are observed or even exist in a domain? This is what makes the concept of ignorability highly problematic in practice. Pearl (2009) states:
The difficulty that most investigators experience in comprehending what “ignorability” means, and what judgment it summons them to exercise, has tempted them to assume that it is automatically satisfied, or at least is likely to be satisfied if one includes in the analysis as many covariates as possible. The prevailing attitude is that adding more covariates can cause no harm (Rosenbaum 2002, p. 76) and can absolve one from thinking about the causal relationships among those covariates, the treatment, the outcome, and, most importantly, the confounders left unmeasured (Rubin 2009).
The absence of hard-and-fast criteria makes ignorability a potentially dangerous concept for practitioners.
Simpson’s Paradox illustrates the implications of falsely assuming ignorability. This will lead us to abandon the idea of ignorability and, along with it, the potential outcomes framework and replace it with a formal identification and estimation process that relies on graphical models.
This is an important exercise as it illustrates how an incorrect interpretation of an association can produce bias. The word “bias” may not necessarily strike fear into our hearts. In our common understanding, “bias” implies “inclination” and “tendency,” and it is perhaps not a particularly forceful expression. Hence, we may not be overly troubled by a warning about bias. However, Simpson’s Paradox shows how bias can lead to catastrophically wrong estimates.
A hypothetical disease equally affects men and women. An observational study finds that a certain treatment is linked to an increase in the recovery rate among all treated patients from 40% to 50%. Based on the study, this new treatment is widely recognized as beneficial and subsequently promoted as a new therapy.
We can imagine a headline along the lines of “New Therapy Increases Recovery Rate by 10%.” However, when examining patient records by gender, the recovery rate for male patients—upon treatment—decreases from 70% to 60%; for female patients, the recovery rate declines from 30% to 20%. Men are, therefore, more likely to recover than women, with or without treatment.
So, is this new treatment effective overall or not? This puzzle can be resolved by realizing that, in this observed population, there was an unequal application of the treatment to men and women, i.e., some type of self-selection occurred. More specifically, 75% of male patients and only 25% of female patients received the treatment. Although the reason for this imbalance is irrelevant for inference, one could imagine that the side effects of this treatment are much more severe for females, who thus seek alternative therapies. As a result, there is a greater share of men among treated patients. Given that men have a better a priori recovery prospect with this type of disease, the recovery rate of all treated patients increases. So, what is the true causal effect of this treatment?
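Before formalizing anything, we can verify the arithmetic behind the story in a few lines of Python (the 600/600 gender split is an assumption for illustration; only the proportions matter):

```python
# Reconstruction of the arithmetic: equal numbers of men and women, 75% of
# men but only 25% of women treated, and the quoted gender-specific rates.
counts = {"male": 600, "female": 600}
p_treat = {"male": 0.75, "female": 0.25}
p_recover = {("male", 1): 0.60, ("male", 0): 0.70,
             ("female", 1): 0.20, ("female", 0): 0.30}

treated = {g: counts[g] * p_treat[g] for g in counts}
untreated = {g: counts[g] * (1 - p_treat[g]) for g in counts}

rate_treated = (sum(treated[g] * p_recover[g, 1] for g in counts)
                / sum(treated.values()))
rate_untreated = (sum(untreated[g] * p_recover[g, 0] for g in counts)
                  / sum(untreated.values()))

# Aggregation reverses the stratum-level conclusion.
print(round(rate_untreated, 2), round(rate_treated, 2))
```

Although the treatment lowers the recovery rate by 10 percentage points within each gender, the aggregate rates come out as 40% untreated versus 50% treated, purely because men dominate the treated group.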
Our particular manifestation of Simpson’s Paradox is not very far-fetched, but it is still fictional. Therefore, we must rely on synthetic data to make this problem domain tangible for our study efforts. We generate 1,200 observations by sampling from the joint probability distribution of the original Data-Generating Process (DGP). Needless to say, for this dataset to serve as the kind of non-experimental observations we would encounter under real-world conditions, we must treat the true DGP as unknown; it remains merely an assumption.
Our synthetic dataset consists of three variables with two discrete states each:
X (Treatment): Yes (1)/No (0)
Y (Outcome): Recovered (1)/Not Recovered (0)
Z (Gender): Male (1)/Female (0)
The following table shows a preview of the first ten rows of the dataset:
“Δημόκριτος έλεγε βούλεσθαι μάλλον μίαν ευρείν αιτιολογίαν ή την Περσών βασιλείαν εαυτού γενέσθαι.” (“Democritus used to say that ‘he prefers to discover a causality rather than become a king of Persia’.”) — Democritus, according to a late testimony of Dionysius, Bishop of Alexandria, by Eusebius of Caesarea in Præparatio Evangelica (Εὑαγγελικὴ προπαρασκευή)
Bayesian networks and modern causality analysis are intimately tied to the seminal works of Judea Pearl. It is presumably fair to say that one of the “unique selling points” of Bayesian networks is their capability to perform causal inference. However, we want to go beyond merely demonstrating the mechanics of causal inference. Rather, we want to establish under what conditions causal inference can be performed. More specifically, we want to see the assumptions required to perform causal inference with non-experimental data.
To approach this topic, we need to break with the pattern established in the earlier chapters of this book. Instead of starting with a case study, we start off at a higher level of abstraction. First, we discuss in theoretical terms what is required for performing causal identification, estimation, and inference. Once these fundamentals are established, we can proceed to discuss the methods, along with their limitations, including Directed Acyclic Graphs and Bayesian networks. These techniques can help us distinguish causation from association when working with non-experimental data.
This chapter was prepared in collaboration with Felix Elwert on the basis of his course, Causal Inference with Graphical Models.
In this chapter, we discuss causality mostly on the basis of a “toy problem,” i.e., a simplified and exaggerated version of a real-world challenge. As such, the issues we raise about causality may appear somewhat contrived. Additionally, the constant practical use of causal inference in our daily lives may make our discussion seem somewhat artificial.
To highlight the importance of causal inference on a large scale, we want to consider how and under what conditions big decisions are typically made. Major government or business initiatives generally call for extensive studies to anticipate the consequences of actions not yet taken. Such studies are often referred to as “policy analysis” or “impact assessment.”
What can be the source of such predictive powers? Policy analysis must discover a causal mechanism that links a proposed action/policy to a potential consequence/impact. Unfortunately, experiments are typically out of the question in this context. Rather, impact assessments—from non-experimental observations alone—must determine the existence and the size of a causal effect.
Given the sheer number of impact analyses performed and their tremendous weight in decision-making, one would like to believe that there has been a long-established scientific foundation with regard to (non-experimental) causal effect identification, estimation, and inference. Quite naturally, as decision-makers quote statistics in support of policies, the field of statistics comes to mind as the discipline that studies such causal questions.
However, casual observers may be surprised to hear that causality has been anathema to statisticians for the longest time.
"Considerations of causality should be treated as they always have been treated in statistics, preferably not at all..." (Speed, 1990).
The repercussions of this chasm between statistics and causality can still be felt today. Judea Pearl highlights this unfortunate state of affairs in the preface of his book Causality:
"… I see no greater impediment to scientific progress than the prevailing practice of focusing all our mathematical resources on probabilistic and statistical inferences while leaving causal considerations to the mercy of intuition and good judgment." (Pearl, 2000)
Rubin (1974) and Holland (1986), who introduced the counterfactual (potential outcomes) approach to causal inference, can be credited with overcoming statisticians’ traditional reluctance to engage with causality. However, it will take many years for this fairly recent academic consensus to fully reach the world of practitioners, which is one of our key motivations for promoting Bayesian networks.
Please read the following subchapters in sequence to receive a complete explanation.
In the causal context, however, the arcs in a DAG explicitly state causality instead of only representing direct probabilistic dependencies in a Bayesian network. We now designate a DAG with a causal semantic as a Causal DAG (CDAG) to highlight this distinction.
A DAG has three basic configurations in which nodes can be connected. Graphs of any size and complexity can be broken down into these basic graph structures. While these basic structures show direct dependencies/causes explicitly, there are more statements contained in them, albeit implicitly. In fact, we can read all marginal and conditional associations that exist between the nodes.
Why are we even interested in associations? Isn’t all this about understanding causal effects? It is essential to understand all associations in a system because, in non-experimental data, all we can do is observe associations, some of which represent non-causal relationships. Our objective is to identify causal effects from associations.
This DAG represents an indirect connection from A to B via C.
A Directed Arc represents a potential causal effect. The arc direction indicates the assumed causal direction, i.e., “A → C” means “A causes C.”
A Missing Arc encodes the definitive absence of a direct causal effect, i.e., a missing arc between A and B means that no direct causal relationship exists in either direction between A and B. As such, a missing arc represents an assumption.
Implication for Causality
A has a potential causal effect on B intermediated by C.
Implication for Association
Marginally (or unconditionally), A and B are dependent. This means that without knowing the exact value of C, learning about A informs us about B and vice versa, i.e., the path between the nodes is unblocked, and information can flow in both directions.
Conditionally on C, i.e., by setting Hard Evidence on (or observing) C, A and B become independent. In other words, by “hard”-conditioning on C, we block the path from A to B and from B to A. Thus, A and B are conditionally independent given C: A ⊥ B | C
Hard Evidence means that there is no uncertainty regarding the value of the observation or evidence. If uncertainty remains regarding the value of C, the path will not be entirely blocked, and an association will remain between A and B.
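These claims about the chain A → C → B can be checked numerically. The sketch below uses made-up probabilities for three binary variables and enumerates the joint distribution P(A) P(C|A) P(B|C); it confirms that A and B are marginally dependent but become independent once Hard Evidence is set on C.

```python
from itertools import product

# Hypothetical CPTs for the chain A -> C -> B (all numbers are illustrative only)
p_a = {True: 0.3, False: 0.7}                      # P(A)
p_c = {True: {True: 0.9, False: 0.1},              # P(C | A=True)
       False: {True: 0.2, False: 0.8}}             # P(C | A=False)
p_b = {True: {True: 0.8, False: 0.2},              # P(B | C=True)
       False: {True: 0.3, False: 0.7}}             # P(B | C=False)

def joint(a, c, b):
    # Factorization implied by the DAG: P(A, C, B) = P(A) P(C|A) P(B|C)
    return p_a[a] * p_c[a][c] * p_b[c][b]

def prob_b_given(a=None, c=None):
    # P(B=True | evidence) by summing the joint over unobserved variables
    num = den = 0.0
    for av, cv, bv in product([True, False], repeat=3):
        if (a is not None and av != a) or (c is not None and cv != c):
            continue
        p = joint(av, cv, bv)
        den += p
        if bv:
            num += p
    return num / den

# Marginally, A and B are dependent: learning A changes our belief about B.
assert abs(prob_b_given(a=True) - prob_b_given(a=False)) > 1e-9

# Given Hard Evidence on C, A and B are independent: A adds nothing about B.
assert abs(prob_b_given(a=True, c=True) - prob_b_given(a=False, c=True)) < 1e-12
```

Because the common-parent structure is equivalent to the chain in terms of association, the same check applies to it with the factorization P(C) P(A|C) P(B|C).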
The second configuration has C as the common parent of A and B.
Implication for Causality
C is the common cause of both A and B.
Implication for Association
In terms of association, this structure is equivalent to the Indirect Connection. Thus, A and B are marginally dependent but conditionally independent given C (by setting Hard Evidence on C): A ⊥ B | C.
The final structure has a common child C, with A and B being its parents. This structure is called a “V-Structure.” In this configuration, the common child C is also known as a “collider.”
Implication for Causality
A and B are the direct causes of C.
Implication for Association
Marginally (or unconditionally), A and B are independent, i.e., there is no information flow between A and B. Conditionally on C — with any kind of evidence — A and B become dependent. If we condition on the collider C, information can flow between A and B, i.e., conditioning on C opens the information flow between A and B.
Even introducing a minor change in the distribution of C, e.g., from no observation (“color unknown”) to a very vague observation (“it could be anything, but it is probably not purple”), opens the information flow.
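The reversed behavior of the V-structure can be verified by the same kind of enumeration. In this sketch (all probabilities invented), A and B have independent priors and C depends on both; observing A alone says nothing about B, but once C is observed, A and B become dependent.

```python
from itertools import product

# Hypothetical CPTs for the V-structure A -> C <- B (numbers illustrative only)
p_a = {True: 0.4, False: 0.6}
p_b = {True: 0.5, False: 0.5}
p_c_true = {(True, True): 0.95, (True, False): 0.6,   # P(C=True | A, B)
            (False, True): 0.7, (False, False): 0.1}

def joint(a, b, c):
    # Factorization implied by the DAG: P(A, B, C) = P(A) P(B) P(C|A,B)
    pc = p_c_true[(a, b)] if c else 1 - p_c_true[(a, b)]
    return p_a[a] * p_b[b] * pc

def prob_b_given(a=None, c=None):
    num = den = 0.0
    for av, bv, cv in product([True, False], repeat=3):
        if (a is not None and av != a) or (c is not None and cv != c):
            continue
        p = joint(av, bv, cv)
        den += p
        if bv:
            num += p
    return num / den

# Marginally, A and B are independent: observing A alone tells us nothing about B.
assert abs(prob_b_given(a=True) - prob_b_given(a=False)) < 1e-12

# Conditioning on the collider C opens the path: A and B become dependent.
assert abs(prob_b_given(a=True, c=True) - prob_b_given(a=False, c=True)) > 0.1
```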
To begin the encoding of our causal knowledge in the form of a CDAG, we draw three nodes, which represent X (Treatment), Y (Outcome), and Z (Gender). For now, we are only using the qualitative part of the network, i.e., we are not considering probabilities.
The absence of further nodes means that we assume that there are no additional variables in the Data-Generating Process (DGP), either observable or unobservable. Unfortunately, this is a very strong assumption that cannot be tested. We need to have a justification purely on theoretical grounds to make such an assumption.
In the next step, we must encode our causal assumptions regarding this domain. Given our background knowledge of this domain, we state that Z causes X and Y and that X causes Y.
This means that we believe that gender is a cause of taking the treatment and has a causal effect on the outcome, too. We also assume that the treatment has a potential causal effect on the outcome.
Indirect Connection: Z causes Y via X
Common Parent: Z causes X and Y
Common Child: Z and X cause Y
We will use an example that appears trivial on the surface but which has produced countless instances of false inference throughout the history of science. Due to its counterintuitive nature, this example has become widely known as Simpson’s Paradox ().
“Impact assessment, simply defined, is the process of identifying the future consequences of a current or proposed action.” ()
“Policy assessment seeks to inform decision-makers by predicting and evaluating the potential impacts of policy options.” ()
We need to understand some important properties before encoding our causal knowledge in a DAG. We learned earlier that Bayesian networks use DAGs for the qualitative description of the Joint Probability Distribution.
For purposes of formal reasoning, this type of connection is of special significance. Conditioning on C facilitates inter-causal reasoning, often referred to as the ability to “explain away” the other cause, given that the common effect is observed (see ).
Having accepted these causal assumptions, we now wish to identify the causal effect of X on Y. The question is whether this is possible on the basis of this causal graph and the available observational data for these three variables. Before we can answer this question, we need to think about what this CDAG specifically implies. Recall the types of structures that can exist in a DAG (see ). As it turns out, we can find all three of the basic structures in this example:
Returning to the original version of the CDAG, without the hidden variable, we are now ready to proceed with the estimation. However, this CDAG is only a qualitative representation of our theory about the DGP. We now need to consider this graph as a model representing the joint probability distribution of our three variables P(X, Y, Z).
We do not yet need to determine what this probability function is; we simply need to consider this graph as a non-parametric probability function linking X, Y, and Z. This will help us understand what it means to adjust for Z to estimate the causal effect.
In a DAG, a path is a sequence of non-intersecting, adjacent arcs, regardless of their direction.
A causal path is any path from cause to effect in which all arcs are directed away from the cause and toward the effect.
A non-causal path is any path between cause and effect in which at least one of the arcs is oriented from effect to cause.
Our example contains both types of paths:
This distinction between causal and non-causal paths is critically important for identification.
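The classification is mechanical enough to express in a few lines of code. The sketch below (an illustration, not BayesiaLab's implementation) takes a path as a node sequence plus the set of directed arcs of the CDAG, and checks whether every step follows an arc's direction:

```python
# Classify a path between treatment and outcome as causal or non-causal.
# Assumes `path` is a valid path, i.e., each consecutive pair is connected
# by an arc in one direction or the other.

def classify_path(path, arcs):
    """Causal iff every step follows an arc's direction (cause -> ... -> effect)."""
    steps = list(zip(path, path[1:]))
    if all((u, v) in arcs for u, v in steps):
        return "causal"
    return "non-causal"

# The CDAG of the example: Z -> X, Z -> Y, X -> Y
arcs = {("Z", "X"), ("Z", "Y"), ("X", "Y")}

print(classify_path(["X", "Y"], arcs))        # the direct path X -> Y
print(classify_path(["X", "Z", "Y"], arcs))   # the path X <- Z -> Y
```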
The Adjustment Criterion (Shpitser et al., 2010) is perhaps the most intuitive among several graphical identification criteria. The Adjustment Criterion states that a causal effect is identified if we can adjust for a set of variables such that:
What does “adjust for” mean in practice? “Adjusting for a variable” can stand for any of the following operations, which all introduce information on a variable:
Controlling
Conditioning
Stratifying
Matching
At this point, the adjustment technique is irrelevant. Rather, we only need to determine which variables, if any, need to be adjusted for in order to block the non-causal paths while keeping the causal paths open. Revisiting both paths in our CDAG, we can now examine which ones are open or blocked:
In this example, the Adjustment Criterion can be met by blocking the non-causal path X ← Z → Y by means of adjusting for Z. In other words, adjusting for Z allows identifying the causal effect from X to Y. From now on, we will often refer to such variables Z as Confounders.
Thus far, we have assumed that our example has no unobserved (also called hidden or latent) variables. However, if we had reason to believe that there is another variable, U, which appears to be relevant on theoretical grounds but was not recorded in the dataset, identification could no longer be possible. Why? Let us assume U is a hidden common cause of X and Y. By adding this unobserved variable U, a new non-causal path appears between X and Y via U.
Given that U is hidden, there is no way to adjust for it, and, therefore, we have an open, non-causal path that cannot be blocked. Hence, the causal effect is no longer identifiable, and thus, it can no longer be estimated without bias.
This highlights how easily identification can be “ruined.” Once again, we can only justify the absence of unobserved variables on theoretical grounds.
It is self-evident that causal arcs have implications in terms of causation. However, as we pointed out earlier in this chapter (see ), there are also implications regarding the association of variables. This will perhaps become clearer as we introduce the concepts of “causal path” and “non-causal path.”
Non-Causal Path: X ← Z → Y
Causal Path: X → Y
All non-causal paths between treatment and outcome are “blocked” (non-causal relationships prevented).
All causal paths from treatment to outcome remain “open” (causal relationships preserved).
First, we look at the non-causal path in our CDAG: X ← Z → Y. This implies that there is an indirect association between X and Y via Z that has to be blocked by adjusting for Z.
Next is the causal path in our CDAG: X → Y. It consists of a single arc from X to Y, which is open by default and cannot be blocked.
Readers may be familiar with the expression “controlling for confounders.” What is important to bear in mind is that not all covariates in a system are Confounders! Recall Judea Pearl’s warning about ignorability and the risk of treating every covariate as a Confounder (see ).
Let us summarize what we have so far: First, we have observational data from our domain. Second, we developed a theory about the DGP, i.e., the causal relationships in the domain. Both together will serve as the basis for estimating the causal effect. Before we do that, we should contemplate a very literal interpretation of these causal relationships.
If this causal graph is a correct representation of how the domain works, then every relationship between a pair of variables holds independently. Thus, the causal graph represents autonomous relationships between parent nodes and child nodes. It is as if each node were “listening for instructions” from its parents and only from its parents: the child node’s values are solely determined by the value of its parents, not of any other nodes in the system. Also, these relationships remain invariant regardless of any values that other nodes take on.
Let us now consider an outside intervention on X. Thus, rather than "listening" to its parent Z, X is now entirely determined by an external force and set to specific values, e.g., X=1 or X=0. This external intervention breaks the natural relationship between X and Z. Thus, Z no longer influences X. However, Z → Y and X → Y remain unaffected, and the original “natural” values of Z are not affected either.
What is the significance of all this? The idea is that intervening on X is like trying out, or simulating, what would happen if treatment were to be applied universally to the entire population — or withheld universally. Isn’t this the causal effect we are interested in? In other words, computing the causal effect is like simulating outside interventions on the treatment variable X.
How does this help us? By simulating an intervention, we “mutilate” the graph. This new graph looks like we had severed the arc going into the treatment variable X. This operation is what Judea Pearl has rather colorfully named “graph surgery” or “graph mutilation.”
Applying Graph Surgery allows us to transform a causal graph that represents a joint probability distribution P of observational data, i.e., pre-intervention distribution, into a new mutilated graph that represents the joint probability distribution Pm of the same variables under a simulated intervention, i.e., post-intervention distribution.
This new graph can tell us what happens to Y when we intervene and set X to a specific value, i.e., P(Y | do(X=x)). Note the do-operator! With this mutilated graph, we can compute the quantity of interest, the Average Causal Effect (ACE):
ACE = P(Y=1 | do(X=1)) − P(Y=1 | do(X=0)),
where do(X=x) denotes setting X to the value x by intervention rather than observing it.
How can we translate the abstract concept of Graph Surgery into something that can compute actual numerical values? In fact, we can work directly with graphs — in the form of Bayesian networks — and use BayesiaLab to perform Graph Surgery and simulate interventions.
However, before we illustrate that in the next section of this chapter, we want to formally conclude the line of reasoning that connects the pre-intervention distribution P to the post-intervention distribution Pm and introduce the Adjustment Formula. We paraphrase Pearl, Glymour, and Jewell (2016), p. 56f. to develop this formula.
In our example, we can easily estimate the pre-intervention distribution P from the available data, but we need the post-intervention distribution Pm to calculate the causal effect. The key lies in recognizing that Pm shares two essential properties with P.
First, the marginal distribution of Z remains invariant under the intervention because the process of determining Z is unaffected by removing the arc Z → X. In our example, this means that the share of men and women must remain the same before and after the intervention: Pm(Z=z) = P(Z=z).
Second, the conditional probability distribution of Y remains invariant under the intervention because the process that determines how Y responds to X and Z stays the same, regardless of whether X changes naturally or through external intervention. We can state this formally as follows: Pm(Y=y | X=x, Z=z) = P(Y=y | X=x, Z=z).
Furthermore, X and Z are marginally independent in the mutilated graph. This means that the conditional probability distribution of Z given X in the mutilated graph is the same as the marginal probability distribution of Z in the pre-intervention graph: Pm(Z=z | X=x) = Pm(Z=z) = P(Z=z).
Since the Adjustment Criterion is satisfied in the mutilated graph, we have the following: P(Y=y | do(X=x)) = Pm(Y=y | X=x).
By conditioning on Z and summing over all values z, we obtain: Pm(Y=y | X=x) = Σz Pm(Y=y | X=x, Z=z) × Pm(Z=z | X=x).
Furthermore, X and Z are independent in the mutilated graph: Pm(Z=z | X=x) = Pm(Z=z) = P(Z=z).
Using the two invariance equations above, we obtain what is known as the Adjustment Formula. It expresses the post-intervention distribution exclusively in terms of the pre-intervention distribution: P(Y=y | do(X=x)) = Σz P(Y=y | X=x, Z=z) × P(Z=z).
The Adjustment Formula computes the association between X and Y for each value z (or stratum z ∈ Z) and then produces a weighted average. On this basis, we can now estimate the Average Causal Effect (ACE): ACE = Σz [P(Y=1 | X=1, Z=z) − P(Y=1 | X=0, Z=z)] × P(Z=z).
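The Adjustment Formula can be computed directly from the pre-intervention quantities. In this sketch, the stratum-specific recovery rates are hypothetical, chosen only so the arithmetic is easy to follow; only the structure of the computation is the point.

```python
# Hypothetical pre-intervention quantities (not the book's actual data)
p_z = {"male": 0.5, "female": 0.5}                  # P(Z)
p_y = {(1, "male"): 0.70, (0, "male"): 0.80,        # P(Y=1 | X, Z):
       (1, "female"): 0.30, (0, "female"): 0.40}    # recovery rate per stratum

def p_y_do_x(x):
    # Adjustment Formula: P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)
    return sum(p_y[(x, z)] * p_z[z] for z in p_z)

ace = p_y_do_x(1) - p_y_do_x(0)
assert ace < 0   # in this illustration, the treatment lowers the recovery rate
```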
We know that by performing a randomized experiment, we obtain an unbiased estimate of the causal effect of the treatment. More specifically, randomization randomly splits the patient population into two sub-populations, forcing the first group to receive the treatment and withholding it from the second group. Through the random assignment of the treatment, we ensure that there is no association between Z and X. Also, all other properties remain unaffected by the randomization of treatment, including the distribution of Z, the relationship between Z and Y, and the relationship between X and Y.
As a result, graph surgery can be seen as a “randomization after the fact.” However, we need to realize that performing graph surgery can only achieve quasi-randomization with regard to observed and known confounders, in our case Z. A randomized experiment, however, can make treatment independent of all other confounders, observed, unobserved, and unknown. Thus, randomized experiments remain the gold standard for establishing causal effects.
All our efforts in estimating causal effects through adjustment or graph surgery are merely an attempt to mimic the properties of a randomized experiment. Unfortunately, we can never measure how close we are to achieving this goal. We can only be disciplined with our assumptions and make a causal claim based on that.
Simpson’s Paradox Resolved
Returning to Simpson’s Paradox, the Adjustment Formula, P(Y=y | do(X=x)) = Σz P(Y=y | X=x, Z=z) × P(Z=z), gives us the answer to our question of whether we need to look at the aggregate data table or the gender-specific data table for determining the true causal effect of treatment on the outcome: “Conditioning on Z and summing over all values z” means that we need to utilize the gender-specific table. More specifically, we need to compute the association between X and Y for each value of Z, i.e., each stratum z ∈ Z, and then calculate the weighted average. This estimation method is also known as stratification.
Aggregate Table
Gender-Specific Table
The ACE turns out to be negative, i.e., it has the opposite sign of what we would have inferred by merely looking naively at the association between treatment and outcome. This illustrates that a bias in the estimation of an effect can be more than just a nuisance for the analyst. Bias can reverse the sign of the effect! In our example, the treatment under study would kill people instead of healing them. The good news is that we have a theory stipulating that gender is a confounder, and this variable is observed. If it were not recorded in our dataset (hidden variable), we would not be able to compute the causal effect of treatment. We can also imagine situations where we do not know that confounders exist and, therefore, do not measure them. This can lead to substantially wrong estimations of causal effects and lead to policies with catastrophic consequences.
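To make the reversal concrete, here is a sketch with a table of counts that is consistent with the facts stated in this chapter (1,200 patients; 75% of the treated and 25% of the untreated are male) but whose recovery counts are invented for illustration. The aggregate comparison and the stratified comparison point in opposite directions:

```python
# Hypothetical counts: (treated, gender) -> (recovered, total)
# 600 treated (450 male, 150 female), 600 untreated (150 male, 450 female)
counts = {(1, "male"): (315, 450), (1, "female"): (45, 150),
          (0, "male"): (120, 150), (0, "female"): (180, 450)}

def rate(x, z=None):
    # Recovery rate for treatment status x, optionally within a gender stratum z
    groups = [g for g in counts if g[0] == x and (z is None or g[1] == z)]
    recovered = sum(counts[g][0] for g in groups)
    total = sum(counts[g][1] for g in groups)
    return recovered / total

# Aggregate table: treatment *appears* beneficial ...
assert rate(1) > rate(0)                        # 0.60 vs. 0.50

# ... yet within every gender stratum it is harmful.
assert rate(1, "male") < rate(0, "male")        # 0.70 vs. 0.80
assert rate(1, "female") < rate(0, "female")    # 0.30 vs. 0.40
```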
We now introduce Causal Effect Estimation by means of Likelihood Matching. Given the simplicity of Simpson’s Paradox example, the need for yet another estimation method may not be immediately apparent. The advantages of Likelihood Matching will only become clear as we study a more complex domain, such as the marketing mix example of the next chapter. However, the current example makes it easy to explain Likelihood Matching.
In statistics, matching refers to the technique that makes confounder distributions of the treated and untreated sub-populations as similar as possible to each other. As such, applying matching to variables qualifies as adjustment, and we can use it with the objective of keeping causal paths open and blocking non-causal paths. In the Simpson’s Paradox example, matching is fairly simple as we only need to match a single binary variable, i.e., Gender. That will meet our requirement for adjustment and block the only non-causal path in our model.
As our terminology of “blocking paths by matching” may not be understood outside the world of graphical models and Bayesian networks, we can offer a more intuitive interpretation of matching, which our example can illustrate very well.
Because of the self-selection phenomenon in our population, the Gender distribution is a function of Treatment. In other words, of those who are treated, 75% turn out to be male. Among untreated patients, only 25% are male.
Given that we know that Gender has a causal effect on Outcome (men are more likely to recover than women, with or without treatment), and given this difference in gender composition, comparing the outcomes between the treated and non-treated patients is clearly not an apples-to-apples comparison.
We can propose a common-sense solution to this predicament. How about searching for a subset of patients within treated and non-treated groups, which have an identical gender mix, as illustrated below?
In statistical matching, this process typically involves the selection of units in such a way that comparable groups are created:
In practice, this can be more challenging as the observed units typically have more than just a single binary attribute. So, the idea of matching has to be extended to higher dimensions, and the observed units need to be matched on a range of attributes, including both continuous and discrete variables.
However, matching observations exactly with regard to all covariates is rarely feasible. For instance, patients are characterized by dozens or even hundreds of attributes and comorbidities. Finding two matching patients would be difficult enough, but finding populations with many matching pairs of patients would presumably be impossible.
So, how does randomization do it? Actually, randomization does not guarantee identical populations but rather ensures that the distributions of confounders are balanced between the populations under study. So, to pursue balanced confounders instead of pursuing perfect matches, numerous similarity measures have been proposed for matching.
The concept of Propensity Score Matching has become a particularly popular method (Rosenbaum and Rubin, 1983). Instead of matching individuals on their high-dimensional attributes, we would match observations by their probability of treatment, i.e., P(X=1|Z), which is known as the Propensity Score. Rosenbaum and Rubin have shown that matching on the propensity score achieves a balance of the covariate distributions.
However, matching on the Propensity Score requires the score itself to be estimated first. Conventional models only represent the outcome variable as a function of the treatment variable and the confounders, i.e., P(Y|X, Z). If we need to understand the relationship between the treatment and the confounders, i.e., P(X|Z), we have to estimate this separately. This usually means fitting a function, such as a regression, that models the propensity score, PS=P(X|Z). For binary treatment variables, logistic regression is a common choice for the functional form.
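With a single discrete covariate, the propensity score can even be estimated by simple counting; a regression model only becomes necessary with many or continuous covariates. A minimal sketch, using made-up records that mirror the 75%/25% treatment shares of the example:

```python
# Hypothetical records of (gender, treated) pairs, illustrative only:
# 450 treated men, 150 untreated men, 150 treated women, 450 untreated women
records = ([("male", 1)] * 450 + [("male", 0)] * 150 +
           [("female", 1)] * 150 + [("female", 0)] * 450)

def propensity(z):
    # Propensity score PS = P(X=1 | Z=z), estimated by counting within the stratum
    treated_flags = [x for (zz, x) in records if zz == z]
    return sum(treated_flags) / len(treated_flags)

print(propensity("male"))    # 0.75
print(propensity("female"))  # 0.25
```

Matching would then pair treated and untreated units with (approximately) equal scores, rather than equal raw attributes.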
With BayesiaLab's Likelihood Matching, we do not directly match the underlying observations. Rather we match the distributions of the relevant nodes on the basis of the joint probability distribution represented by the Bayesian network. In our example, we need to ensure that the gender compositions of untreated and treated groups are the same, i.e., a 50/50 gender mix. This theoretically ideal condition is shown in the Monitors below.
However, the actual distributions reveal the inequality of gender distributions for the untreated and the treated.
How can we overcome this? By simply using Probabilistic Evidence to set a 50/50 gender mix, i.e., the marginal distribution of Gender, upon setting evidence on Treatment. We can also right-click on the Monitor for Gender and select Fix Probabilities from the Contextual Menu. This will automatically use Probabilistic Evidence to set the current marginal distribution after each conditioning on Treatment, or any other node.
With Fix Probabilities applied, the distribution of Gender in its Monitor is highlighted in purple.
Other than colors, nothing appears to have changed. However, once we set the values Treatment=False and Treatment=True, we see that the distribution for Gender does not change. We can also observe that the corresponding posterior probabilities for Outcome are the same as those obtained with Graph Surgery.
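Numerically, fixing the Gender distribution amounts to weighting each stratum by the marginal P(Gender) rather than by P(Gender | Treatment). A sketch with hypothetical stratum rates shows that this yields the same answer as the Adjustment Formula and, therefore, as Graph Surgery:

```python
# Hypothetical quantities (illustrative only, not the actual network's values)
p_z = {"male": 0.5, "female": 0.5}                     # marginal P(Gender)
p_y = {(True, "male"): 0.70, (False, "male"): 0.80,    # P(Recovered | Treatment, Gender)
       (True, "female"): 0.30, (False, "female"): 0.40}

def p_recovered_matched(treated):
    # Likelihood-matched posterior: Gender is held at its marginal distribution
    # while conditioning on Treatment
    return sum(p_y[(treated, z)] * p_z[z] for z in p_z)

ace = p_recovered_matched(True) - p_recovered_matched(False)
assert ace < 0   # same sign and size as the adjustment-based estimate
```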
With that, we can once again calculate the Average Causal Effect, which matches the value obtained with Graph Surgery.
For now, Likelihood Matching applied to Simpson's Paradox may not seem like a breakthrough method. Conceptually and practically, it appears to be another form of adjustment. The fundamental advantages of Likelihood Matching will become clear in the context of the next chapter, Causality and Optimization.
We now understand that Graph Surgery and Adjustment are equivalent. However, with Bayesian networks, we can go beyond the metaphor and—quite literally—perform graph surgery. In this section, we create a Bayesian network to represent Simpson’s Paradox example and then perform graph surgery to estimate the causal effect.
We have already defined a causal graph earlier when we encoded our causal assumptions regarding this domain. We can reuse this causal understanding for building a causal Bayesian network in BayesiaLab.
As we illustrated in the context of the knowledge modeling exercise in Chapter 4, we manually create the nodes and draw the arcs on BayesiaLab’s Graph Panel. We choose to use long names for the nodes instead of X, Y, and Z. Letters were very convenient for formulas, but long names increase the readability of Bayesian networks. To further help with interpretation, we also associate images with each node and display them as Badges. Then, we use View > Layout > Genetic Grid Layout > Top-Down Repartition to obtain a layout that takes into account the direction of the arcs and define layers accordingly.
The Genetic Grid Layout algorithms are particularly useful for causal networks. We can, therefore, define one of these two algorithms as the one associated with the shortcut P via Preferences > Window > Preferences > Automatic Layout > Layout Algorithm Associated with Shortcut.
The yellow warning symbols remind us that the probability tables associated with the nodes have yet to be defined. At this point, we could set the parameters based on our knowledge of all the probabilities in this domain. Instead, we utilize the available data and use BayesiaLab’s Parameter Estimation to establish the quantitative part of this network via Maximum Likelihood Estimation. We have been using Parameter Estimation extensively in this book, either implicitly or explicitly, for instance, in the context of structural learning and missing values estimation (see Parameter Estimation in Chapter 5).
Parameter Estimation
Previously, we acquired the data needed for Parameter Estimation via the Data Import Wizard. Now we will use the Associate Data Wizard for the same purpose. Whereas the Data Import Wizard generates new nodes from columns in a database, the Associate Data Wizard links columns of data with existing nodes. This way, we can “fill” our qualitative network with data and then perform Parameter Estimation to generate the quantitative part of the network. We now show the corresponding steps in detail.
We start the Associate Data Wizard from the main menu: Data > Associate Data Source > Text File
This prompts us to select the text file containing our observational data (Simpson.csv). Upon selecting the file, BayesiaLab brings up the first screen of the Associate Data Wizard.
Given that the Associate Data Wizard mirrors the Data Import Wizard in most of its options, we do not describe them again here. We merely show the screens for reference as we click Next to progress through the wizard.
The last step shows how the variables in the dataset will be associated with the nodes of the network. If the column names in the dataset perfectly match the existing node names, BayesiaLab automatically creates an association. However, this is not the case in our example. Therefore, we have to manually define the association by iteratively selecting each Dataset Variable and Network Node, and then clicking on the right arrow.
Upon clicking on the right arrow, BayesiaLab brings up a screen for defining the association between the values used in the dataset and the states of the node. Again, the state names of our nodes do not correspond exactly to the values used in the dataset. So we have to manually define the association by iteratively selecting each Dataset Value and Network State and then clicking on the right arrow.
Once this is done for all three variables, the Associate Data Wizard displays how the columns in the dataset are associated with the nodes of the network.
Upon clicking Finish, we are prompted whether we want to view the Associate Report.
The Database icon in the lower right-hand corner of the main window indicates that our network now has a database associated with its structure. We now use this data to estimate the parameters of the network: Learning > Parameter Estimation.
Once the parameters are estimated, there are no longer any warning symbols tagged onto the nodes.
We now have a fully specified Bayesian network. By opening the Node Editor of Outcome, for instance, we see that the CPT is indeed filled with probabilities.
Upon clicking on the Occurrences tab, we can see the counts that were used by the Maximum Likelihood Estimation to derive these probabilities.
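Maximum Likelihood Estimation of a CPT cell is nothing more than normalizing the observed counts for each parent configuration, i.e., P(Y=y | X=x, Z=z) = N(x, z, y) / N(x, z). A sketch with hypothetical counts (these are not the values from Simpson.csv):

```python
from collections import Counter

# Hypothetical observations: tuples of (treatment, gender, outcome)
data = ([(True, "male", "recovered")] * 315 + [(True, "male", "not")] * 135 +
        [(True, "female", "recovered")] * 45 + [(True, "female", "not")] * 105)

def mle_cpt(observations, treatment, gender):
    # P(Outcome | Treatment, Gender): normalized counts for one parent configuration
    outcomes = [o for (t, g, o) in observations if t == treatment and g == gender]
    return {o: c / len(outcomes) for o, c in Counter(outcomes).items()}

print(mle_cpt(data, True, "male"))   # 315 / 450 recovered, 135 / 450 not
```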
Recall that distinguishing between causal and non-causal paths is crucial for the application of the Adjustment Criterion. BayesiaLab can help us review the paths that are present in the graph. Given that we already understand the paths, showing the formal path analysis with BayesiaLab is merely for reference.
Once we define Outcome as Target Node and switch into the Validation Mode (F5), we can examine all possible paths to the Target Node in this network. We select Treatment and then select Main Menu > Analysis > Visual > Graph > Influence Paths to Target.
Then, BayesiaLab displays a pop-up window with the Influence Paths report. Selecting any of the listed paths shows the corresponding arcs in the Graph Panel. Causal paths are shown in blue; non-causal paths are pink.
It is easy to see that this automated path analysis can be particularly helpful with more complex networks. In any case, the result confirms our previous manual path analysis, which means that we need to adjust for Gender to block the non-causal path between Treatment and Outcome.
Before proceeding to the effect estimation, we bring up the Monitors of all three nodes and compare the probabilities reported by the network with the Aggregate and Gender-Specific Tables, which gave rise to the paradox.
For instance, the screenshot below shows the prior distributions (left) and the posterior distributions (right) given the observation Treatment = True.
As expected, the target variable Outcome changes upon setting this evidence. However, Gender changes as well, even though we know that the treatment cannot possibly change the gender of a patient. What we observe here is a manifestation of the non-causal path: Treatment ← Gender → Outcome. These probabilities are perfectly correct from the observational point of view: in the observed population of 1,200 individuals, three times as many men as women took the treatment.
For causal inference, however, we need a network that computes all probabilities under an intervention scenario. As we learned, Graph Surgery transforms the original causal network representing the pre-intervention distribution into a new, mutilated network that yields the post-intervention distribution.
In BayesiaLab, Graph Surgery is automated. After right-clicking the Monitor of the node Treatment, we select Intervention from the Contextual Menu.
The activation of the Intervention Mode for this node is highlighted by the blue background of the Treatment's Monitor and the arrow symbols (→) in the Treatment's badge.
By double-clicking a state of Treatment, we now set an Intervention and no longer an Observation.
By intervening on Treatment, BayesiaLab applies Graph Surgery and removes the inbound arc into Treatment.
Recall the formula that computes the Average Causal Effect (ACE): ACE = P(Outcome=Recovered | do(Treatment=True)) − P(Outcome=Recovered | do(Treatment=False)).
We can take it directly as a set of instructions and compare the probability of Outcome=Recovered under do(Treatment=False) and do(Treatment=True). Note that the distribution of Gender remains the same pre- and post-intervention.
Thus, we obtain an Average Causal Effect of −0.1, which agrees with what we previously computed with the Adjustment Formula.
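The whole pipeline — factorized pre-intervention model, surgery, post-intervention query — fits in a short sketch. The parameters below are hypothetical, chosen so that the gender mix and the −0.1 effect match the example's headline numbers; they are not read from the actual network.

```python
from itertools import product

# Hypothetical pre-intervention model P(Z) P(X|Z) P(Y|X,Z)
p_z = {"male": 0.5, "female": 0.5}
p_x = {"male": 0.75, "female": 0.25}           # P(X=1 | Z): self-selection
p_y = {(1, "male"): 0.70, (0, "male"): 0.80,   # P(Y=1 | X, Z)
       (1, "female"): 0.30, (0, "female"): 0.40}

def p_y1(x_obs=None, do_x=None):
    # P(Y=1 | evidence): observational (x_obs) or interventional (do_x)
    num = den = 0.0
    for z, x in product(p_z, [0, 1]):
        if do_x is not None:
            px = 1.0 if x == do_x else 0.0     # surgery: X no longer listens to Z
        else:
            px = p_x[z] if x == 1 else 1 - p_x[z]
        if x_obs is not None and x != x_obs:
            continue
        w = p_z[z] * px
        den += w
        num += w * p_y[(x, z)]
    return num / den

# Observational and interventional answers differ in sign:
obs_diff = p_y1(x_obs=1) - p_y1(x_obs=0)   # biased by the open path via Z
ace = p_y1(do_x=1) - p_y1(do_x=0)          # causal effect from the mutilated model
assert obs_diff > 0 and ace < 0
```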
The way out of this predicament is to turn to modeling. Instead of working with an enormously large joint probability table, we will approximate the joint probability distribution of the domain with a model. But does this not take us back to the problem of being unable to machine-learn a causal model? No, because we do not need a causal model. Rather, we merely need a statistical model to approximate the statistical relationships of the variables in our domain. Of course, that task is easy. In earlier chapters, we have already introduced various machine learning algorithms in BayesiaLab that can generate Bayesian networks from data.
Given that we have a target variable, i.e., Sales, Supervised Learning will be the appropriate approach. First, we import the dataset ACME_Data.csv and discretize all variables with the R2GenOpt algorithm into five states. Our choice of this discretization algorithm follows the rationale that we presented in earlier chapters.
After importing the dataset, we use the Augmented Naive Bayes algorithm to learn a network with Sales as the Target Node. The result is shown below.
Here, we will omit a discussion of the network quality. Rather, we proceed with the given network and look at how to use it to understand our problem domain.
As is, this network has the familiar appearance of a predictive model. And we have kept emphasizing that machine learning can only produce predictive models, which, in turn, are only capable of performing observational inference. However, it is causal inference that we require for the purpose of marketing mix optimization. Clearly, a causal interpretation of the arcs in the above network would not make any sense. After all, Sales is the outcome, not the cause.
This does not matter because we had no intention of finding a causal model with machine learning. We had already given up on finding the true causal model. The model we obtained is only meant to compactly represent the Joint Probability Distribution (JPD) of all variables in this domain.
Our key to causal inference is that we have the JPD, as represented by the machine-learned Bayesian network, and we know, based on domain knowledge and the Disjunctive Cause Criterion, which variables are Confounders.
One issue remains open, though, and that is what mechanism to use for estimation. Recall matching as an estimation technique and, in particular, Likelihood Matching (see Intuition for Matching). Here, however, we are not dealing with just one covariate Z, but rather with 10, of which 8 are confounders.
Before proceeding to estimation in BayesiaLab, we need to formally declare which variables are Confounders. In BayesiaLab, all nodes are considered Confounders by default. Hence, we need to declare the inverse condition, i.e., we must tell BayesiaLab which nodes are not Confounders or “Non-Confounders.”
As per the rationale we laid out earlier, we need to mark the variables that are not pre-treatment (Co-Op Promotions, Competitive Incentives, Web Traffic, Showroom Traffic, and Test Drives) and assign them to the predefined Class Non_Confounders.
Select the nodes to be added to the Class Non_Confounders.
Right-click on one of them to bring up the Contextual Menu.
Select Properties > Classes > Add.
Check the radio button for Predefined Class and pick Non_Confounder from the drop-down menu.
As a result, we now have a clear distinction between Confounders and Non-Confounders and can perform an effect estimation on that basis.
Add Color to Non-Confounders
To highlight this distinction, we assign colors to each Class.
Right-click on the Graph Panel background to bring up the Contextual Menu.
Select Edit Classes.
In the Class Editor, highlight Non_Confounder in the list and click Associate Colors.
From the dialog box, select Associate Default Colors with Classes and click OK.
Now, all Non-Confounders are highlighted in red.
Under Estimation Challenges, we already pointed out that stratification is no longer feasible as an estimation technique, which leaves us with matching. However, this example exceeds Simpson’s Paradox in complexity in a number of ways. In Simpson’s Paradox, we had one Confounder and a single treatment variable, which had only one treatment level.
Now we have several Confounders and several treatments. And many of the treatment variables have multiple treatment levels. As a result, the task of matching is no longer straightforward. Now, we need to simultaneously balance all Confounders with regard to each treatment variable in such a way that each treatment remains at its marginal probability level. With that in place, we can simulate different treatment levels and observe the corresponding outcomes. What we then observe in the different outcomes is indeed the causal effect. It is easy to imagine that the computational effort for this process is substantial. It goes beyond the scope of this book to explain the details of BayesiaLab’s Likelihood Matching algorithm. For our purposes, we only need to know how to invoke the algorithm in BayesiaLab for causal effect estimation. Likelihood Matching is launched whenever we run Direct Effects in BayesiaLab.
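The details of BayesiaLab's Likelihood Matching algorithm remain out of scope, but the generic idea behind matching on confounder strata can be sketched in a few lines. The toy estimator below (with hypothetical field names z, t, y; this is not BayesiaLab's proprietary algorithm) compares treated and untreated outcomes only within identical confounder strata and weights each stratum by its size:

```python
from collections import defaultdict

def exact_match_effect(records, treatment, outcome, confounders):
    """Toy exact-matching estimator (NOT BayesiaLab's Likelihood Matching):
    compare treated and untreated outcomes within identical confounder
    strata, then average the per-stratum differences weighted by size."""
    strata = defaultdict(lambda: {True: [], False: []})
    for r in records:
        key = tuple(r[c] for c in confounders)
        strata[key][bool(r[treatment])].append(r[outcome])

    effect, weight = 0.0, 0
    for arms in strata.values():
        if arms[True] and arms[False]:  # a stratum matches only with both arms
            n = len(arms[True]) + len(arms[False])
            diff = (sum(arms[True]) / len(arms[True])
                    - sum(arms[False]) / len(arms[False]))
            effect += n * diff
            weight += n
    return effect / weight if weight else float("nan")

# Hypothetical records: one confounder z, binary treatment t, outcome y
records = [
    {"z": 0, "t": 1, "y": 1}, {"z": 0, "t": 0, "y": 0},
    {"z": 1, "t": 1, "y": 0}, {"z": 1, "t": 0, "y": 0},
]
print(exact_match_effect(records, "t", "y", ["z"]))  # 0.5
```

With many confounders and many treatment levels, exact matching like this quickly runs out of matched strata, which is precisely why a likelihood-based balancing approach is needed.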
While this chapter’s example is inspired by a real business, we are trying to minimize any resemblance to a particular company or industry. We shall refer to our fictional business as ACME Corp. All of its sales and marketing data are synthetically generated. This allows us to illustrate a variety of effects in a single composite example. With the few publicly available datasets of real marketing and sales data, we might not be able to observe the range of characteristics that we wish to present here. Also, for ease of interpretation, we have magnified some effects, resulting in a somewhat idealized consumer response to the marketing actions of ACME Corp. With artificial data, we also have the luxury of plentiful observations, although that is not necessarily unrealistic.
Our fictional business ACME utilizes a variety of marketing and advertising channels. Throughout this chapter, we will also refer to them as "marketing drivers" or just "drivers."
TV Advertising
Internet Advertising
Print Advertising
Direct Marketing
Incentives (i.e., price discounts)
These variables are all measured on proprietary scales. For instance, TV Advertising might be measured in GRP (Gross Rating Points), and Print might be recorded in column-inches. Incentives refer to price promotions and discounts measured in dollars. We consider all these marketing instruments to be under ACME’s control, i.e., we can set them to any desired level within an overall budget constraint.
Furthermore, we have a target variable, Sales, measured in units sold daily, which we hope to improve by optimizing the mix of marketing instruments.
Beyond the variables that are under ACME’s control, there are four calendar variables:
Quarter
Weekday
Month
End-of-Month Indicator
Finally, we measure a number of variables that are beyond ACME's direct control but still have an influence on the business, including:
Co-Op Promotions (promotions sponsored by the vehicle manufacturer)
Competitive Incentives (i.e., price discounts on competitive products)
Web Traffic (organic traffic to ACME's website—not paid-for traffic)
Showroom Traffic (organic visits to ACME's facilities)
Test Drives (organic)
While numerous other variables certainly exist in this domain, we have no further data available. Later, we will need to formalize our assumptions in this regard.
For a comprehensive study of marketing mix modeling and optimization, there are numerous questions we should consider, such as:
Which form of advertising is the strongest driver of ACME's sales?
How do competitive incentives affect ACME's sales?
What is the optimum marketing mix overall, given different levels of marketing budget constraints?
Are there saturation effects of certain marketing channels?
Are there counterproductive promotions?
How can we attribute the observed sales volume to marketing initiatives?
What would be the baseline sales volume without any advertising?
Are there synergy effects that make some instruments more important jointly than individually?
From 25 to 1,439,428,141,044,398,334,941,790,719,839,535,103 Graphs
However, without a causal graph, we cannot use the familiar criteria for confounder selection, such as the Adjustment Criterion. And, without the ability to select and control for confounders, we cannot employ the usual estimation methods. A commonly used fall-back position is to simply “control for all pretreatment covariates” (Rubin, 2009). However, the example in the previous chapter highlighted the risks of doing that. So, it appears that we have already reached a dead end with our example.
We show that, irrespective of what the true causal structure is and irrespective of whether there are important unobserved variables, if there exists some subset of the observed covariates that suffices to control for confounding, then the set obtained by applying our criterion will also constitute a set that suffices.
This is a profound insight, given that not knowing the causal structure between covariates is not at all unique to our example. We speculate that “causal ignorance” is the prevailing condition in most research projects. VanderWeele and Shpitser have shown that confounders can be found differently:
We propose that control be made for any [pre-treatment] covariate that is either a cause of treatment or of the outcome or both.
In other words, we now need to ask the common-sense question about each covariate, “is it a cause of the treatment or the outcome or both?” If the answer is yes, the variable in question is a confounder, and we must control for it to estimate the causal effect. With this new approach to confounder selection, we do not need to know—or even consider—the relationships between the covariates.
There are a number of caveats, though. Once again, we have to assume that there are no unobserved confounders. Furthermore, we must also assume that there exists a set of variables Z that would meet one of the formal identification criteria. If this assumption holds, the proposed selection criterion will identify the set of variables Z. We must stress that such an assumption cannot be tested. It can only be justified on theoretical grounds. Nevertheless, it is a much weaker assumption than claiming to know the full causal structure.
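Under the stated assumptions, the Disjunctive Cause Criterion reduces to a simple membership test over pre-treatment covariates. The sketch below is illustrative only: the is_cause helper and the toy cause table encode our own reading of the domain, not anything in the ACME dataset.

```python
def select_confounders(covariates, treatment, outcome, is_cause):
    """Disjunctive Cause Criterion: control for every pre-treatment
    covariate that is a cause of the treatment, of the outcome, or both.
    `is_cause(a, b)` encodes domain knowledge (hypothetical helper)."""
    return {z for z in covariates
            if is_cause(z, treatment) or is_cause(z, outcome)}

# Toy domain-knowledge table (our illustrative assumptions):
causes = {
    ("Weekday", "Sales"), ("Quarter", "Sales"),
    ("Print Advertising", "Sales"),
    ("Weekday", "TV Advertising"),  # ad schedules follow the calendar
}
is_cause = lambda a, b: (a, b) in causes

covariates = ["Weekday", "Quarter", "Print Advertising", "Showroom Traffic"]
print(select_confounders(covariates, "TV Advertising", "Sales", is_cause))
# Showroom Traffic is excluded: in this toy table it causes neither
# the treatment nor the outcome
```

Note how the test never asks about relationships *among* the covariates, which is the whole appeal of the criterion.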
In the context of this marketing example, we consider Sales as the outcome variable. However, unlike a single treatment X in Simpson’s Paradox, we now have many potential treatments, i.e., all the marketing variables. Not only do we have to identify Confounders with regard to one treatment/outcome relationship, but we need to do this for all treatment/outcome pairs. For instance, if we were considering TV Advertising as the treatment, we would need to check all covariates as to whether they are causes of TV Advertising, Sales, or both.
In this domain, it is fairly easy to judge that all of ACME’s advertising efforts (TV Advertising, Internet Advertising, Print Advertising, Direct Marketing) and Incentives should be seen as causes of Sales. We also reason that the calendar variables, Quarter, Weekday, Month, and End-of-Month Indicator are also causes of Sales. It is common knowledge, for instance, that Saturday is the main car shopping day in the U.S. Also, the fourth quarter marks the start of a new model year and concludes the calendar year, which makes it the peak selling season.
Five variables remain, i.e., Co-Op Promotions, Competitive Incentives, Web Traffic, Showroom Traffic, and Test Drives. Are they not also causes of Sales? Yes, but we argue that they are not pre-treatment variables. Rather, given our domain knowledge, we believe that these variables “respond” to ACME’s marketing and advertising efforts, meaning that they are "downstream" from the original causes. For example, the original cause Print Advertising drives Showroom Traffic, which subsequently leads to Sales.
Regarding Competitive Incentives being a Non-Confounder, we argue that the competition presumably wants to counteract ACME’s efforts. If ACME increases incentives as part of a campaign, competitors would presumably follow suit with their own incentives. However, one may object to this line of reasoning and instead suggest that Competitive Incentives come first and prompt ACME to react. From that viewpoint, Competitive Incentives would definitely be pre-treatment. However, if we treated Competitive Incentives as a Confounder, it would imply that the competition would “hold still” while ACME tries out different marketing spend levels during optimization. Clearly, this would be an unrealistic assumption. Instead, we believe it is reasonable to think that the competition would react as they always have historically.
With that, we have specified all explicit assumptions regarding this domain. Furthermore, we have also assumed implicitly that there are no unobserved Confounders, i.e., that no other hidden variables exist that influence our domain. Such a claim is almost as bold as assuming the complete causal structure of this domain. Also, there is no way to test for the existence of hidden Confounders. As outrageous as this may seem, this assumption is made in virtually all models based on observational data, regardless of the modeling technique. The only way we can justify this assumption is on theoretical grounds, i.e., we need to have domain knowledge that allows us to rule out unobserved Confounders.
Now that we have identified the Confounders, we would expect to be able to estimate the causal effects. Theoretically, the Adjustment Formula (i.e., stratification) could serve as our computation method. Why is this not immediately feasible?
The first objection is that our dataset consists of mostly continuous variables rather than discrete states, which was the case in the previous chapter. However, we can overcome this challenge by discretizing all continuous variables, which we have demonstrated repeatedly in this book.
With the discretized dataset of our domain, a new problem looms: assuming that all confounders now have 5 discrete states, we would need to calculate the weighted average of approximately 2 million strata to estimate the effect of one treatment. In other words, we would require the entire joint probability table representing the domain. Even if we could manage the computational task with a computer fast enough and with enough memory to store the joint probability table, we would not have anywhere near enough observations to estimate this table. While we have thousands of daily observations, our joint probability table would consist of millions of rows.
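The combinatorics behind the “approximately 2 million strata” are easy to check. Assuming five discrete states per adjustment variable (the exact number of variables in the adjustment set depends on the treatment under study), the strata count grows as a power of five:

```python
# Stratification blows up combinatorially: with five discrete states per
# variable, each additional adjustment variable multiplies the strata by 5.
states = 5
for n_vars in (8, 9, 10):
    print(n_vars, "variables ->", states ** n_vars, "strata")

# Nine 5-state variables already yield 5**9 = 1,953,125 strata, roughly
# the "2 million" quoted above -- far more cells than a few thousand
# daily observations could ever populate.
```

This is why the Adjustment Formula, while theoretically valid, is practically hopeless here.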
Let us summarize where we stand. We started this chapter with the challenge that we could not encode one true causal graph. Thus, identification using traditional criteria was not possible. The new Disjunctive Cause Criterion by VanderWeele and Shpitser saved us from having to define a full causal graph. Fewer, simpler assumptions now suffice to select the confounders. But now that we have the Confounders, our straightforward estimation techniques, e.g., stratification, will no longer work. It seems that for every step forward, we must take another step back.
Our updated story contains four additional dimensions:
Treatment Availability: the treatment is not always available;
Side Effects: the treatment may produce severe side effects;
Efficacy: some patients do not respond to the drug;
Litigation: the families of patients who died may sue the pharmaceutical company that had provided the treatment.
The manually designed CDAG shown below describes this new domain qualitatively:
Next, we describe the quantitative part of the domain. First, we state that Treatment Availability is 75%.
We also assume that the treatment may have Side Effects, which are much more frequent for females. The following conditional probability table quantifies this direct causal dependency on Gender:
Patients decide whether or not to take the treatment based on two criteria, Treatment Availability and Side Effects. The dependencies are described in the following table. It states that if the treatment is unavailable, patients cannot have the treatment, which is deterministic (and obvious). However, if the treatment is available, those patients who do not have any risk of experiencing side effects will always choose the treatment, while those at risk will be unlikely to submit to the treatment:
Furthermore, the Efficacy of the treatment depends on Drug Administration plus some hidden factors that render the treatment ineffective in 20% of patients:
The Target Node, Outcome, is defined by Gender and Efficacy. In this context, "not recovered" means that the patient died — hence the grim illustration attached to the icon.
Finally, half of the families of those patients who took the treatment and died are pursuing litigation. More specifically, these families are suing the pharmaceutical company that provided the treatment.
We now list the paths between each variable and the target variable Outcome by using Main Menu > Analysis > Visual > Graph > Influence Paths to Target.
Notice the arrow symbols (→) in the badges of the nodes that are set to Intervention Mode.
Before using BayesiaLab's automated tools for computing causal effects, we manually estimated the causal effect of our main variable of interest, Drug Administration, by using the Monitors.
The Average Causal Effect of Drug Administration on Outcome, mediated by Efficacy, is -0.08.
We can plot these curves with Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Total Effects.
For generating this graph, we can set a number of options:
The x-axis, Variable Delta Means, represents the difference between the Mean Value generated for the analysis (here, Hard Evidence/Intervention on the states of the variable under study) and its Prior Mean Value. The y-axis represents the difference between the Posterior Mean Value of Outcome and its Prior Mean Value.
If we do not specifically associate numerical values with symbolic states, BayesiaLab uses the state index. In our example,
False is 0, and True is 1.
Male is 0, and Female is 1.
Not Recovered is 0, and Recovered is 1.
We see that Side Effects is the only variable with a positive causal effect. We also notice that Litigation has no causal effect.
Given that all variables are binary, the corresponding curves are linear. Therefore, the curves' derivatives will be perfect summaries of the Total Effect Curves: Main Menu > Analysis > Report > Target > Total Effects on Target:
The arrow symbols (→) in the results table indicate that Intervention Mode was active on all nodes, triggering Graph Surgery upon each observation/intervention during the estimation of the effects.
Gender is the variable with the strongest Total Effect. It is negative because of the index values of the states. Females (1) are recovering at a lower rate than Males (0).
The Total Effect measures the effects of these two causal paths: the direct path (#1) and the indirect path (#2), represented by the dashed blue arcs below.
Now suppose we are interested in estimating the effect of the direct paths only. This would require blocking not only the non-causal paths but also the indirect causal paths. This is the role of BayesiaLab's Direct Effect functions. The only difference between Direct and Total Effect functions is that, by default, all other nodes are held constant during the estimation of the variable's Direct Effect.
We generate the Direct Effect Curves with Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Direct Effects, using the same parameters as those previously used for Total Effects:
Given that all nodes are in Intervention Mode, the only variables with Direct Effects are the Parents of Outcome. Indeed, intervening on all nodes to hold them constant triggers Graph Surgery and generates the mutilated graph below:
The function Main Menu > Analysis > Report > Target > Direct Effects on Target allows us to compute the Direct Effects, the single-point estimates of these curves:
The Direct Effect is the slope of the Direct Effect Curve between the endpoints of the variable interval.
The Standardized Direct Effect is the Direct Effect times the ratio between the standard deviation of the variables and the standard deviation of the Target Node.
The Elasticity is the Direct Effect times the ratio between the range of the variable and the range of the Target Node.
The Contribution is the Standardized Direct Effect divided by the total sum of Standardized Direct Effects.
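The four definitions above can be exercised with a few lines of arithmetic. The slopes, standard deviations, and ranges below are made up purely to show the formulas at work; they are not results from the example network.

```python
def standardized_direct_effect(de, sd_x, sd_target):
    # Direct Effect times the ratio of standard deviations, as defined above
    return de * sd_x / sd_target

def elasticity(de, range_x, range_target):
    # Direct Effect times the ratio of ranges, as defined above
    return de * range_x / range_target

# Hypothetical slopes and statistics, purely illustrative:
direct_effects = {"A": 0.30, "B": 0.10}
sd = {"A": 0.5, "B": 0.4, "Target": 0.6}

std = {x: standardized_direct_effect(de, sd[x], sd["Target"])
       for x, de in direct_effects.items()}
total = sum(std.values())
contribution = {x: s / total for x, s in std.items()}

print(std["A"])   # 0.25
print(contribution)  # contributions sum to 1 by construction
```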
Non-Confounders
By default, BayesiaLab's Direct Effect functions measure a variable's effect by holding all other variables constant. However, we can use the predefined class Non_Confounder to define the nodes we do not want to control.
In our example, the main variable of interest, Drug Administration, has no direct effect. The post-treatment variable Efficacy mediates its causal effect, and the Direct Effect analysis blocks the path. We must therefore use the predefined class Non_Confounder (Efficacy's Contextual Menu > Properties > Classes > Add > Predefined Class > Non_Confounders) to prevent BayesiaLab from holding Efficacy constant and allow the estimation of the mediated causal effect. The new mutilated graph below is then used for estimating the Direct Effects:
Drug Administration's Direct Effect now equals the Average Causal Effect we manually computed with the Monitors. You can also note that we no longer analyze the effect of the Non-Confounder Efficacy.
Now suppose we want to use Likelihood Matching instead of Graph Surgery. We first set all nodes back to Observation Mode via the Monitors' Contextual Menus.
The nodes of interest are the nodes for which we want to estimate the causal effect on the Target Node. We call them Treatments or Drivers.
The first step is then to define our nodes of interest. In the Augmented Simpson Paradox, the main variable of interest is obviously Drug Administration, but for illustrative purposes, let's consider Gender as well.
We have seen in the Path Analysis section that there are two paths between Gender and Outcome, both causal (#1 and #2). Thus, there is no variable to adjust for to estimate the Total Effect.
So let's start with Gender. We select the node, go to Main Menu > Analysis > Report > Target > Total Effects on Target, and confirm that we want to perform the analysis on the selected node only:
For Drug Administration, let's suppose we choose to adjust for Side Effects. We right-click on its associated Monitor and select Fix Probabilities from its Contextual Menu.
Then, we select the node Drug Administration, go to Analysis > Report > Target > Total Effects on Target, and confirm that we want to perform the analysis on the selected node only:
Now let's look at the workflow for estimating Direct Effects with Likelihood Matching, i.e., how to assess the effects of the direct paths only. Remember that, by default, BayesiaLab's Direct Effect functions measure a variable's effect by holding all variables constant except those associated with the predefined class Non_Confounder.
Holding a variable constant with Graph Surgery implies the deletion of its entering arcs. Thus, there is no risk of biasing the estimation of Direct Effects. In the Likelihood Matching case, this risk exists because we set evidence on the variable to adjust for it. Indeed, controlling for descendants of the Target Node (e.g., Litigation) automatically biases the estimate.
While we previously added Efficacy to the Non_Confounder class to let it mediate the effect of Drug Administration, we must also add Litigation to prevent its adjustment.
Notice that there is no conflict in this analysis:
Gender
Controlling for Side Effects and Drug Administration cuts the indirect causal path (#2);
Controlling for Treatment Availability has no impact;
Not controlling for Efficacy has no effect, as path #2 is already blocked;
Not controlling for Litigation avoids biasing the estimation of the effect;
Drug Administration
Controlling for Side Effects and Gender cuts the non-causal path (#4);
Controlling for Treatment Availability has no impact;
Not controlling for Efficacy lets the information flow from Drug Administration to Outcome;
Not controlling for Litigation avoids biasing the estimation of the effect.
We can, therefore, select our two nodes of interest, use Analysis > Report > Target > Direct Effects on Target, and confirm that we want to perform the analysis on the selected nodes only:
Before concluding this chapter, let's summarize the main characteristics of Graph Surgery and Likelihood Matching:
Graph Surgery
requires a fully specified Causal Bayesian Network;
uses the mutilated Causal Bayesian Network for causal inference;
Likelihood Matching
requires the causal analysis of the domain to define the variables that need to be adjusted for to block the non-causal paths and leave the causal paths open;
uses the Bayesian network to carry out probabilistic inference with the adjusted variables. Note that this network does not have to be causal! It just needs to represent the Joint Probability Distribution of the domain.
This last point is especially important. It is indeed sometimes challenging, if not impossible, to design the fully specified Causal Bayesian Network. However, BayesiaLab offers a wide range of machine-learning algorithms that we can use to induce a network that represents the Joint Probability Distribution. Hence, we only need to have a limited amount of causal knowledge to define the variables that have to be adjusted for.
For example, suppose we machine-learned the network below with Main Menu > Learning > Supervised Learning > Augmented Naive Bayes:
The main architecture of the network is Naïve, i.e., the Target Node is the parent of all nodes. Therefore, this Bayesian network is clearly not causal. If we were to use Graph Surgery, we would not find any total or direct effects (see the corresponding mutilated graph below when estimating the Direct Effects with Efficacy and Litigation defined as Non_Confounder).
However, Likelihood Matching returns the correct estimations for the Total Effects with two separate analyses.
One analysis for Gender, without adjusting for any variables:
And one analysis for Drug Administration, by holding constant Side Effects:
As for Direct Effects, the analysis can be carried out for both variables with the current definition of Non_Confounders.
This chapter highlights how much effort is required to derive causal effect estimates from observational data. Simpson’s Paradox illustrates how much can go wrong even in simple circumstances. Given such potentially serious consequences, it is a must for policy analysts to examine all aspects of causality formally. To paraphrase Judea Pearl, we must not leave causal considerations to the mercy of intuition and good judgment. Fortunately, causality has emerged from its pariah status in recent decades, which has allowed tremendous progress in theoretical research and practical tools: “[…] practical problems relying on causal information that long were regarded as either metaphysical or unmanageable can now be solved using elementary mathematics” (Pearl, 1999).
Note that the mutilated graph achieves what the Adjustment Criterion stipulates: the non-causal path X ← Z → Y no longer exists, and, given the autonomy of the other arcs, the causal path X → Y remains unblocked.
The single most important thing we need to recognize regarding the above questions is that they are all causal questions. This means we are not looking for a prediction of Sales based on the observation of marketing variables. Rather, we wish to simulate the manipulation of all marketing variables in such a way that we maximize Sales. Thus, we are performing an intervention in our domain, which requires causal inference. This is the reason why we can only introduce the marketing mix question after having established foundational causal concepts in . There, we explained how a causal graph, combined with certain criteria, can tell us precisely what variables we have to adjust to identify a causal effect. We merely had to provide a causal graph, i.e., encode our causal understanding of the domain. With three variables in , there were only 25 possible causal structures. A common-sense understanding of the domain allowed us to quickly identify the only reasonable graph in terms of causal directions. In that case, making assumptions about the full causal structure was straightforward. So, we may now feel well-equipped to answer more complex causal questions, such as the given marketing domain.
Going from three variables in to 14 variables in this marketing example, it would be reasonable to expect a substantial increase in the number of possible causal networks. As it turns out, given a set of 14 variables, there are now 1.4×10³⁶ possible causal structures as opposed to 25. Clearly, we can no longer rely on our intuition to pick the correct one out of over one undecillion possible graphs. Furthermore, clever algorithms and fast computers cannot help us with this task either. As it stands, causal directions can generally not be discovered through machine learning.
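The count of possible structures follows from Robinson's recurrence for labeled directed acyclic graphs (OEIS A003024), which we can verify directly, recovering both the 25 graphs for three variables and the 37-digit figure for 14:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Count labeled DAGs on n nodes via Robinson's recurrence
    (OEIS A003024), using inclusion-exclusion over the k source nodes."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k))
               * num_dags(n - k)
               for k in range(1, n + 1))

print(num_dags(3))   # 25 possible structures for three variables
print(num_dags(14))  # 1,439,428,141,044,398,334,941,790,719,839,535,103
```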
As it turns out, recent research has made significant progress and produced a new criterion for selecting confounders. VanderWeele and Shpitser have discovered that it is possible to select confounders without knowing the full causal graph:
Now that we have seen how to estimate the Average Causal Effect by manually interacting with BayesiaLab's Monitors, with both Graph Surgery and Likelihood Matching, we will use BayesiaLab's Direct and Total Effect functions to compute causal effects automatically for a set of variables. But first, we present a slightly more complex version of Simpson's Paradox to illustrate these features.
The causal paths are highlighted in blue, and the non-causal paths (i.e., paths with at least one "backward" arrow ←) are shown in pink:
Recall the Adjustment Criterion, which stipulates that we must keep all of a variable's causal paths to the target variable open and simultaneously block all its non-causal paths for estimating its causal effect.
To illustrate BayesiaLab's Total and Direct Effects functions with Graph Surgery, we set all nodes but Outcome to Intervention Mode.
Setting a piece of Evidence in Intervention Mode simulates an intervention on Drug Administration and mutilates the graph, as shown below, which meets the Adjustment Criterion by blocking the non-causal path (cf. path #6).
We have seen that BayesiaLab estimates Total Effects as the derivatives of Total Effect Curves. These curves are based on the Posterior Mean Values of the Target Node given Mean Values from the interval of the variable under study. While the variables are in Intervention Mode, the Posterior Mean Values are computed based on the mutilated graph.
The Total Effect is the derivative computed at (0, 0) in the previous Target Mean Analysis graph, i.e., the slope of the curve. The Standardized Total Effect is the Total Effect times the ratio between the standard deviation of the variable and the standard deviation of the Target Node.
Note that there are two paths from Gender to Outcome (paths #1 and #2 identified in the Path Analysis), and they are both causal. Gender is indeed a root node, i.e., it has no parents, meaning the Adjustment Criterion is fulfilled by default.
In the previous section, we assumed that all nodes were of interest and set them to Intervention Mode. With Likelihood Matching, the workflow is less straightforward. For each Driver, we need to analyze the paths to the Target to define the set of nodes that must be controlled for to block the non-causal paths and leave the causal paths open. Note that these sets of nodes may differ for each Driver, requiring us to perform multiple Total Effect analyses to avoid conflicting adjustments.
The path analysis indicates that there are also two paths between Drug Administration and Outcome, one causal (#7) and one non-causal (#6): Drug Administration ← Side Effects ← Gender → Outcome. So we need to adjust for Side Effects, or for Gender, to block this path. This contradicts the analysis of Gender's effect: we cannot estimate the Total Effects of Gender and Drug Administration in the same analysis with Likelihood Matching!
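This path bookkeeping can be made explicit in a few lines of code. The sketch below enumerates all simple paths between Drug Administration and Outcome in the skeleton of this example's four-node graph and labels each path causal (every step follows an arrow) or non-causal (at least one backward arrow). It is a simplified illustration, not BayesiaLab's path analysis.

```python
# Directed edges of the example graph, as (parent, child) pairs
edges = {("Gender", "Outcome"), ("Gender", "Side Effects"),
         ("Side Effects", "Drug Administration"),
         ("Drug Administration", "Outcome")}

# Undirected adjacency (the skeleton), used to enumerate paths
nodes = {n for e in edges for n in e}
adj = {n: {m for m in nodes if (n, m) in edges or (m, n) in edges}
       for n in nodes}

def undirected_paths(src, dst, path=None):
    """Enumerate all simple paths between src and dst in the skeleton."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in adj[src]:
        if nxt not in path:
            yield from undirected_paths(nxt, dst, path)

def classify(path):
    """A path is causal iff every step traverses an arrow head-first."""
    forward = all((a, b) in edges for a, b in zip(path, path[1:]))
    return "causal" if forward else "non-causal"

for p in undirected_paths("Drug Administration", "Outcome"):
    print(" -> ".join(p), ":", classify(p))
```

Running this recovers exactly the two paths discussed above: the direct causal path #7 and the non-causal path #6 through Side Effects and Gender.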
| Treatment | Patient Recovered: Yes | Patient Recovered: No |
|---|---|---|
| Yes | 50% | 50% |
| No | 40% | 60% |
| Gender | Treatment | Patient Recovered: Yes | Patient Recovered: No |
|---|---|---|---|
| Male | Yes | 60% | 40% |
| Male | No | 70% | 30% |
| Female | Yes | 20% | 80% |
| Female | No | 30% | 70% |
Marketing Mix Modeling and Optimization
Half the money I spend on advertising is wasted; the trouble is I don’t know which half.
Over the last century, various versions of this quote have been attributed to John Wanamaker, Henry Ford, and William Procter, among others. Yet, 100 years after these marketing pioneers, in this day and age of big data and advanced analytics, the quote still rings true among marketing executives. The ideal composition of advertising and marketing efforts remains the industry’s Holy Grail. Certainly, there are many advertising agencies and market research firms that promote their proprietary methodology in pursuit of the optimum allocation of marketing resources. Also, there have been decades of research in marketing science on this topic. Yet, despite all commercial and academic efforts, there is a remarkable lack of universally accepted methods for marketing mix modeling and optimization. As a result, the current practice remains “more art than science.”
We speculate that the lack of a well-established marketing mix methodology has little to do with the domain itself. Rather, it reflects the fact that marketing is yet another domain that frequently has to rely on non-experimental data for decision support. As such, marketing mix optimization is a rather prototypical problem that mirrors the challenges of many other fields.
What is perhaps unique to marketing is the large number of instruments, i.e., the wide range of advertising channels and promotions, that can be utilized as individual levers in reaching and convincing consumers. Moreover, many marketing instruments can be easily quantified in terms of cost. Hence, the marketing domain lends itself as a teaching example for this chapter.
We argued that the arc from the treatment to the outcome, i.e., the direct causal path, represents the causal effect. By adjusting for the confounder and thus blocking the non-causal path, we were able to isolate the "direct effect" of the treatment on the outcome.
However, had we not adjusted for the confounder, both the causal and the non-causal path would have remained open, and we would have obtained the "total effect." This is indeed the nomenclature we follow in BayesiaLab.
By invoking Direct Effect, BayesiaLab will automatically perform Likelihood Matching on all Confounders and estimate the causal effect.
To emphasize the distinction between Direct Effect and Total Effect, we look one more time at Simpson’s Paradox (SimpsonsParadox.xbl).
Click X and then select Main Menu > Analysis > Report > Target Analysis > Direct Effects on Target.
We immediately obtain a report that shows a Direct Effect of −0.1. This value is identical to the Average Treatment Effect we computed in the previous chapter. As expected, adjustment by stratification, Graph Surgery, and Likelihood Matching provides the same effect estimate.
For comparison, we now estimate the Total Effect:
Select Main Menu > Analysis > Report > Target Analysis > Total Effects on Target.
The resulting report window now shows the Total Effect, which amounts to +0.1.
This result matches the naive estimator, i.e., the effect we observe when considering the whole population, which is clearly not the causal effect. So, why would we need to estimate the Total Effect at all? Because it is the correct estimator for observational inference, i.e., prediction. If we were merely observing treated versus untreated patients instead of performing an intervention, the Total Effect would provide the expected change in the outcome variable.
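Both estimates can be reproduced by hand from the recovery tables shown earlier. The sketch below computes the Total Effect from the aggregated table and the Direct Effect by adjustment (stratification on Gender); the 50/50 gender split is an assumption, although the result here is the same for any split because both strata agree.

```python
# Aggregated table: P(Recovered = Yes | Treatment)
p_rec_agg = {"Yes": 0.50, "No": 0.40}

# Stratified table: P(Recovered = Yes | Gender, Treatment)
p_rec = {("Male", "Yes"): 0.60, ("Male", "No"): 0.70,
         ("Female", "Yes"): 0.20, ("Female", "No"): 0.30}

# Total Effect: the naive observational difference in the whole population
total_effect = p_rec_agg["Yes"] - p_rec_agg["No"]        # +0.1

# Direct Effect: adjustment by stratification on Gender
p_gender = {"Male": 0.5, "Female": 0.5}                  # assumed marginal
direct_effect = sum(p_gender[g] * (p_rec[(g, "Yes")] - p_rec[(g, "No")])
                    for g in p_gender)                   # -0.1
```

The sign flip between +0.1 and −0.1 is precisely Simpson's Paradox: the observational and causal effect estimates point in opposite directions.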
We now return to our marketing example for good and utilize Likelihood Matching to estimate the Direct Effect of each driver variable on the Target Node Sales.
From within the Validation Mode, select Main Menu > Analysis > Report > Target > Direct Effects on Target.
This prompts BayesiaLab to estimate the Direct Effects of each driver variable with regard to the Target Node while performing Likelihood Matching on all Confounders.
The resulting table resembles the typical output we would obtain from a linear regression analysis with parameter estimates for each covariate. As such, we may be tempted to interpret the Direct Effect as the slope of a response curve. Indeed, BayesiaLab computes the Direct Effect as the derivative of the response curve around the mean of the values of each driver. If each response curve were linear, the Direct Effect would indeed be a meaningful value for characterizing the entire curve. The question is, does this assumption of linearity hold? In Simpson’s Paradox, it certainly did. Due to the binary nature of all variables, the example was inherently linear. Hence, computing a single coefficient for the Direct Effect was adequate for describing the causal effect.
In this marketing mix example, however, we can make no such assumption. Rather than speculating about the nature of the relationships, we let BayesiaLab estimate the response curves, whatever their shapes might be:
Select Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Direct Effects.
Then, from the options, choose:
Target: Mean
Variables: Mean
Use Hard Evidence
Click Display Sensitivity Chart, which generates a plot of Sales as a function of each driver variable.
Note that the nodes in the Class Non_Confounder are not included here.
Also, all drivers are represented with their original scales, so Weekday (Weekday ∈ {1,...,7}), Quarter (Quarter ∈ {1,...,4}), and End-of-Month Indicator (End-of-Month Indicator ∈ {0,1}) are all squeezed into the leftmost portion of the plot. Later, we will "decompress" the plot by normalizing the drivers' value range so they all appear on a 1–100 scale.
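The normalization itself is straightforward. A minimal sketch of the mapping, assuming a simple linear rescaling onto the 1–100 scale mentioned above:

```python
def normalize(value, lo, hi):
    """Map a driver value from its native domain [lo, hi] onto a 1-100 scale."""
    return 1 + 99 * (value - lo) / (hi - lo)

# Examples with the calendar variables' native domains
print(normalize(1, 1, 7))   # Weekday = Monday  -> 1.0
print(normalize(7, 1, 7))   # Weekday = Sunday  -> 100.0
print(normalize(1, 0, 1))   # End-of-Month Indicator = 1 -> 100.0
```

On this common scale, a driver at its minimum always plots at 1 and at its maximum at 100, which makes the response curves visually comparable.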
For now, however, we want to focus on a single driver:
Remove all curves by clicking the All Curves checkbox.
Select only Incentives, which leaves one curve.
The x-values of the points on the curve correspond to the mean values of the discretized states of Incentives. Given that we discretized Incentives into 5 bins, we have 5 discrete x-values. The y-values are the expected values of the Target Node Sales at each corresponding x-value of Incentives.
It is important to understand that while the node Incentives varies in value, all Confounders are balanced through Likelihood Matching in such a way that Incentives is independent of all the Confounders. With that, we can consider setting each value of Incentives as a deliberate intervention, and the changes to outcome variable Sales are the causal effect of changing Incentives. Thus, the curve we see is a causal response curve.
Given the importance of Target Mean Analysis, we now simulate the curve plotting process step by step. We show what is happening in BayesiaLab "behind the scenes" as the curve is plotted using Direct Effects.
Select the Monitors of all Confounders.
Apply Fix Probabilities to all Confounders.
This "fixed" status is indicated by purple bars in the Monitors of the Confounders.
Note that you must not fix the probabilities of the Non-Confounders. Their Monitor bars have to remain blue. Incentives and Sales, of course, must remain unfixed as well. The former you will manipulate, and the latter's response you want to observe.
Set Incentives to each possible state, from the lowest to the highest. Likelihood Matching maintains the distributions of all the Confounders while Sales and the Non-Confounders can respond to the intervention.
To further illustrate this important process, we have extracted the Monitors for Incentives and Sales from the Monitor Panel above and lined them up side-by-side:
For instance, given the state Incentives<=24.343, which has a mean value of 16.070, Sales has a mean value of 285.063 (see leftmost panel). So, the mean values of Incentives and Sales are the x and y coordinates of the first point on the response curve below. The remaining points on the curve are formed in the same way.
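The mechanics of forming these curve points can be sketched with synthetic data. One important caveat: the sketch below merely conditions on each bin, whereas BayesiaLab computes Posterior Mean Values while Likelihood Matching holds the Confounders fixed. The data-generating function is invented for illustration.

```python
import random

random.seed(0)

# Synthetic observations standing in for the Incentives/Sales data (assumed)
data = []
for _ in range(1000):
    x = random.uniform(0, 100)                      # Incentives
    data.append((x, 250 + 0.8 * x + random.gauss(0, 5)))  # Sales

# Discretize Incentives into 5 equal-width bins and form one curve point per
# bin: x = mean of Incentives within the bin, y = mean of Sales within the bin
bins = [(20 * i, 20 * (i + 1)) for i in range(5)]
curve = []
for lo, hi in bins:
    members = [(x, y) for x, y in data if lo <= x < hi]
    xs, ys = zip(*members)
    curve.append((sum(xs) / len(xs), sum(ys) / len(ys)))
```

Each tuple in `curve` corresponds to one point of the response curve, exactly as the (16.070, 285.063) pair does in the example above.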
Note that this step-by-step approach was only meant to show what BayesiaLab is performing in the background whenever you invoke the Target Mean Analysis plot.
Now that we have explained how the Target Mean Analysis plot is generated, you can let BayesiaLab perform it again automatically for all drivers, similar to what we did in its first run:
Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Direct Effects.
In that first run, however, it was difficult to interpret and compare the curves because the marketing variables were all recorded on different scales.
In this run, select Normalize in the dialog box, which brings all x-values on a common 0–100 scale.
Once the plot appears, deselect the calendar-related variables, i.e., Weekday, Quarter, Month, and End-of-Month Indicator.
We leave them out for now as they are of lesser interest to us—we can't modify the calendar after all. Later in our analysis, we will assign a special status to them to formally exclude these variables from being optimized.
This provides an informative picture. We can now characterize the response of Sales to the drivers that ACME has under control. More specifically, we observe the exclusive Direct Effect of each driver on Sales without confounding effects through the other variables.
What is perhaps most striking in this plot is that many of the curves are non-linear. Clearly, any assumption of linearity would not have held. The Direct Effects on Target report, which we used earlier to estimate the slope of these curves, entirely obscured the dynamics we can observe now.
Furthermore, we can derive several important insights from this plot. For instance, the response curve for TV Advertising rises quickly around its middle values, peaks, and then declines. Direct Marketing looks like an upside-down U, suggesting that there is a “sweet spot” in terms of marketing exposure. The curve for Print Advertising looks S-shaped, while the variable Incentives appears to be exponentially linked to Sales.
The “wild mix” of response curve patterns highlights the inherent difficulty of marketing mix optimization. While the curves themselves may be individually meaningful to a marketing expert, it is far from obvious how much should be allocated to each marketing channel within the constraints of an overall marketing budget.
Sales
TV Advertising
Internet Advertising
Print Advertising
Direct Marketing
Incentives (i.e., price discounts)
Quarter
Weekday
Month
End-of-Month Indicator
Co-Op Promotions
Competitive Incentives
Web Traffic
Showroom Traffic
Test Drives
As we proceed, we furthermore assume that there are no unobserved Confounders. Such an assumption can only be justified on theoretical grounds. Given that this example represents a fictional domain, there is no purpose in debating the validity of the assumption.
Now we are ready to set the parameters of the Target Optimization.
Select Main Menu > Analysis > Target Optimization > Genetic.
Set Profile Search Criterion to Mean.
Set Criterion Optimization to Maximization.
Check Take Into Account the Resources.
This means that for each to-be-tested scenario, BayesiaLab computes the value of the Class Resource, i.e., the value of the Function Node F1. Furthermore, you need to specify a range of acceptable values.
Check Target Resources and set this value to the available budget. By default, the value is set to the current value of F1. In our case, it is the sum of the means of the marginal distributions of the Confounders.
Do not check Take Into Account the Joint Probability, as it does not apply to this example.
It would be applicable to models based on disaggregated data, e.g., with cross-sectional data representing the behavior of individuals.
Some optimization tasks may require a trade-off between constraints and target achievement. The Weighting option allows us to prioritize specific optimization criteria. Set Resources to 10 to ensure that the search algorithm stays as close as possible to the specified Target Resources.
Search Method depends on the problem domain. We discussed a similar set of options in the context of Target Dynamic Profile.
Set Numerical Evidence Proportional to: Mean.
Set Distribution Estimation Method to Binary.
Here, the optimization algorithm needs to modify the mean of each variable by setting Binary evidence (see Chapter 7, Binary Evidence). This is important because we are trying to determine which specific values—not distributions—achieve the maximum level of Sales. From a theoretical viewpoint, we could certainly search for the optimal distribution, but that is presumably not practical. Marketing budgets get approved and allocated on the basis of single-point dollar values, not distributions.
Clicking the Edit Variations button brings up the Variation Editor.
Set Intermediate Points to 10.
This setting refers to the number of points to test for each variable. To manage a potentially large computing load, it is good practice to start with a smaller number of Intermediate Points, e.g., 10, and later perform a search with a finer grid.
Check Direct Effects. As we have emphasized throughout this chapter, we are looking for the causal effects of the driver variables. Otherwise, BayesiaLab would perform an optimization based on the Total Effect and produce meaningless results for our purposes.
Output determines how BayesiaLab stores the solutions found by Target Optimization. In our case, we want to save all new scenarios and overwrite any existing scenarios. Also, as the optimization can carry on for a long time, it is important to know that it can be stopped anytime. In that case, all solutions computed up to that point are saved as Evidence Scenarios.
Set Return the n Best Solutions to 10.
In terms of the Genetic Settings, we recommend keeping the defaults. Computer scientists will be familiar with these algorithm-related options. For the type of problem at hand, however, little can be gained by changing the defaults.
By definition, genetic search algorithms continue mutating scenarios endlessly in order to find better solutions. Such an algorithm will never come to a natural conclusion. Hence, the Genetic Stop Settings are a practical way to stop the algorithm when no improvement has been observed after a certain number of iterations.
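The overall search-and-stop logic can be sketched as a toy random-mutation search. The response surface, cost function, and parameter values below are invented for illustration and do not represent BayesiaLab's genetic algorithm; the sketch only mirrors the roles of the Weighting option and the Genetic Stop Settings.

```python
import random

random.seed(42)

def sales(levels):
    """Toy concave response surface standing in for the learned model (assumed)."""
    return sum(10 * x - 0.5 * x * x for x in levels)

def resources(levels):
    """Toy cost function standing in for the Function Node F1 (assumed)."""
    return sum(levels)

BUDGET, WEIGHT, PATIENCE = 30.0, 10.0, 200

def score(levels):
    # Weighted objective: target achievement minus a penalty for straying
    # from the Target Resources (cf. setting the Resources weight to 10)
    return sales(levels) - WEIGHT * abs(resources(levels) - BUDGET)

best = [2.0] * 5                      # initial, deliberately sub-optimal mix
best_score, stale = score(best), 0
while stale < PATIENCE:               # Genetic Stop Setting analogue
    candidate = [max(0.0, x + random.gauss(0, 1)) for x in best]
    if score(candidate) > best_score:
        best, best_score, stale = candidate, score(candidate), 0
    else:
        stale += 1                    # stop after PATIENCE non-improving tries
```

The loop has no predetermined endpoint, which is why the report mentions there is no progress bar: it simply terminates once no improvement has been found for a fixed number of iterations.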
Clicking OK starts the optimization, and we can monitor the activity in the status line at the bottom of the screen.
Once the Genetic Stop Criterion is met or when you manually terminate the optimization, BayesiaLab delivers the Optimization Report for the Target Node Sales.
Note that there is no progress bar, as no predetermined endpoint exists for a genetic optimization algorithm. We can, however, monitor the score to see the development of solutions.
Initial States: Value/Mean refers to the mean value of the marginal distribution of the Target Node Sales.
Resources represents the corresponding value of F1.
Search Methods repeats the options we chose. While this may appear redundant, it is critically important to understand the precise conditions under which the solution was found.
For the same purpose, the report lists the Not Fixed Nodes, which are the Non-Confounders that we had defined, meaning that they were not included in Likelihood Matching.
We find the main part of the report in the Synthesis table. Here, we see the Initial State, which lists the mean values of the marginal distributions of the drivers. Best Solution shows the optimal levels of each driver. The values in parentheses indicate the deviation from the Initial State. Thus, we can easily see by what amount the marketing variables need to be changed. For example, the report recommends decreasing the Direct Marketing spend by 5.736 units.
In the Best Solutions table at the bottom of the report, BayesiaLab lists the top scenarios identified plus the corresponding achievement in terms of the Target Node Sales. This tells us, for instance, that Sales will increase to 310.917 if the marketing variables are set to the specified levels. At the same time, the Resources would amount to 171.364, which is very close to the specified constraint.
If we suspect that we might have missed out on a potentially better solution, we could re-run the optimization using more Intermediate Points or set a different Stop Criterion. The potential extent of the search is ultimately a function of the available computing resources. However, given the size of the search space and the non-exhaustive nature of genetic search algorithms, we can never be certain to have found the optimum.
It’s Causal!
At this point, we need not shy away from causal language. All our causal assumptions had been explicitly specified earlier, and on that basis, we performed a causal optimization. If the assumptions can be justified, the effect estimation is causal and does not need to be circumscribed with cautious language. Given our input and applying hard-and-fast identification criteria, we are not in a gray area in terms of the causal interpretation of the results. If the results are to be challenged regarding their causal validity, we need to go straight back to the assumptions. The causal claims stand and fall with the assumptions we make, not with the estimation techniques.
Evidence Scenarios
While we may immediately gravitate toward the best solution listed first in the solutions table, we can examine all proposed solutions, which are stored as Evidence Scenarios:
Check Include Not Observable to see the complete solution, including the values of the Not-Observable nodes.
Select the Evidence Set with Index 0, which sets all nodes to the values specified under Best Solution in the Optimization Report.
This simulation confirms what we obtained in the Optimization Report, i.e., we can increase Sales by almost 15 units (per day) with the same budget. As a by-product of retrieving this Evidence Set, we can observe the corresponding values of the Non-Confounders. Similarly, we can evaluate the remaining scenarios for their plausibility. A seemingly inferior scenario could perhaps be more practical to implement.
For reference, all Monitors corresponding to Evidence Set 0 are shown below:
This chapter presented a comprehensive workflow for optimizing the marketing mix of an organization. The key to this approach was using a machine-learned Bayesian network model in combination with causal assumptions in the form of confounder selection according to the Disjunctive Cause Criterion. BayesiaLab's Likelihood Matching algorithm facilitated the causal effect estimation and the subsequent optimization of marketing drivers.
The question of budget brings up another issue. Thus far, all variables are shown on their original, proprietary scale without any cost information. For instance, we have not yet defined how much “one unit” of TV Advertising costs in dollars. Prior to version 6 of BayesiaLab, the Cost property was available to specify the unit cost for each variable. At the time, one could have specified that 1 GRP costs $1,000. However, real-world applications are not as straightforward as having a fixed price per unit. As is the case with most business transactions, volume discounts may apply that need to be considered when optimizing media spend.
With BayesiaLab 6, we introduced the concept of Function Nodes. They facilitate the computation of scalar values based on the distribution and values of the states in nodes. This is best illustrated in the context of our example. We will now use a Function Node to “translate” the original units of TV Advertising into dollar values.
A Function Node calculates values ad hoc; as such, it does not correspond to any variable in the original dataset.
Position the node on the Graph Panel. By default, the first Function Node to be introduced has the name F1.
Go into Arc Creation Mode and draw an arc from TV Advertising to F1.
For random nodes, the warning symbol means that the Conditional Probability Table has not been estimated yet. In the context of Function Nodes, it means that an equation has yet to be defined that will determine the value of the Function Node.
Open the Function Node F1 by double-clicking on it, which brings up the Node Editor.
Note the TV Advertising node listed in the center of the three panels at the bottom of the window. This is where the parent nodes of a Function Node are shown. In our case, TV Advertising is currently the only parent node. This means that TV Advertising is the only variable that can be included in the Equation Tab in the top panel of the window.
Whereas a Function Node, such as F1, represents scalar values, "normal" nodes, such as TV Advertising, always represent distributions of states.
This is where the functions in the bottom left panel come into play, in particular the Inference Functions. We can use them to extract a scalar statistic from TV Advertising, which F1 will then represent as a scalar value.
Here, as a first step, we want F1 to represent the mean value of the cost of TV Advertising:
Double-click on Inference Functions and then double-click on MeanValue(v). This adds the inference function to the Equation Tab. By default, a placeholder variable v is highlighted in the equation.
You can single-click on TV Advertising to bring up the domain range of this variable for your information.
Double-click on TV Advertising to add ?TV Advertising? to the Equation Tab. ?TV Advertising? should automatically assume the position of v if that placeholder was still highlighted. The final syntax will appear as ?F1?=MeanValue(?TV Advertising?).
Click Validate to check the syntax and have BayesiaLab compute the value of F1.
Where are we going with this? We will now add further parents to F1 so that we can calculate the cumulative cost of all Confounders. We simply draw arcs from all the Confounders to F1.
Select the Arc Creation Mode and draw arcs from all Confounders to F1.
Hold L to remain in the Arc Creation Mode so you can keep adding arcs without having to go back to the toolbar.
Recall that we did not draw any arcs from Weekday, Month, Quarter, and End-of-Month Indicator to F1. The reason is that these calendar-related variables are beyond the control of ACME: ACME cannot buy more Saturdays or pay for a longer fourth quarter. Nor do we draw arcs from the Non-Confounders to F1, as ACME cannot directly control them through their budget either.
The Function Node F1 can now serve as a "summary node" for all Confounders. If all Confounders were recorded on the same scale and had the same cost of $1/unit, you could sum up the mean values of all the Confounders:
Double-click F1 to open the Node Editor.
Enter this expression into the Equation Tab: MeanValue(?TV?)+MeanValue(?Print?)+MeanValue(?Online?)+MeanValue(?Incentives?)+MeanValue(?Direct Marketing?)
Instead of typing the entire syntax, you can also add the inference function MeanValue() and all listed Confounders by double-clicking on the respective items in the lower panels of the Node Editor.
Needless to say, assuming a cost of $1/unit for each type of advertising is entirely unrealistic. And the purpose of having a Function Node is that we can enter any arbitrary cost function for each advertising channel. A typical example would be entering a quantity discount in the form of an if/then statement:
?F1?=IF(MeanValue(?Print Advertising?)>=10, 0.9*MeanValue(?Print Advertising?),MeanValue(?Print Advertising?))
This statement would discount the cost of Print Advertising by 10% once 10 units are reached. This way, you can define even complex pricing and discount structures. Why is this so important? The optimization algorithm that BayesiaLab employs can take advantage of any such additional nonlinearities.
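Outside BayesiaLab, the same discount rule is just an ordinary function. The sketch below mirrors the F1 equation above, with the 10-unit threshold and 10% discount taken from the example; the $1/unit cost for the other channels is the simplifying assumption used in this chapter.

```python
def print_advertising_cost(mean_units: float) -> float:
    """Mirror of the F1 equation: 10% volume discount at or above 10 units."""
    return 0.9 * mean_units if mean_units >= 10 else mean_units

def total_cost(mean_values: dict) -> float:
    # F1-style summary: cumulative cost of all spend variables at $1/unit,
    # with the volume discount applied to Print Advertising only
    return sum(print_advertising_cost(v) if name == "Print Advertising" else v
               for name, v in mean_values.items())

# Example: 20 units of Print Advertising are discounted, 5 units of TV are not
print(total_cost({"Print Advertising": 20.0, "TV Advertising": 5.0}))
```

Any nonlinearity encoded this way, however complex, becomes part of the resource computation that the optimization algorithm evaluates for each candidate scenario.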
However, given that our model is based on synthetic data that already features plenty of nonlinearities, it serves no educational purpose to add further artifacts to our problem domain. Hence, we stick with a fictional cost of $1/unit for each advertising channel.
A Function Node is a highly flexible element in BayesiaLab and can play many different roles. Here, we justified its use by our need to calculate the cumulative cost of all advertising efforts.
However, not only do we need to know the total cost, but we also need to constrain it in the subsequent optimization. If cost were no object, we could simply read the optima from the Target Mean Analysis plot by taking the x-levels that correspond to the maximum y-levels for each driver. Alas, budget constraints do apply in the real world.
In BayesiaLab, we can formalize the "budget" role of F1 by adding it to the pre-defined Class Resource.
Right-click on F1.
From the Contextual Menu, select Properties > Classes > Add.
Then, check Predefined Class and select Resource from the drop-down menu.
Click Yes to conclude the step.
Note that we could add additional Function Nodes to this Class. This way, the Class Resource can represent the sum of multiple Function Nodes.
We previously pointed out that Quarter, Weekday, Month, and End-of-Month Indicator have no monetary cost. The positive effect of Weekday=Saturday on Sales is a "free" benefit to ACME. The absence of arcs going into F1 already prevents these nodes from being included in the monetary cost summary in F1.
However, as we prepare for optimization, we must also encode formally that these variables cannot be modified or influenced by anyone. In other words, ACME cannot manipulate them. Thus, Quarter, Weekday, Month, and End-of-Month Indicator require a special designation so that BayesiaLab includes them in the Likelihood Matching of the Confounders but excludes them from active manipulation during optimization.
Unfortunately, the BayesiaLab jargon will now become a bit convoluted. The “do-not-manipulate” assignment of variables is done via their Cost attribute. This Cost is not to be confused with the monetary cost in terms of dollars, which is computed by F1. In general, the Cost attribute of a variable quantifies the effort required to observe a variable (see the Diagnosis example in Chapter 6). A special case is Cost=0, which makes a variable Not Observable in BayesiaLab. In the context of the calendar variables in our example, this nomenclature is counterintuitive as we would think that the dates in the calendar are certainly observable.
However, it goes beyond the scope of this chapter to present the rationale of this terminology. For the purposes of this example, we set Cost=0 to exclude the calendar variables from being manipulated during the optimization.
Select the calendar variables and then right-click on any one of them.
From the Contextual Menu, select Properties > Cost.
Uncheck the Cost box or enter 0 in the Edit Cost window.
Click OK.
The Not Observable status of the variables is now reflected in the light purple color of the calendar nodes.
Why are we using the term "Direct Effect" instead of "causal effect," which is obviously what we are looking for? It helps to recall the Simpson's Paradox example from the previous chapter. Through path analysis, we were able to distinguish between causal (blue) and non-causal (pink) paths.
We already discussed the Variation Editor in the context of Target Dynamic Profile. Here, we can further constrain individual variables in addition to the overall constraint determined by the specified Target Resources. A typical reason could be that some marketing expenditures are locked into long-term contracts and cannot be changed.
Additionally, we need to take note of any warning symbols appearing in the bottom right corner of the main window. As Likelihood Matching is performed repeatedly throughout the optimization, there is the possibility of the Likelihood Matching algorithm—not the Target Optimization algorithm—being unable to converge. A yellow warning triangle icon indicates this particular condition. Further details can be retrieved from the Console.
Right-click on the Evidence Scenario icon.
In Modeling Mode, activate the Function Node Creation Mode by clicking the corresponding icon on the toolbar.