Our modeling process begins with importing the dataset. You can download this dataset in CSV format via the link below or from data.world.
Note that this dataset from the Wisconsin Breast Cancer Database differs from the one we used in the original, printed edition of this book.
We start the Data Import Wizard with Main Menu > Data > Open Data Source > Text File.
Next, we select the file WBCD2.CSV. Then, the Data Import Wizard guides us through the required steps.
In Step 1 of the Data Import Wizard, we click Define Learning/Test Sets and specify that a Test Set should be set aside from the dataset to be imported.
We specify a random sample of 20% of the entire dataset to serve as a Test Set. The remaining 80% will serve as the Learning Set.
If you follow this tutorial and want to replicate the exact numerical values we present here, please check Fixed Seed under Options and set its value to 31. This ensures that the random number generator produces the same Learning Set and Test Set split that we use here.
In Step 2 of the Data Import Wizard, BayesiaLab suggests a data type for each variable.
It identifies the diagnosis variable as Discrete, and all the feature variables are interpreted as Continuous. These default assignments are all correct.
We only need to correct the variable id, which BayesiaLab initially considers Continuous. However, id is a code to identify each patient, so we must specify it as a Row Identifier.
In Step 3 of the Data Import Wizard, no action is required. Our dataset has no missing values, so applying any Missing Values Processing is unnecessary.
However, given that many datasets do contain missing values, we devoted an entire chapter to dealing with that problem. Please see Chapter 9: Missing Values Processing.
In Step 4 of the Data Import Wizard, we need to discretize the Continuous variables in the dataset. Even though we could specify a discretization method for each Continuous variable separately, we want to apply the same algorithm to all.
So, we click Select All Continuous, and all Continuous variables are highlighted in the data table. The Discretization Type and all related options will now apply to all the selected nodes.
Given that we are building a model to predict the target variable diagnosis, it makes perfect sense to discretize all the continuous feature variables with that objective in mind.
Thus, we choose the Tree algorithm from the drop-down menu in the Multiple Discretization panel. The Tree algorithm attempts to find discretization thresholds so that each feature variable's information gain with regard to the target variable is maximized.
Note that the Tree algorithm requires a Target, which has to be a Discrete variable or a Continuous variable that has already been discretized. In our context, diagnosis is Discrete and, therefore, available from the Target dropdown menu.
Note that the discretization algorithm only uses the records from the Learning Set for creating the discretization threshold. If a Learning/Test Set split is specified in Step 1 of the Data Import Wizard, BayesiaLab automatically restricts the discretization algorithms to the Learning Set.
While you can also create a Learning/Test Set split after completing the Data Import Wizard, it would compromise the Test Set. Such a Test Set would no longer be properly out-of-sample as it would have contributed to the discretization.
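To make the idea of target-driven discretization more concrete, here is a minimal Python sketch of a tree-based binning of a single feature against the diagnosis labels. It uses scikit-learn's DecisionTreeClassifier as a stand-in for BayesiaLab's Tree algorithm, and the data in the example are synthetic, so the thresholds are purely illustrative.

```python
# Conceptual sketch of supervised, tree-based discretization (not BayesiaLab's code):
# split one continuous feature into at most 3 intervals by maximizing information
# gain (entropy criterion) with respect to the binary diagnosis target.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_discretize(x, y, max_bins=3):
    """Return the threshold values that split feature x into at most max_bins intervals."""
    tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=max_bins)
    tree.fit(x.reshape(-1, 1), y)
    # Internal nodes hold the split thresholds; leaves are flagged with feature == -2.
    thresholds = tree.tree_.threshold[tree.tree_.feature != -2]
    return np.sort(thresholds)

# Synthetic stand-in for a feature such as radius_mean, restricted to the Learning Set.
rng = np.random.default_rng(31)
x_learn = np.concatenate([rng.normal(12, 2, 300), rng.normal(18, 3, 156)])
y_learn = np.array(["B"] * 300 + ["M"] * 156)
print(tree_discretize(x_learn, y_learn))  # e.g., two thresholds define three intervals
```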
Bayesian networks are non-parametric probabilistic models. Therefore, there is no hypothesis with regard to the form of the relationships between variables (e.g., linear, quadratic, exponential, etc.). However, this flexibility comes at a cost: the number of observations necessary to quantify probabilistic relationships is higher than that required by parametric models. We use the heuristic of five observations per probability cell, which implies that the bigger the probability tables, the more observations we need.
Two parameters affect the size of a probability table: the number of parents and the number of states of the parent and child nodes. A machine-learning algorithm usually determines the number of parents based on the strength of the relationships and the number of available observations. The number of states, however, is our choice, which we can set by means of Discretization (for Continuous variables) and Aggregation (for Discrete variables).
We can use our heuristic of five observations per probability cell to help us with the selection of the number of discretization Intervals:
We usually look for an odd number of states to be able to capture non-linear relationships. Given that we have a relatively small Learning Set of only 456 observations, we should estimate how many parents the heuristic would allow with a 3-state discretization:
No parent: 3×5=15
One parent: 3×3×5=45
Two parents: 3×3×3×5=135
Three parents: 3×3×3×3×5=405
Four parents: 3×3×3×3×3×5=1,215
Considering a discretization with 5 states, we would obtain the following:
No parent: 5×5=25
One parent: 5×5×5=125
Two parents: 5×5×5×5=625
By using this heuristic, we hypothesize about the size of the biggest CPT of the to-be-learned Bayesian network and multiply this value by 5. Experience tells us that this is a rather practical heuristic, which typically helps us find a structure. However, this is by no means a guarantee that we will find a precise quantification of the probabilistic relationships.
Indeed, our heuristic is based on the hypothesis that all the cells of the CPT are equally likely to be sampled. Of course, such an assumption cannot hold as the sampling probability of a cell depends on its probability, i.e., either a marginal probability if the node does not have parents, or if it does have parents, a joint probability defined by the parent states and the child state.
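To make the heuristic easy to recompute, here is a short Python sketch that expresses the five-observations-per-cell rule; the function name is arbitrary, and the example calls simply restate the scenarios listed above.

```python
# Heuristic: roughly 5 observations per cell of the child node's conditional
# probability table (CPT), which has (child states) x (product of parent states) cells.
def required_observations(child_states, parent_states=(), obs_per_cell=5):
    cells = child_states
    for s in parent_states:
        cells *= s
    return cells * obs_per_cell

print(required_observations(3, (3, 3, 3)))     # 405 <= 456 learning records: three parents fit
print(required_observations(3, (3, 3, 3, 3)))  # 1,215 > 456: four parents exceed the heuristic
print(required_observations(5, (5, 5)))        # 625 > 456: with 5 states, even two parents are too many
```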
Given our 456 observations and the scenarios listed above, we select a discretization scheme with a maximum of 3 states. This is a maximum in the sense that the Tree discretization algorithm could return 2 states if 3 were not needed. This happens to be the case for fractal_dimension_se and texture_se.
Upon clicking Finish, BayesiaLab imports and discretizes the entire dataset and concludes with Step 5 of the Data Import Wizard by offering an Import Report.
Clicking Yes brings up the Import Report.
It is interesting to see that all the variables have indeed been discretized with the Tree algorithm and that all Discretization Intervals are variable-specific.
This means that all the variables are marginally dependent on the Target Node (and vice versa): had a variable been independent of diagnosis, the Tree algorithm could not have found any threshold providing information gain for it. This is promising: the more dependent variables we have, the easier it should be to learn a good model for predicting the Target Node.
Upon closing the Import Report, we see a representation of the newly imported database as a fully unconnected Bayesian network in the Graph Window.
In the dataset, the variable diagnosis contained the codes B and M, representing Benign and Malignant, respectively. For reading the analysis reports, however, it will be easier to work with a proper State Name instead of an abbreviation.
By double-clicking the node diagnosis, we open the Node Editor and then go to the State Names tab. There, we associate States B and M with new State Names:
B → Benign
M → Malignant
With all variables represented as nodes in the Graph Window, we are ready to proceed to Supervised Learning in this tutorial.
The objective of what we call Supervised Learning is no different from that of predictive modeling. We wish to find regularities (a model) between the target variable and potential predictors from observations (e.g., historical data). Such a model will allow us to infer a distribution of the target variable from new observations. If the target variable is Continuous, the predicted distribution produces an expected value. For a Discrete target variable, we perform classification. The latter will be the objective of the example in this chapter.
As part of their studies in the late 1980s and 1990s, the research team generated what became known as the Wisconsin Breast Cancer Database, which contains measurements of hundreds of FNA samples and the associated diagnoses. Several versions of this database have been extensively studied, even outside the medical field. Statisticians and computer scientists have proposed a wide range of techniques for this classification problem and have continuously raised the benchmark for predictive performance.
The objective of this chapter is to show how Bayesian networks, in conjunction with machine learning, can be used for classification. Furthermore, we wish to illustrate how Bayesian networks can help researchers generate a deeper understanding of the underlying problem domain. Beyond merely producing predictions, we can use Bayesian networks to precisely quantify the importance of individual variables and employ BayesiaLab to help identify the most efficient path towards diagnosis.
“Most breast cancers are detected by the patient as a lump in the breast. The majority of breast lumps are benign, so it is the physician’s responsibility to diagnose breast cancer, that is, to distinguish benign lumps from malignant ones. There are three available methods for diagnosing breast cancer: mammography, FNA with visual interpretation, and surgical biopsy. The reported sensitivity (i.e., ability to correctly diagnose cancer when the disease is present) of mammography varies from 68% to 79%, of FNA with visual interpretation from 65% to 98%, and of surgical biopsy close to 100%.
Therefore mammography lacks sensitivity, FNA sensitivity varies widely, and surgical biopsy, although accurate, is invasive, time-consuming, and costly. The goal of the diagnostic aspect of our research is to develop a relatively objective system that diagnoses FNAs with an accuracy that approaches the best achieved visually.”
The Wisconsin Breast Cancer Database was created through the clinical work of Dr. William H. Wolberg at the University of Wisconsin Hospitals in Madison.
The dataset we are using for this tutorial contains 569 patient records, which contain a diagnosis plus features that were computed from digital images of fine-needle aspirates (FNA) of breast masses. More specifically, these features characterize the cell nuclei contained in the tissue samples.
ID number
Diagnosis (M=malignant, B=benign)
radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension ("coastline approximation" - 1)
For each feature, the mean, standard error, and "worst" or largest (mean of the three largest values) were computed. For this tutorial, however, we only use the mean values as variables.
The diagnosis variable was established via subsequent biopsies or long-term monitoring of the tumor. It consists of two classes: 357 benign cases (62.7%) and 212 malignant cases (37.3%).
The following topics explain each step of the Supervised Learning workflow on the basis of this example.
Data Import and Discretization
Supervised Learning: Markov Blanket
Supervised Learning: Augmented Markov Blanket
Supervised Learning: Structural Coefficient Analysis
Inference: Automatic Evidence-Setting
Inference: Adaptive Questionnaire
Additionally, there is a tag on the database icon in the lower right corner of the Graph Window: this icon confirms that we have a Learning/Test Set split in place.
In earlier chapters, we defined the qualitative and quantitative parts of a Bayesian network from existing (human) knowledge and then described how we can define the qualitative part of a Bayesian network manually and use data to estimate the quantitative part. In this chapter, we use BayesiaLab to generate both the structure and the parameters of a network automatically from data. This means we introduce machine learning for building Bayesian networks. The only guidance (or constraint) we provide is defining the variable of interest, i.e., the target of the machine-learning process. Hence, we speak of Supervised Learning (in a later chapter, we will remove that constraint as well and perform Unsupervised Learning).
Given the sheer amount of medical knowledge in existence today, plus advances in artificial intelligence, so-called medical expert systems have emerged, which are meant to support physicians in performing medical diagnoses. In this context, several papers by Wolberg, Street, Heisey, and Mangasarian became much-cited examples. For instance, one of these papers proposed an automated method for the classification of Fine-Needle Aspirates (FNA) through image processing and machine learning, with the objective of achieving greater accuracy in distinguishing between malignant and benign cells for the diagnosis of breast cancer. At the time of their study, the practice of visual inspection of FNA yielded inconsistent diagnostic accuracy. The proposed new approach would increase this accuracy reliably to over 95%. This research was quickly translated into clinical practice and has since been applied with continued success.
To provide further background regarding this example, we quote the researchers:
You can download this dataset in CSV format via the link below or from data.world.
Up to this point, the differences in model structure and the corresponding performance were a result of the learning algorithm, i.e., Markov Blanket vs. Augmented Markov Blanket.
Now we explore how different levels of network complexity could potentially improve the Augmented Markov Blanket model. In other words, could a more complex network provide better performance without risking over-fitting?
To modify a network's complexity, we now introduce the Structural Coefficient.
Throughout this chapter, we abbreviate "Structural Coefficient" as "SC."
This parameter allows changing the internal number of observations N′ and, thus, determines a kind of “significance threshold” for network learning. Consequently, it influences the degree of complexity of the induced networks. The internal number of observations is defined as:
N′ = N ÷ SC
where N is the number of samples in the dataset.
By default, SC is set to 1, which reliably prevents the learning algorithms from overfitting the model to the data. However, in studies with relatively few observations, the analyst’s judgment is needed regarding whether a downward adjustment of this parameter can be justified. Reducing SC means increasing N′, which is like increasing the number of observations in the dataset via resampling.
On the other hand, increasing SC beyond 1 means reducing N′, which can help manage the complexity of networks learned from large datasets. Conceptually, reducing N′ is equivalent to subsampling the dataset.
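As a quick numerical illustration, and assuming the relationship N′ = N ÷ SC described above, the effective internal sample size for our 569-record dataset changes as follows:

```python
# Effective (internal) sample size for a few Structural Coefficient values,
# assuming N' = N / SC as described above.
N = 569
for sc in (2.0, 1.0, 0.5, 0.25):
    print(f"SC={sc}: N'={N / sc}")  # 284.5, 569.0, 1138.0, 2276.0
```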
We now perform a Structural Coefficient Analysis on the basis of the Augmented Markov Blanket, which generates several metrics that help to trade off between complexity and fit: Main Menu > Tools > Multi-Run > Structural Coefficient Analysis.
Note that we use the original Learning/Test Set split again, which allows us to directly compare the in-sample and out-of-sample predictive performance as a function of varying SC levels.
BayesiaLab prompts us to specify the range of SC values to be examined and the number of iterations to be performed. It is worth noting that the minimum SC value should not be set to 0, or even close to 0, without careful consideration.
An SC value of 0 would create a fully connected network, which can take a very long time to learn, depending on the number of variables, or even exceed the memory capacity of the computer running BayesiaLab. Technically, SC=0 implies an infinite dataset, which results in all relationships between nodes becoming significant.
Setting the Number of Iterations determines the interval steps to be taken within the specified range of the Structural Coefficient. We choose 10 iterations over an SC range between 0.1 and 1, which gives us increments of 0.1. With more complex models and more data, we might be more conservative and start with a narrower range, e.g., 0.5 to 1.
Clicking OK opens up a report that shows the range of changes due to modifying the Structural Coefficient.
We only show a portion of the report here and omit a discussion of its elements. For a thorough explanation of this report, please see Structural Coefficient Analysis.
Instead, we focus on the Curve function, which can be activated by clicking on the corresponding button at the bottom of the report. This tool can plot the metrics that we specified earlier in the settings as we started the Structural Coefficient Analysis.
Our objective is to determine the correct level of network complexity for reliably high predictive performance while avoiding the over-fitting of the data. By clicking Curve, we can plot several different metrics for this purpose.
Selecting Structure/Target Precision Ratio provides a helpful measure for making trade-offs between predictive performance and network complexity.
This plot can be best interpreted when following the curve from right to left. Moving to the left along the x-axis lowers the Structural Coefficient, which, in turn, results in a more complex Structure.
It becomes problematic when the Structure value increases faster than the Precision value, i.e., when we increase complexity without improving Precision.
Typically, the “elbow” of the L-shaped curve identifies this critical point. Here, a visual inspection suggests that the “elbow” is around SC=0.4. The portion of the curve further to the left on the x-axis, i.e., SC<0.4, shows that the structure is increasing without improving precision, which suggests overfitting. Hence, SC=0.4 could be a good value to examine further.
Another sign of overfitting is when the predictive performance of a model starts to diverge between the Learning Set and the Test Set. This means that the out-of-sample performance is no longer comparable to the in-sample performance.
This is precisely what we can observe with the Target Precision Curves for both the Learning Set and the Test Set. For SC>0.5, the curves are parallel, which means that in-sample and out-of-sample performance move in sync.
However, as the SC value drops below 0.5, the Learning Set performance increases while the Test Set performance drops, i.e., the curves diverge. The Target Precision for the Learning Set keeps increasing, while the Target Precision for the Test Set drops.
Having considered the curves in both of the above plots, we choose SC=0.5 for further evaluation.
The SC value can be set by right-clicking on the background of the Graph Panel and then selecting Edit Structural Coefficient from the Contextual Menu, or via the menu: Main Menu > Edit > Edit Structural Coefficient.
The SC value can then be set with a slider or by typing in a numerical value.
As expected, this produces a more complex network.
The key question is, will this increase in complexity deliver a performance advantage over the previously learned models?
So, we perform K-Folds Cross-Validation again, this time using the Augmented Markov Blanket at SC=0.5. The right panel in the overview below shows the results.
For comparison, we also show the performance of the earlier models we learned, i.e., the Markov Blanket (SC=1) in the left panel and the Augmented Markov Blanket (SC=1) in the center panel.
Among the myriad other available measures, we have typically referenced the Overall Precision for evaluation purposes. In this regard, the latest Augmented Markov Blanket (SC=0.5) does not show an improvement, i.e., the Overall Precision remains at 94.2%.
So, is there any benefit to the added complexity? It would depend on the context. Here, the objective is to distinguish between benign and malignant cell samples. Presumably, a false negative would be the worst forecast error. It would label a malignant sample as benign and perhaps cause a delay in a patient's treatment.
Focusing on the False Negative Rate of the three models, we see an improvement from 11.32% (left) to 9.43% (center) to 8.49% (right). In absolute terms, this means the best model reduces the number of False Negatives among the 212 Malignant cases from 24 to 18, i.e., by one quarter.
There are numerous other approaches available in BayesiaLab to help improve the model further, e.g., choosing a different learning algorithm, learning structural priors, reviewing the discretization, etc.
However, for the purposes of this tutorial, we conclude our model optimization efforts here and continue on the basis of the Augmented Markov Blanket (SC=0.5) in the next section: Model Inference.
Given our objective of predicting the state of the variable diagnosis, i.e., Benign versus Malignant, we define diagnosis as the Target Node.
We need to specify this explicitly so that the Supervised Learning algorithm can focus on the characterization of the Target Node rather than on a representation of the entire Joint Probability Distribution (JPD) of the learning set.
Upon defining the Target Node, all Supervised Learning algorithms become available under Main Menu > Learning > Supervised Learning.
Upon learning the Markov Blanket for diagnosis and after having applied the Automatic Layout (shortcut P), the resulting Bayesian network appears as follows:
We can see that the obtained network is a Naive structure on a subset of nodes.
This means that diagnosis has a direct probabilistic relationship with concave_points, fractal_dimension, texture, and perimeter.
All other nodes remain unconnected. The lack of their connections with the Target Node implies that these nodes are independent of the Target Node, given the nodes in the Markov Blanket.
Beyond distinguishing between predictors (connected nodes) and non-predictors (disconnected nodes), we can further examine each node's relationship with the Target Node diagnosis by highlighting the Mutual Information of the arcs connecting the nodes.
This allows us to examine the Mutual Information between all nodes and the Target Node diagnosis, which enables us to gauge the relative importance of the nodes.
The top value shown in the box attached to each arc is the absolute value of the Mutual Information.
Below, the percentage refers to the Symmetric Normalized Mutual Information.
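For readers who want to reproduce such values outside BayesiaLab, the following Python sketch estimates the Mutual Information (in bits) between a discretized feature and diagnosis. The symmetric normalization shown, 2·MI ÷ (H(X) + H(Y)), is a common definition but only an assumption about how the displayed percentage is derived.

```python
# Mutual Information between a discretized feature and the binary diagnosis.
# The normalization 2*MI/(H(X)+H(Y)) is an assumption and may differ from
# BayesiaLab's exact definition of the percentage shown on the arc.
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy_bits(v):
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mi_bits(x_binned, y):
    return mutual_info_score(x_binned, y) / np.log(2)  # convert nats to bits

def symmetric_normalized_mi(x_binned, y):
    return 2 * mi_bits(x_binned, y) / (entropy_bits(x_binned) + entropy_bits(y))

# Toy example with made-up interval codes (0/1) and diagnosis labels (0=Benign, 1=Malignant):
x = [0, 0, 1, 1, 1, 0]
y = [0, 0, 1, 1, 0, 0]
print(mi_bits(x, y), symmetric_normalized_mi(x, y))
```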
As we are not equipped with specific domain knowledge about the nodes, we will not further interpret these relationships but rather run an initial test regarding the Network Performance. We want to know how well this Markov Blanket model can predict the states of the diagnosis variable, i.e., Benign versus Malignant. This test is available via Main Menu > Analysis > Network Performance > Target.
As the analysis starts, BayesiaLab prompts us to specify the Target Evaluation Setting. In the given context, we select Evaluate All States and proceed.
Using the previously defined Test Set for evaluating our model, we obtain the initial performance results, including metrics such as Total Precision, R, R2, etc.
In the context of this example, the table in the center of the report, the so-called Confusion Matrix, is of special interest.
The Confusion Matrix features three tabs, Occurrences, Reliability, and Precision, which are illustrated below:
Of the 40 Malignant cases in the Test Set, 34 were identified correctly (True Positive Rate: 85%), and 6 were incorrectly predicted (False Negative Rate: 15%).
Of the 73 Benign cases in the Test Set, 68 were correctly identified as Benign (True Negative Rate: 93.15%), and 5 were incorrectly identified as Malignant (False Positive Rate: 6.85%).
The Overall Precision, which is reported at the top of the report window, is computed as the total number of correct predictions (True Positives + True Negatives) divided by the total number of cases in the Test Set, i.e., (68+34)÷113=90.265%.
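As a sanity check, the figures reported above can be reproduced directly from the counts in the Confusion Matrix. The small Python sketch below is purely illustrative and plugs in the numbers from this particular Test Set.

```python
# Recompute the reported metrics from the raw counts of the Confusion Matrix
# (Malignant is treated as the positive class, as in this example's Test Set).
def confusion_metrics(tp, fn, tn, fp):
    return {
        "true_positive_rate": tp / (tp + fn),   # 34/40 = 85%
        "false_negative_rate": fn / (tp + fn),  # 6/40 = 15%
        "true_negative_rate": tn / (tn + fp),   # 68/73 = 93.15%
        "false_positive_rate": fp / (tn + fp),  # 5/73 = 6.85%
        "overall_precision": (tp + tn) / (tp + fn + tn + fp),  # 102/113 = 90.265%
    }

print(confusion_metrics(tp=34, fn=6, tn=68, fp=5))
```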
An Overall Precision of around 90% is encouraging, but we must remember that we randomly selected the Test Set.
To mitigate any sampling artifacts that may occur in such a one-off Test Set, we can systematically learn networks on a series of different subsets and then aggregate the test results.
For this purpose, we perform a K-Folds Cross-Validation, which will iteratively select K different Learning Sets and Test Sets and then learn the corresponding networks and test their respective performance.
With this approach, we need to remove the original Learning Set and Test Set split. Right-clicking on the database icon in the lower right corner of the Graph Window brings up a menu. Here, we select Remove Learning/Test Split.
Then, K-Folds Cross-Validation can be started via Main Menu > Tools > Resampling > Target Evaluation > K-Fold.
We use the same learning algorithm as before, i.e., the Markov Blanket, and choose K=10 as the number of sub-samples to be analyzed.
Of the total dataset of 569 cases, each of the ten iterations (folds) will set aside a Test Set of 56 or 57 randomly drawn samples and use the remaining 512 or 513 as the Learning Set. This means that BayesiaLab learns one network per Learning Set and then tests the performance on the respective Test Set. It is important to ensure that the Shuffle Samples option is checked.
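To get a feel for the fold sizes involved, the sketch below uses scikit-learn's KFold purely for illustration; BayesiaLab performs the resampling internally, and the seed shown here is arbitrary.

```python
# Fold sizes implied by 10-fold cross-validation on 569 records: each fold holds out
# 56 or 57 records as a Test Set and learns on the remaining 512 or 513.
import numpy as np
from sklearn.model_selection import KFold

records = np.arange(569)
for learn_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=31).split(records):
    print(len(learn_idx), len(test_idx))
```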
The summary, including the synthesized results, is shown below. These results confirm the good performance of this model.
The Total Precision is 92.97%, with a False Negative Rate of 11.32%. This means that 24 of the 212 Malignant cases were incorrectly predicted as Benign.
Clicking Comprehensive Report produces a summary with additional analysis options.
It is helpful to click the Network Comparison button to understand what exactly is happening during the K-Folds Cross-Validation.
It brings up a Synthesis Structure of all the networks learned during the K-Folds Cross-Validation.
Black arcs in the Synthesis Structure above indicate that these arcs were present in the Reference Structure (below), i.e., the network that was learned on the basis of the original Learning Set.
The thickness of the arcs in the Synthesis Structure reflects how often these links were found in the course of the K-Folds Cross-Validation. The blue-colored arc indicates that the link was only found in some folds but that it was not part of the Reference Structure. The thickness of the blue arc is also proportional to the number of folds in which that arc was added.
The first structure after the Synthesis Structure is the Reference Structure, which was the current network when we started the K-Folds Cross-Validation.
After the Reference Network, we arrive at Comparison Structure 0. This network structure was learned in 1 out of 10 folds.
Comparison Network 1 was found 1 out of 10 times.
Comparison Network 2 was found 3 out of 10 times.
Comparison Network 3 was found 4 out of 10 times.
Comparison Network 4 was found 1 out of 10 times.
So, the first network we learned from the original Learning Set, the Reference Structure, was only found in 1 of the 10 networks learned during the 10-Fold Cross-Validation.
Given the relatively small sample size of the original Learning Set (456), it is unsurprising that larger sample sizes, i.e., approximately 512 records in each fold of the 10-Fold Cross-Validation, would lead to alternative structures.
Performing the K-Fold Cross-Validation shows that the Overall Precision of a Markov Blanket model can approach 93%.
However, a False Negative Rate of over 10% may prevent such a model from being useful for clinical purposes. In the context of diagnosing cancer, a False Negative means missing a malignant case.
As a result, we proceed to another algorithm to evaluate its potential for improved diagnostic performance: Supervised Learning: Augmented Markov Blanket.
Early in this chapter, we used the Augmented Markov Blanket algorithm to machine-learn a predictive model for classifying cell samples.
Subsequently, we optimized the model with the Structural Coefficient Analysis workflow.
We can now use the validated model for analysis and inference.
In this and the next two sections of this chapter, we look at different ways of performing inference:
Automatic-Evidence Setting
Adaptive Questionnaire
Target Interpretation Tree
Before proceeding to Automatic Evidence-Setting, we bring up all the Monitors connected to the Target Node in the Monitor Panel.
Since we have a Target Node, we can right-click inside the Monitor Panel and select Sort > Target Correlation from the Monitor Panel Context Menu.
Alternatively, we can do the same via Main Menu > Monitor > Sort > Target Correlation.
The Monitor of the Target Node is placed first in the Monitor Panel, followed by the other Monitors according to their “correlation” with the Target Node, from highest to lowest.
Note that we use “correlation” not literally in this context. Rather, the sort order of the Monitors is determined by Mutual Information.
Given that we have a predictive model in place, we can use BayesiaLab to review its individual predictions record by record.
This feature is called Automatic Evidence-Setting, which can be accessed via Main Menu > Inference > Automatic Evidence-Setting.
In earlier releases of BayesiaLab, this function was called Interactive Inference.
The first record in the dataset is displayed in the screenshot below as record #0.
Additionally, the Row Identifier, 842302, is displayed in the Status Bar to the right of the Progress Bar at the bottom of the Graph Window.
The Monitors display the values of the variables in that record, i.e., the set of evidence or observations.
Given these observations, the model predicts a 92.36% probability that the diagnosis is malignant (the Monitor of the Target Node features a green background).
With such a high probability, the prediction diagnosis=malignant is the rational prediction.
As it turns out, it is indeed the correct prediction for record #0 (Row Identifier 842302). The actual value recorded in the dataset is represented by a light blue bar, meaning diagnosis=malignant was the ground truth in this case.
As we know from all the validation steps, the model performs well with an Overall Precision above 90%. Hence, most predictions are clear, just like this one.
However, exceptions exist, such as record #138 (Row Identifier 868826). Here, the probability of diagnosis=benign is approximately 51%. Given this probability, the model predicts diagnosis=benign. However, this turns out to be incorrect. Here, the actual observation is diagnosis=malignant, which is again highlighted by a light blue bar.
Given the performance of the Markov Blanket algorithm in the previous section (Supervised Learning: Markov Blanket), we are now looking for improvements by considering alternatives within the group of Supervised Learning algorithms.
BayesiaLab offers an extension to the Markov Blanket algorithm, namely the Augmented Markov Blanket, which runs an Unsupervised Learning algorithm on the nodes in the Markov Blanket. This relaxes the Naive structure's constraint that the child nodes be independent of each other given the Target. Thus, it helps identify influence paths among the predictor variables and potentially improves the predictive performance. Adding such arcs is similar to automatically creating interaction terms in a regression analysis.
As expected, the resulting network is slightly more complex than the Markov Blanket.
BayesiaLab offers a tool for formally comparing network structures, which we can apply to the Augmented Markov Blanket we just learned and the previously learned Markov Blanket.
We can use Main Menu > Tools > Compare > Structure to highlight the differences between both networks.
Given that the addition of two arcs is immediately visible, this function may seem like overkill for our example. However, in more complex situations, Structure Comparison can be rather helpful.
By default, the current network appears as the Reference Network, and for the Comparison Network, we select the previously learned Markov Blanket.
Clicking Compare brings up the Structure Comparison Report.
This report provides a list of arcs common to both networks and another list of those removed in the Comparison Network.
Clicking Structure Comparison shows a Synthesis Structure that visualizes these differences.
The arcs that exist in the Reference Structure, i.e., Augmented Markov Blanket, but do not exist in the Comparison Structure, i.e., the Markov Blanket, are highlighted in red.
Please see Direct Structural Network Comparison for a much more detailed explanation of this tool.
Given that the Augmented Markov Blanket algorithm has added only a couple of arcs to the network, compared to what the Markov Blanket algorithm produced, we may not expect a dramatic difference in predictive performance.
However, any improvement in terms of reducing the False Negative Rate would be welcome. So, we run the Network Performance Analysis again: Main Menu > Analysis > Network Performance > Target.
As the analysis starts, BayesiaLab prompts us to specify the Target Evaluation Setting. Again, we select Evaluate All States and proceed.
Using the previously defined Test Set for evaluating our model, we obtain the performance report:
Now, we can compare the new performance metrics of the Augmented Markov Blanket with the ones previously obtained with the Markov Blanket.
Most notable is the change in False Negatives, which drops from 6 to 1, i.e., a reduction from 15% to 2.5% in the False Negative Rate.
If this performance were to hold, it could turn this model from moderately useful to a valuable diagnostic tool.
Recognizing the potential of the Augmented Markov Blanket algorithm, we proceed to the K-Folds Cross-Validation: Main Menu > Tools > Resampling > Target Evaluation > K-Fold.
The steps are identical to what we did for the Markov Blanket, so we move straight to the report.
As it turns out, the Cross-Validation Report does not confirm the excellent False Negative Rate that the evaluation with regard to the Test Set suggested.
Nevertheless, comparing the Cross-Validation results, the Augmented Markov Blanket algorithm delivers an improvement over the Markov Blanket.
With the apparent advantage of the Augmented Markov Blanket model, we will now try to fine-tune this model further in pursuit of a performance gain.
In the next section, Structural Coefficient Analysis, we explore how adjusting the Structural Coefficient can bring us closer to the performance limits of the model.
In situations in which individual cases are under review, e.g., when diagnosing a patient, BayesiaLab can provide diagnostic support by means of the Adaptive Questionnaire.
This approach helps prioritize what variable to investigate or what pieces of evidence to collect in order to reduce the uncertainty regarding a target variable of interest.
Whenever you have a Bayesian network with a Target Node, regardless of whether the network was machine-learned or created from expert knowledge, you can launch the Adaptive Questionnaire.
Importantly, the Adaptive Questionnaire seeks the optimal sequencing of evidence for a specific case or instance rather than creating a set of rules that apply in general.
For creating a generalized set of priorities, please see Target Interpretation Tree in this chapter.
The Adaptive Questionnaire can be started via Main Menu > Inference > Adaptive Questionnaire.
For a Target Node with more than two states, it can be helpful to specify a Target State for the Adaptive Questionnaire.
Setting a Target State allows BayesiaLab to compute the Binary Mutual Information and then focus on that designated state.
However, as the Target Node in our example is binary, setting a Target State is superfluous.
Furthermore, we can set the cost of collecting observations via the Cost Editor, which can be started by clicking the Edit Observations Costs button.
This is helpful if certain variables are more costly to observe or require more effort to obtain than others. So, Costs do not necessarily have to represent a financial cost. For instance, we could make Costs proportional to the difficulty of collecting observations.
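Conceptually, the Adaptive Questionnaire ranks candidate observations by how much uncertainty about the Target they are expected to remove per unit of Cost, and it re-ranks them after each new piece of evidence. The sketch below is a crude stand-in for that idea rather than BayesiaLab's algorithm: it scores candidates by their (marginal) mutual information with the target divided by their cost.

```python
# Crude illustration (not BayesiaLab's algorithm): pick the next variable to observe
# by mutual information with the target per unit of observation cost. A faithful
# version would recompute this conditional on the evidence gathered so far.
import numpy as np
from sklearn.metrics import mutual_info_score

def next_question(candidates, target, costs):
    """candidates: {name: discretized values per record}; target: labels; costs: {name: cost}."""
    scores = {
        name: (mutual_info_score(values, target) / np.log(2)) / costs.get(name, 1.0)
        for name, values in candidates.items()
    }
    return max(scores, key=scores.get)

# Toy usage with made-up, already-discretized observations:
candidates = {"perimeter": [0, 0, 1, 1, 1, 0], "texture": [0, 1, 0, 1, 0, 1]}
diagnosis = [0, 0, 1, 1, 1, 0]
print(next_question(candidates, diagnosis, costs={"perimeter": 1.0, "texture": 2.0}))
```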
In analyzing Fine Needle Aspirates, all image attributes are obtained simultaneously. As a result, this particular domain is not ideal for demonstrating the Adaptive Questionnaire.
A better example would be a diagnostic process, in which a clinician collects observations from a patient in a targeted way. We can imagine that a physician starts the diagnosis process by collecting easy-to-obtain data, such as blood pressure, before proceeding to more elaborate (and more expensive) diagnostic techniques, such as performing an MRI.
Here, we simulate using the Adaptive Questionnaire as if we could choose the order of collecting evidence.
After starting the Adaptive Questionnaire, BayesiaLab presents the Monitor of the Target Node and displays its marginal probability. That Monitor is highlighted in green.
The Monitor of the node we just observed drops to the bottom of the list. Given that we already know its value, no further information can be gained from it.
The small gray arrows inside the Monitors indicate how much the probabilities have changed.
Note that we are not merely seeing the next-in-line Monitor "moving up." Rather, the entire list is recomputed, given the most recent piece of evidence.
The order of the remaining unobserved nodes is now:
In this hypothetical example, the last observation appears to have a rather substantial impact on the diagnosis.
The Adaptive Questionnaire is a highly practical tool for seeking the optimal next piece of evidence when trying to determine the state of a Target Node.
We used the Adaptive Questionnaire via the Graphical User Interface in BayesiaLab in this example. For situations when end-users do not have access to the BayesiaLab software, you can publish an Adaptive Questionnaire via the WebSimulator. This allows anyone to interact with an Adaptive Questionnaire through a web browser.
Finally, BayesiaLab can produce a static version of the Adaptive Questionnaire, which can be used entirely offline. This tool is the Target Interpretation Tree, which we discuss in the next section.
The Structural Coefficient icon now indicates that we are employing an SC value other than the default of 1.
The Structural Coefficient icon features an unbalanced scale. This symbolizes that we departed from the balanced weighting of fit and complexity. Instead, we have "put our thumb on the scale" to pursue a better fit of our model while accepting a higher complexity.
After returning to the Modeling Mode (F4 or the corresponding Toolbar button), we relearn the network using the same Augmented Markov Blanket algorithm as before.
This function is accessible in Validation Mode (F5 or the corresponding Toolbar button) by selecting Main Menu > Analysis > Visual > Overall > Arc > Mutual Information.
Each arc's thickness is now proportional to the Mutual Information of the nodes it connects. Furthermore, the icon indicates that additional information, i.e., the Arc Comments, is available to be displayed.
So, we select Main Menu > View > Show Arc Comments. Alternatively, clicking the Show Arc Comments button in the Toolbar achieves the same result.
We can scroll through all the networks discovered during the K-Folds Cross-Validation using the record selector icons.
For all types of inference with a Bayesian network model, we need to switch to Validation Mode (F5).
As an extension of the Main Menu, a Navigation Bar and its record selectors allow us to scroll through all records in the dataset.
After returning to the Modeling Mode (F4 or the corresponding Toolbar button), we start this learning algorithm via Main Menu > Learning > Supervised Learning > Augmented Markov Blanket.
Note that we are using the original Learning/Test Set split again. The symbol tagged onto the database icon reminds us that the Learning Set and Test Set split is in place.
Furthermore, the Monitors are automatically sorted in descending order with regard to the Target Node by taking into account the Mutual Information (or Binary Mutual Information, if applicable) and the Cost of obtaining the evidence:
Given this order, it would be ideal to collect the value of as the first observation.
Let us suppose that we can do that and obtain as the first piece of evidence.
Upon setting that state in the Monitor of , the Monitor Panel is updated as follows:
In the Monitor of the Target Node, we see that the probability of has increased to 69.08%.
The order of Monitors is resorted according to and Cost:
Also, the distributions of all the not-yet-observed nodes have changed, with increasing substantially.
For instance, when we started the Adaptive Questionnaire, was two spots ahead of . Given the last observation, however, has become more important than area.
So, given the above ordering, would be the next best evidence to obtain.
In real-world applications, it is possible that the ideal evidence is not available and, therefore, must be skipped. We simulate such a situation by observing instead of .
We find and enter that evidence in the corresponding Monitor:
The probability of decreases to 56.23%.
For this iteration, we follow the top recommendation, i.e., , and observe .
The probability of decreases further to 15.70%.
For this iteration, we follow the recommendation and observe .
The probability of increases to 19.04%.
At this point, only remains unobserved.
For the last node, , we obtain .
The Monitor of the Target Node now reports that has a probability of 93.05%.
Test Set evaluation:
Markov Blanket: Overall Precision 90.224%, False Negative Rate 15%
Augmented Markov Blanket: Overall Precision 94.669%, False Negative Rate 2.5%
K-Folds Cross-Validation:
Markov Blanket: Overall Precision 92.97%, False Negative Rate 11.32%
Augmented Markov Blanket: Overall Precision 94.2%, False Negative Rate 9.43%