
Step 4 — Discretization and Aggregation

Context

Step 4 — Discretization and Aggregation requires you to make several more important choices before concluding the import process.
Unlike the previous steps, each of which consisted of a single screen, Step 4 provides one screen per variable type, for a total of six screens.

Overview of Screens

As you go from Step 3 to Step 4, the variable that you last selected in Step 3 remains highlighted.
Depending on the variable type, Step 4 starts with one of six possible screens, one for each variable type. Click on the thumbnails in the following table for a preview.
Note that for Row Identifier and Unused variables, no actions are available. Except for the Data panel, the corresponding screens are blank.

For all other variable types, we discuss all available options in detail in separate sections:

Variable Type-Specific Screens

Weights
Learning/Test
Discretization
Aggregation

Weights

Context

This screen is only available if you designated a Weight variable in Step 2 — Definition of Variable Types.

Usage

Click on that Weight variable in the Data panel, and the Normalize Weights checkbox appears as the only option on the screen.

You need to determine whether to apply Normalize Weights or not:
- If yes, the Weights will be normalized so that the total number of cases considered by BayesiaLab for machine learning is equal to the actual number of samples in the dataset.
- If no, the Weight variable will be treated as representing the actual number of observed cases. So, a weight of 10 for one observation would be treated and counted like ten instances of that same observation. As a result, the total number of cases considered by BayesiaLab would correspond to the population from which the weight was calculated.
- This example illustrates the situation for a survey consisting of 10 observations whose weights sum to 100:
  - If you do not normalize, BayesiaLab considers a sample of 100 for learning purposes and would likely find spurious relationships. This "over-counting" by a factor of 10 has the same effect as reducing the Structural Coefficient to 0.1.
  - If you normalize, BayesiaLab considers the correct proportions of the weighted samples but still only considers ten observations in total for learning purposes.
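
The arithmetic behind the two options can be sketched in a few lines of Python. This is only an illustration under assumptions: the weights are hypothetical (ten observations whose weights sum to 100), and `normalize_weights` is an illustrative helper, not a BayesiaLab function.

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to the number of observations."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]

# Hypothetical survey: 10 observations, raw weights summing to 100.
raw = [5, 5, 5, 10, 10, 10, 10, 15, 15, 15]
normalized = normalize_weights(raw)

print(sum(raw))         # effective number of cases without normalization
print(sum(normalized))  # effective number of cases with normalization
```

Normalization changes only the total: the ratio between any two weights stays the same, so the weighted proportions are preserved.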

If you have specified a Weight variable, it will be taken into account in the Discretization and Aggregation algorithms.

Learning/Test

Context

This screen is only available if you designated a Learning/Test variable in Step 2 — Definition of Variable Types.

Usage

Select the Learning/Test variable by clicking on its header or into the corresponding column.
Select BayesiaLab's learning and test labels from the drop-down lists to match the codes in your dataset.
Additionally, you can see the proportion of cases for each code in your dataset.

If you have designated a Learning/Test variable, only the "learning" rows will be taken into account for Discretization and Aggregation. Otherwise, you would partially defeat the purpose of having a hold-out set.
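
The principle can be illustrated with a minimal Python sketch. The rows, the labels "learning" and "test", and the choice of the mean as a single threshold are all hypothetical; the point is only that the threshold is derived from the learning rows and then applied to every row.

```python
from statistics import mean

# Hypothetical (value, split) pairs; the test row is an extreme value.
rows = [(1.0, "learning"), (2.0, "learning"), (3.0, "learning"), (100.0, "test")]

# The threshold is computed from the learning rows only.
learning_values = [value for value, split in rows if split == "learning"]
threshold = mean(learning_values)

# The resulting threshold is then applied to all rows, including the test row.
bins = [int(value > threshold) for value, _ in rows]
print(threshold, bins)
```

Had the test row been included, the mean would have shifted to 26.5, and the hold-out data would have influenced the model's definition.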

Discretization

Context

BayesiaLab requires the discretization of all Continuous variables, and in this screen, you need to specify how to discretize those variables.
The Discretization process determines how a Continuous variable will be imported into BayesiaLab, i.e.,
- the number of intervals (or bins);
- the values of the thresholds which define the ranges of the intervals.
These attributes define the transformation of the underlying Continuous variable in the dataset into a discretized Continuous node in BayesiaLab.

To learn more about the important distinction between Continuous and Discrete nodes, please see these topics:

Continuous Nodes
Discrete Nodes

Usage

Select one or more Continuous variables by clicking on their headers or into the corresponding columns.
The Discretization panel appears.

Discretization Types Overview

The first item in the Discretization panel is the Discretization Type drop-down menu.
The items on this list can be grouped into Automatic Discretization versus Manual Discretization.
- The bottom item on the drop-down menu, Manual, refers to a Manual Discretization approach in which you have full control over thresholds, etc.
- The remaining eleven items all refer to different kinds of Automatic Discretization.

However, even in Manual Discretization, you can take advantage of the algorithms available with Automatic Discretization.

Discretization Types in Detail

Manual Discretization
Automatic Discretization

Manual Discretization

Context


Select Manual from the drop-down menu.
Several additional items and buttons appear on the left side, plus a Cumulative Distribution Function (CDF) is shown on the right. This CDF plot can help in selecting appropriate discretization intervals.
In the screenshot below, the variable Standing Height (cm) is selected, meaning that the CDF plot corresponds to that variable.

Click on the Density Function button, and the Probability Density Function (PDF) of the same variable appears.
Now the button reads Distribution Function, and by clicking it, you can toggle back to the CDF view.

By default, only one threshold is placed at the mean value of the corresponding variable.
This threshold appears as a horizontal line on the CDF and a vertical line on the PDF.
The CDF and PDF plots are interactive; you can add, delete, and modify thresholds.

Editing Thresholds

The following instructions apply to both plots:

To select a threshold, left-click on that threshold.
The selected threshold is highlighted in red.
The remaining thresholds on the plot remain blue.
The precise numerical value of a selected threshold is shown in the Threshold Value field to the right of the plot.
To move a threshold, click on it and hold, then move it. Release to fix its position.
The percentages displayed at the end of a selected threshold refer to the share of observations that fall into the intervals above and below this threshold.
Instead of moving the selected threshold with your cursor, you can type a specific value into the Threshold Value field.
To add a threshold, right-click on the desired position.
To remove an existing threshold, right-click on it.
A zoom function is available for examining the plot in detail:
- To zoom in, hold the Ctrl key, click and hold the left mouse button, then move the cursor across the range you wish to focus on.
- You can zoom in repeatedly until you have reached the desired magnification level.
- To revert to the default zoom, hold Ctrl, then double-click anywhere in the plot area.
As an alternative to selecting a threshold by left-clicking, you can scroll through all thresholds using the Previous and Next buttons.

Note that as soon as a threshold is defined on a Continuous variable, it is considered Discretized, and the variable's data column is colored in soft blue.

The interactive CDF and PDF plots are similar to the editing functions available under Curve View in the Node Editor.

Workflow Illustration

We re-use the dataset from the previous steps, so we can fast-forward to Step 4 and focus on that step.

Generate a Discretization

While remaining on the Manual Discretization screen, you can also utilize the Generate a Discretization function.

Click on the Generate a Discretization button.
Then, select the Type from the drop-down menu, e.g., the R2-GenOpt algorithm. You have nine algorithms available, i.e., the univariate methods only.

Choose the number of Intervals, e.g., 5.
Set a Minimum Interval Weight, which defines the minimum prior probability of an interval in percent. The default value is 1%.
Note that you can set defaults for the above settings under Main Menu > Window > Preferences > Discretization.

Additionally, there are options for Log Transformation and Isolate Zeros, which we discuss in the context of Automatic Discretization.
Click OK to perform the Discretization.

Workflow Illustration

Transfer the Discretization Thresholds

Select the source variable from which you wish to copy the thresholds.
Click the Transfer the Discretization Thresholds button.
A new window opens up that allows you to select one or more target variables.
Select the target variables.
Click OK.

Workflow Animation

Create a Class for Each Type of Discretization

This checkbox is synchronized across Manual and Automatic Discretization processes.
If checked, BayesiaLab automatically creates Classes for each type of Discretization, i.e., all variables that are discretized with the same algorithm will belong to the same Class.
Note that variables that were discretized manually, even if you used the Generate a Discretization button, will all become members of the Class MANUAL.
You can review the Class memberships in the Class Editor after the data import process is complete.

Load Discretizations

This function allows you to load a Discretization Dictionary with saved Discretization Intervals and Discretization Methods.
This approach is particularly helpful when you repeatedly import datasets with the same variables for which you have already found a suitable discretization.

The following text file illustrates the syntax of a Discretization Dictionary.

Automatic Discretization

Context

Except for Manual, all items in the Type menu represent Automatic Discretization algorithms.

Usage

Selecting a Discretization algorithm applies variable by variable, i.e., you can use a different algorithm for each Continuous variable.
To select a variable, click on the variable header or anywhere inside the column.
You can perform the selection and deselection of multiple variables with keystroke combinations commonly used in spreadsheet editing:
- Ctrl+Click: add a variable to the current selection.
- Shift+Click: add all variables between the currently selected and the clicked variable to the selection.
- Ctrl+A: select all variables in the Data panel. However, selecting all variables is not useful here in Step 4, as there are no actions that can apply to all variable types.
- Shift+End: select all variables from the currently selected variable to the rightmost variable in the table.
- Shift+Home: select all variables from the currently selected variable to the leftmost variable in the table.
Click the Select All Continuous button to select all Continuous variables.
- Note that this action will also select any variables which you have already discretized manually. As a result, you may override your previous choices.
- Note that Continuous variables already discretized manually are highlighted in soft blue.

If a variable has neither been assigned an algorithm nor been discretized manually, the default Discretization algorithm with its default settings will be used.
You can set the default Discretization algorithm under Main Menu > Window > Preferences > Discretization.
For the following algorithms, a Log Transformation is available as an option:
- Applying the Log Transformation is useful if you have a high density of values at the bottom end of the variable domain. This "stretches" the scale for small values approaching zero.
- Note that the Log Transformation is only used temporarily for discretization purposes. Thus, the values of the thresholds and values of the intervals can all be interpreted based on the original scale.
For the following algorithms, the option Isolate Zeros is available:
- Separating 0 into a separate interval can be useful for zero-inflated distributions so as to clearly separate small values from "absolutely nothing."
Click Finish to perform the Discretization.
A progress bar displays the status of the Discretization process:

If a Filtered Value is defined for a Continuous variable, a new artificial interval with an infinitesimally small width of 10^-7 will be added after the intervals defined in this step. This newly-created state will serve as the Filtered State, and "*", i.e., the asterisk character, will be its State Name.
At the conclusion of the Discretization process, BayesiaLab opens a Graph Window with all imported variables now represented as nodes.
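
The effect of the Log Transformation and Isolate Zeros options described above can be sketched in Python. This is a rough illustration, not BayesiaLab's implementation: we assume a natural logarithm, assume that zeros are set aside before the transformation, and use simple equal-width intervals as the underlying method. The function names and the data are hypothetical.

```python
import math

def equal_width_thresholds(values, bins):
    """Thresholds that split the range of `values` into `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def log_scale_thresholds(values, bins):
    """Compute thresholds on log(x), then report them on the original scale."""
    positives = [v for v in values if v > 0]  # assumption: zeros get their own interval
    logged = [math.log(v) for v in positives]
    return [math.exp(t) for t in equal_width_thresholds(logged, bins)]

values = [0, 0, 1, 10, 100, 1000]  # hypothetical zero-inflated, right-skewed data
print(equal_width_thresholds(values, 3))  # thresholds far above most of the data
print(log_scale_thresholds(values, 3))    # thresholds spread across the small values
```

Because the thresholds are mapped back with the exponential function, they can be read on the original scale, as noted above.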

Automatic Discretization Algorithms in Detail

Tree

Context

Tree is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.

Algorithm Details & Recommendations

Tree is a bivariate discretization method. It machine-learns a decision tree that uses the to-be-discretized variable as its only predictor to represent the conditional probability distribution of the Target variable. Once the Tree is learned, it is analyzed to extract the most useful thresholds.
It is the method of choice in the context of Supervised Learning, i.e., if you plan to machine-learn a model to predict the Target variable.
At the same time, we do not recommend using Tree in the context of Unsupervised Learning. The Tree algorithm creates bins that are biased toward the designated Target variable. Naturally, emphasizing one particular variable would run counter to the intent of Unsupervised Learning.
Note that if the to-be-discretized variable is independent of the selected Target variable, it will be impossible to build a tree, and BayesiaLab will prompt you to select a univariate discretization algorithm.
All manually discretized variables can be used as a Target variable for Tree discretization.

Using a Target variable for Discretization does not create a Target Node in the network.
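
To give an intuition for how a tree can yield thresholds, here is a minimal Python sketch of a single split chosen by information gain. It is a generic textbook technique, not BayesiaLab's Tree algorithm: a full tree would split recursively and apply a stopping criterion, and the function names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, targets):
    """Return (threshold, information gain) of the best single binary split."""
    pairs = sorted(zip(values, targets))
    base = entropy([t for _, t in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            # Assumption: the threshold sits halfway between neighboring values.
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best

values = [1, 2, 3, 10, 11, 12]                    # hypothetical continuous variable
targets = ["no", "no", "no", "yes", "yes", "yes"]  # hypothetical Target variable
print(best_split(values, targets))
```

If no split improves on the baseline entropy, the function returns `(None, 0.0)`, which mirrors the situation in which no tree can be built because the variable carries no information about the Target.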

Perturbed Tree

Context

Perturbed Tree is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.

Algorithm Details & Recommendations

The Perturbed Tree algorithm is designed to optimize the representation of the probabilistic dependency between a Target variable and the to-be-discretized variable. It is an extension of the Tree discretization algorithm, and it functions as follows:
- Data Perturbation generates a range of datasets.
- For each perturbed dataset, a univariate tree is learned to predict the Target variable with the to-be-discretized continuous variable.
- Extracting the most frequent thresholds produces the final discretization.
The Perturbed Tree algorithm takes into account the Minimum Interval Weight and can reduce the number of bins if necessary. It can also be more robust than the simple Tree discretization.

Supervised Multivariate

Context

Supervised Multivariate is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.

Algorithm Details & Recommendations

The Supervised Multivariate discretization algorithm focuses on representing the multivariate probabilistic dependencies involving a Target variable.
It utilizes Random Forests to find the most useful thresholds for predicting the Target variable.
Its function can be summarized as follows:
- Data Perturbation generates a range of datasets.
- For each perturbed dataset, a multivariate tree is learned to predict the Target variable with a subset of variables. If a structure is already defined, it is used to bias the selection of the variables for each dataset.
- Extracting the most frequent thresholds produces the final discretization.
The Supervised Multivariate algorithm takes into account the Minimum Interval Weight and can improve the generalization capability of the model.
Being based on Random Forests, this algorithm is computationally expensive and stochastic by nature.
After the conclusion of the Data Import Wizard, the Supervised Multivariate discretization algorithm is also available from Main Menu > Learning > Discretization.
Note that the Supervised Multivariate discretization algorithm is not available via Node Context Menu > Node Editor > States > Curve > Generate a Discretization.

R2-GenOpt

Context

Algorithm Details & Recommendations

The R2-GenOpt algorithm utilizes a Genetic Algorithm to find a discretization that maximizes the R² between the discretized variable and its corresponding (hidden) Continuous variable.
As such, it is the optimal approach for achieving the first objective of discretization, i.e., finding a precise representation of the values of a Continuous variable.
This algorithm takes into account the Minimum Interval Weight and can also create a specific bin for representing zeros if the Isolate Zeros option is set.
In Validation Mode, the R² value between the Discretized variable and its corresponding Continuous variable can be retrieved in the Information Mode by hovering over the monitor.
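
The quantity being maximized can be sketched in Python. We assume here that each interval is represented by the mean of the observations it contains, in which case R² equals one minus the ratio of the within-interval to the total sum of squares. This shows only the objective; the Genetic Algorithm search is not reproduced, and the function name, the boundary convention, and the data are hypothetical.

```python
from bisect import bisect_right
from statistics import mean

def r_squared(values, thresholds):
    """R² between the values and the means of the intervals they fall into."""
    thresholds = sorted(thresholds)
    intervals = {}
    for v in values:
        # Assumption: a value equal to a threshold belongs to the upper interval.
        intervals.setdefault(bisect_right(thresholds, v), []).append(v)
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_within = sum((v - mean(group)) ** 2
                    for group in intervals.values() for v in group)
    return 1 - ss_within / ss_total

values = [1, 2, 3, 10, 11, 12]   # hypothetical data with two clear groups
print(r_squared(values, [6.5]))  # threshold between the two groups
print(r_squared(values, [2.5]))  # poorly placed threshold
```

A search procedure, genetic or otherwise, would evaluate many candidate threshold sets with such a function and keep the one with the highest score.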

Workflow Illustration

R2-GenOpt*

Context

R2-GenOpt* is a modified version of R2-GenOpt and uses a specific MDL score to choose the number of bins.

Algorithm Details & Recommendations

With 100 observations, even though we selected 8 bins, only 3 were created for the variable 8- Wrist girth.

With 1,500 observations, even though we selected 10 bins, only 5 have been created for AGN, and 6 for ALL.

K-Means

Context

K-Means is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.

Algorithm Details & Recommendations

The K-Means algorithm is based on the classical K-Means data clustering algorithm but uses only one dimension, which is the to-be-discretized variable.
K-Means returns a discretization that directly depends on the Probability Density Function of the variable.
More specifically, it employs the Expectation-Maximization algorithm with the following steps:
1. Initialization: random creation of K centers
2. Expectation: each point is associated with the closest center
3. Maximization: each center position is computed as the barycenter of its associated points
Steps 2 and 3 are repeated until convergence is reached.
Based on the sorted centers K_i, each discretization threshold T_i is defined as the midpoint between two adjacent centers:

\[ T_i = \frac{K_i + K_{i+1}}{2} \]

The following figure illustrates how the algorithm works with K=3.

For example, applying a three-bin K-Means Discretization to a normally distributed variable would create a central bin representing roughly half of the data points and one bin of roughly a quarter each for the distribution's tails.
Without a Target variable, or if little else is known about the variation domain and distribution of the Continuous variables, K-Means is recommended as the default method.
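
The Expectation and Maximization steps and the midpoint rule for the thresholds can be sketched in Python as follows. This is a minimal illustration rather than BayesiaLab's implementation. In particular, to keep the result reproducible, the sketch starts from evenly spaced quantiles instead of the random initialization described above; the function name and the data are hypothetical.

```python
from statistics import mean

def kmeans_thresholds(values, k, max_iter=100):
    """One-dimensional K-Means; returns (sorted centers, midpoint thresholds)."""
    data = sorted(values)
    n = len(data)
    # Assumption: deterministic start at evenly spaced quantiles
    # (the text describes a random initialization).
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        # Expectation: associate each point with its closest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Maximization: move each center to the barycenter (mean) of its points.
        updated = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:
            break  # convergence: the centers no longer move
        centers = updated
    centers = sorted(centers)
    thresholds = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    return centers, thresholds

values = [1, 2, 3, 10, 11, 12, 20, 21, 22]  # hypothetical data with three groups
centers, thresholds = kmeans_thresholds(values, 3)
print(thresholds)  # [6.5, 16.0]
```

With a random initialization, the algorithm can converge to different local optima from run to run, which is why implementations often restart it several times.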

Density Approximation

Context

Density Approximation is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.

Algorithm Details & Recommendations

The Density Approximation discretization detects changes in the sign of the derivative of the Probability Density Function (PDF) in order to identify local minima and maxima.
Between each local minimum and maximum, the algorithm creates a threshold.

Also, the algorithm automatically detects the optimal number of bins, although you can specify the maximum number of bins.
The minimum size permitted for bins is 1% of the data points.

Normalized Equal Distance

Context

Normalized Equal Distance is one of the Automatic Discretization algorithms for Continuous variables in Step 4 — Discretization and Aggregation of the Data Import Wizard.

Algorithm Details & Recommendations

The Normalized Equal Distance algorithm pre-processes the data with a smoothing algorithm to remove outliers before computing equal partitions.
As a result, the algorithm is less sensitive to outliers than the Equal Distance algorithm.
The algorithm also takes into account the Minimum Interval Weight that defines the minimum prior probability of a bin.
You can adjust the default Minimum Interval Weight under Main Menu > Window > Preferences > Discretization.

Equal Distance

Context

Algorithm Details & Recommendations

The Equal Distance algorithm divides the range of the variable into intervals of equal width.
This method is particularly useful for discretizing variables that share the same variation domain (e.g. satisfaction measures in surveys).
Additionally, this method is suitable for obtaining a discrete representation of the density function.
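
As a minimal Python sketch of this idea, the thresholds follow directly from the minimum, the maximum, and the number of intervals. The function name and the data are hypothetical.

```python
def equal_distance_thresholds(values, bins):
    """Thresholds that split the range of `values` into `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

ratings = [1, 3, 4, 7, 8, 10]  # hypothetical satisfaction scores on a 1-to-10 scale
print(equal_distance_thresholds(ratings, 3))  # [4.0, 7.0]
```

Two variables with the same minimum and maximum receive identical thresholds regardless of how their values are distributed, which is what makes the method convenient for variables that share a variation domain.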

Equal Frequency

Context

Algorithm Details & Recommendations

The Equal Frequency algorithm defines thresholds so that each interval contains the same number of observations.
This approach typically produces a uniform distribution.
As a result, the shape of the original density function is no longer apparent upon discretization.
This also leads to an artificial increase in the entropy of the system, directly affecting the complexity of machine-learned models.
However, this type of discretization can be useful — once a structure is learned — for further increasing the precision of the representation of continuous values.
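
A minimal Python sketch of the idea follows. We assume that each threshold is placed halfway between the last value of one interval and the first value of the next; the function name and the data are hypothetical.

```python
def equal_frequency_thresholds(values, bins):
    """Thresholds that place (as nearly as possible) the same number of values in each interval."""
    data = sorted(values)
    n = len(data)
    thresholds = []
    for i in range(1, bins):
        cut = i * n // bins  # index of the first value in the next interval
        # Assumption: the threshold sits halfway between neighboring values.
        thresholds.append((data[cut - 1] + data[cut]) / 2)
    return thresholds

values = [1, 2, 3, 4, 5, 6, 100, 200, 300]    # hypothetical right-skewed data
print(equal_frequency_thresholds(values, 3))  # [3.5, 53.0]
```

Each of the three intervals contains three observations even though their widths differ greatly, which is why the shape of the original density is no longer visible after discretization. With many tied values, exactly equal counts may not be achievable.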

Unsupervised Multivariate

Context

This multivariate discretization method is based on analyzing the relationship between variables.

Algorithm Details & Recommendations

The Unsupervised Multivariate discretization algorithm focuses on representing multivariate probabilistic dependencies using Random Forests.
Its functionality can be described as follows:
- A new dataset is created as a clone of the original one.
- In this new dataset, each variable is independently shuffled to render all the variables independent while keeping the same statistics for each variable.
- The cloned dataset is concatenated with the original dataset. Then, a target variable is created to differentiate the clone from the original, indicating the independent set versus the original dependent set.
- Various datasets are generated from this concatenated dataset with Data Perturbation.
- For each perturbed dataset, a multivariate tree is learned to predict the target variable with a subset of variables. If a structure is already defined, it is used to bias the selection of the variables for each dataset.
- Extracting the most frequent thresholds produces the discretization.
- Being based on Random Forests, this algorithm is computationally expensive and stochastic by nature, especially when the number of variables is large.
The Unsupervised Multivariate discretization algorithm is also available after the data import via Main Menu > Learning > Discretization.
However, it is not available in the Node Editor (Node Context Menu > Edit > Curve > Generate a Discretization).
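
The first three steps of this procedure, i.e., cloning the dataset, shuffling each variable independently, and labeling the two halves, can be sketched in Python. The subsequent Data Perturbation and tree learning are not reproduced. The function name, the labels "original" and "shuffled", and the data are hypothetical.

```python
import random

def build_contrast_dataset(rows, seed=0):
    """Return the original rows plus an independently shuffled clone, each with a label."""
    rng = random.Random(seed)
    columns = [list(column) for column in zip(*rows)]
    for column in columns:
        rng.shuffle(column)  # shuffle each variable on its own to break dependencies
    clone = [tuple(record) for record in zip(*columns)]
    original = [tuple(record) for record in rows]
    # The label plays the role of the artificial target variable.
    return ([(record, "original") for record in original]
            + [(record, "shuffled") for record in clone])

rows = [(1, 10), (2, 20), (3, 30), (4, 40)]  # hypothetical, perfectly dependent variables
for record, label in build_contrast_dataset(rows):
    print(record, label)
```

Each shuffled column keeps exactly the same values as its original, so the marginal statistics are unchanged. A classifier can therefore only separate the two halves by exploiting dependencies between variables, and the thresholds it uses are the ones worth keeping.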

Aggregation

Unlike the Discretization step, which is mandatory for Continuous variables, Aggregation is optional for Discrete variables.

Note that an analogous function, Generate Aggregations, is also available for Discrete nodes in the States tab of the Node Editor.

This function is useful when dealing with a large number of values in a Discrete variable. Once imported, the large number of resulting Node States would make it difficult to discover any relationships with that node.

The Aggregation function in the Data Import Wizard is available for single Discrete variables and for multiple Discrete variables.

Please see the usage instructions and examples in the corresponding sub-topics:

Aggregation of Single Variable
Aggregation of Multiple Variables

Aggregation of Single Variable

To illustrate all related workflows, we use an American auto buyer satisfaction survey containing 42,397 responses. Each record contains attributes of the purchased vehicle, such as make (or brand), model, body style, vehicle segment, number of cylinders, transmission, price paid, self-reported fuel economy, plus hundreds of other variables.

Manual Aggregation

First, we want to manually aggregate all 37 automobile brands that appear in the survey into just two states, i.e., Premium Brands and Non-Premium Brands.

This manual aggregation will be based exclusively on our subjective perception of the auto industry as of 2009, which is when this particular survey was conducted.

Click on the Brand variable in the Data panel.
From the States list on the left, select the values you wish to aggregate using Shift+Click or Ctrl+Click.

Then, click the Aggregate button.
The newly-formed, aggregated state appears in the Aggregates list on the right.

By default, the original values are concatenated using the "+" symbol as a delimiter. An underscore "_" is added as a prefix.
As necessary, you can select more values from the States list and create additional aggregated states.
In the list of Aggregates, you can now replace the automatically-generated state names with more meaningful ones.

You can now proceed to any other variable or click Finish to conclude the Data Import Wizard.

Workflow Animation

Correlation-Aided Manual Aggregation

Continuing with the previous example, we now perform an aggregation of the same variable, Brand. Now, however, we use each brand's correlation with Price as a guide instead of our judgment.

For the purpose of this demonstration, we have already discretized the Price variable manually into three (arbitrary) intervals using two thresholds, i.e., $25,000 and $45,000.

We now want to use the correlation of each brand with the top interval, i.e., $45,000+, as a measure of its "premium appeal" so that we can reduce the 37 brands into three states, Mainstream, Premium, and Luxury.

For reference, 8.65% of all survey responses reported a vehicle purchase price of $45,000 or higher.

Workflow Instructions

Click on the Brand variable in the Data panel.
Click the Show Correlations box.
Select Target and State.

Review the values shown in the Correlations column. When you hover your cursor over the Correlation bar in a row, a Tooltip displays the percentage difference of that row versus the marginal value.
The colored bars show how each value compares to the marginal probability of the selected state of the target. A green-colored bar indicates a probability higher than the marginal probability, and a red bar suggests a lower probability.

Select the states to aggregate using Ctrl+Click.

Once you have selected the values, click the Aggregate button.
The newly aggregated values now appear as a single item in the Aggregates list.

Review the newly aggregated states and, if necessary, assign new names to replace the ones that were generated automatically.
To reverse the aggregation, select the aggregated items in the Aggregates list and click Delete.

Workflow Animation

Correlation-Aided Automatic Aggregation

The principal difference from Correlation-Aided Manual Aggregation is that you do not select your to-be-aggregated values manually but rather specify thresholds that determine the aggregation.

Click on a Discrete variable in the Data panel.
Click the Show Correlations box.

Select Target and State.
Review the values shown in the Correlations column. When you hover your cursor over the Correlation bar in a row, a Tooltip displays the percentage difference of that row versus the marginal value.
The colored bars show how each value compares to the marginal probability of the selected state of the target. A green-colored bar indicates a probability higher than the marginal probability, and a red bar suggests a lower probability.

Now, instead of manually selecting the values you want to aggregate, click the Automatic Aggregation button.
The Automatic Aggregation window opens up.

The colored bar at the top visualizes the percentage differences versus the marginal probability of the selected state of the target.
In our example, there is one brand, Mercury, which had no observations in the $45,000+ interval. As a result, it marks the bottom end of the spectrum, i.e., it is 8.65 percentage points below the marginal probability.
On the other end of the spectrum, Porsche is 83.97 percentage points above the marginal probability.
A default threshold is shown for 0, which is marked by the pink-to-red color change in the bar.
You can manually add thresholds by right-clicking on the bar.
As soon as you add a threshold, a corresponding entry appears in the list below.

Right-clicking again on an existing threshold removes that threshold.
You can move an existing threshold by clicking on it and then dragging it to the desired value.

Also, in the table below the colored bar, you can type in a threshold value.

By clicking OK, you confirm the specified thresholds, and all values in the States list will be aggregated accordingly.
Alternatively, you can click on Generate Aggregates and specify the desired number of intervals.
You obtain a set of aggregation thresholds, which you can further modify or accept by clicking OK.
Now you have a new set of states in the list of Aggregates.
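
The logic of this workflow can be sketched in Python: compute, for each state, the difference in percentage points between its conditional probability of the selected target state and the marginal probability, then group the states by the intervals that the thresholds define. This is an illustration under assumptions, not BayesiaLab's implementation; the function names, the boundary convention, and the counts are hypothetical and are not taken from the survey.

```python
from bisect import bisect_right

def deltas_vs_marginal(counts):
    """counts: {state: (cases in target state, all cases)} -> percentage-point deltas."""
    hits = sum(h for h, _ in counts.values())
    total = sum(n for _, n in counts.values())
    marginal = 100 * hits / total
    return {state: 100 * h / n - marginal for state, (h, n) in counts.items()}

def aggregate(deltas, thresholds):
    """Group states according to the interval their delta falls into."""
    thresholds = sorted(thresholds)
    groups = {}
    for state, delta in deltas.items():
        # Assumption: a delta equal to a threshold belongs to the upper interval.
        groups.setdefault(bisect_right(thresholds, delta), []).append(state)
    return [groups[key] for key in sorted(groups)]

# Hypothetical counts per brand: (purchases in the top price interval, all purchases).
counts = {"A": (0, 100), "B": (10, 100), "C": (50, 100), "D": (90, 100)}
deltas = deltas_vs_marginal(counts)
print(deltas)
print(aggregate(deltas, [0, 40]))
```

A state with no observations in the target state, like brand A here, sits below the marginal probability by exactly the marginal probability, which corresponds to the bottom end of the colored bar.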

Workflow Animation

Aggregation of Multiple Variables

Context

We use the same auto buyer survey dataset to illustrate the process. In the auto industry, numerous schemes are used to group vehicle types and body styles into so-called segments. Each segment carries a descriptive name, e.g., Compact Car, Full-Size SUV, Minivan, Mid-Size Pickup, Mid-Size Crossover. In our dataset, we have four variables, which each represent such a segmentation scheme. While all these segmentation schemes roughly convey the same information, they differ in their granularity: for instance, variable Segmentation 3 has 23 states; Segmentation 4 has 33. Our objective is now to reduce each one of the segmentation schemes down to three states.

This time, instead of Price, we use the variable MPG - Combined as a target. It represents the survey respondents' estimates of their vehicles' combined fuel economy in miles per gallon (MPG). In other words, we want to create a new aggregation for each segmentation scheme based on fuel economy. Also, the variable MPG - Combined only has two intervals, with one threshold at 22.5. This number has been used in the past as a criterion for so-called "gas guzzlers." So, we are going to use the state <=22.5 as a proxy for poor fuel economy. As a result, we expect each of the existing segments to be "remapped" according to fuel economy.

Workflow

In the Data panel, using Ctrl+Click or Shift+Click, select the variables Segmentation 1, Segmentation 2, Segmentation 3, and Segmentation 4.
This brings up the Multiple Aggregation panel.

Set Target to MPG - Combined, and State to <=22.5.
Set Final Number of States to 3.
Click the Aggregate button to perform the aggregation.
Note that there will be no immediate feedback regarding the results of the aggregation.
Rather, we can only see the results of the aggregation in the Import Report in Step 5 of the Data Import Wizard.
Click Finish to complete Step 4 of the Data Import Wizard.
BayesiaLab opens a new Graph Window with all variables now presented as nodes.
Simultaneously, a prompt comes up offering to display the Import Report.

Click Yes, and the Import Report — featuring all variables, not just the aggregated variables — appears in a new window.