
Benefits of decision trees in solving predictive analytics problems

02.06.16 Viktor Gavrilov

Over the past few decades, decision tree algorithms have been in vogue for solving predictive analytics problems. Because they're generic and rely on an efficient calculation procedure, they can readily deliver a good solution to your specific problem. These methods are the focus of this blog post of mine.

A decision tree is a data structure where each non-leaf node specifies a test of some attribute. We start from the root node of such a tree and—according to test outcomes—move down tree branches until the bottommost leaf nodes are encountered. These nodes hold some class label, or the values of the target attributes.

Decision trees can handle both qualitative, or categorical, attributes (that is, attributes with a fixed number of discrete values), assigning objects to a specific class in a classification problem, and quantitative, or continuous, attributes, predicting a numeric value in a regression problem.


Let’s consider an algorithm that constructs a decision tree for estimating and predicting values of a categorical attribute, or categorical class labels, in an analyzed data set based on the values of other attributes—in other words, a classification problem.

There's an endless number of ways to construct decision trees. We can arrange attributes differently, apply various attribute test conditions at nodes, and stop the recursion using diverse stopping criteria. But we're particularly interested only in trees that predict attribute values as accurately as possible, at a minimum error rate, while capturing dependencies among attributes and correctly predicting class labels of many previously unseen records.

The bad news here is that there are no good and proven algorithms for inducing such an optimal tree within a reasonable time. But the good news is that there are perfectly serviceable algorithms for this task. They try to construct a nearly optimal tree by optimizing a local criterion at each step, in an attempt to grow a tree that is optimal on the whole. Such an algorithm is called greedy. And this is the kind of algorithm that I'm going to talk about.

Decision tree growing algorithm

This method employs a top-down approach under which a tree is grown from its root. The algorithm selects which attribute to test at the root node of the tree. For this purpose, each attribute is examined in order to identify the one that best classifies a given data set, that is, splits the set into classes based on the target attribute. The algorithm uses a greedy search to pick the best attribute and never looks back to reconsider previous choices.

Once the attribute is selected, a tree branch is grown for each value of this attribute, and the data set is split into subsets according to these values. This process is repeated on each derived branch in a recursive manner. The recursion ends once the specified stopping criteria are met.

The main question is how to pick the attribute for the value test that splits the data. We want the recursion to end with leaf nodes that carry a class label of the target attribute, so the data set at each node should be split into subsets with a more homogeneous class distribution: for example, the greater part of a subset belongs to the class watermelon. We also need a quantitative criterion to assess the homogeneity of a distribution.

Let's consider a set of probabilities p1, …, pc, where pi is the probability that some data record of set X belongs to class i. Let's calculate the following entropy:

Entropy(X) = − Σ pi · log2(pi), summed over all classes i = 1, …, c

According to information theory, entropy measures the quantity of information: the number of bits required to encode the message that a randomly selected object, or record, of set X belongs to some class, and to transmit this message to the receiver. When all records belong to a single class, the entropy is zero (0·log2 0 = 0), and there's no need to send any message to the receiver.

When records are equally distributed among all classes, the entropy is maximal: log2 c bits are required, where c is the total number of classes.
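For illustration, here's a minimal Python sketch of this entropy calculation (the function name is my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

# All records in one class: zero bits needed.
print(entropy(["a", "a", "a", "a"]))   # 0.0
# Records evenly distributed among 4 classes: log2(4) = 2 bits.
print(entropy(["a", "b", "c", "d"]))   # 2.0
```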

Now let's calculate the so-called information gain for each attribute A in order to pick an attribute for splitting our data:

Gain(X, A) = Entropy(X) − Σ (|Xa| / |X|) · Entropy(Xa), summed over a ∈ values(A)

where values(A) is the set of all values taken by attribute A, Xa is the subset of records where A = a, and |Xa| and |X| are the numbers of elements in the subset and in the whole set, respectively.

The information gain describes the expected decrease in entropy after splitting a data set on the selected attribute. The second term is the weighted average of the entropies of the derived subsets. The difference shows the reduction in entropy, the number of bits saved when encoding the class of a random object from set X, if we know the values of attribute A and split our data set into subsets on this attribute.

The algorithm chooses the attribute that provides the maximum information gain, that is, the greatest reduction in entropy.
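A direct Python transcription of this formula might look as follows (the record layout and names are my own illustration, not tied to any particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Expected entropy drop from splitting `records` on `attribute`.

    `records` is a list of dicts keyed by attribute name.
    """
    base = entropy([r[target] for r in records])
    total = len(records)
    weighted = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        weighted += len(subset) / total * entropy(subset)
    return base - weighted

toy = [
    {"Hair": "Blond", "Attractive": "No"},
    {"Hair": "Dark",  "Attractive": "No"},
    {"Hair": "Red",   "Attractive": "Yes"},
    {"Hair": "Blond", "Attractive": "Yes"},
]
print(information_gain(toy, "Hair", "Attractive"))  # 0.5
```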

Once the attribute is picked, the original data set is split by the selected attribute to produce subsets of data. The algorithm is run recursively, iterating through every unused attribute of the set.

This process stops when the created subsets are homogeneous enough (a single class prevails), that is, when max(Gain(X,A)) is less than some preset parameter Θ close to zero. As an alternative, you can track set X and stop the process when this set is small enough or completely homogeneous (all its elements belong to a single class).

The stepwise summary for this algorithm is as follows:

  1. If max(Gain(X,A)) < Θ, turn the node into a leaf and assign the label of the prevailing class to this node
  2. If there are no more attributes to split data on, turn the node into a leaf and assign the label of the prevailing class to this node
  3. Otherwise:
    • Choose an attribute corresponding to the maximum Gain(X,A)
    • Create a branch for each value of this attribute
    • For each branch:
      • Create a subset Xa, removing the attribute from the set of attributes
      • Call the algorithm recursively for the subset Xa
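The stepwise procedure above can be sketched in Python roughly as follows (a simplified illustration with made-up names, not the Prognoz Platform implementation):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(records, attr, target):
    """Information gain of splitting `records` (a list of dicts) on `attr`."""
    g = entropy([r[target] for r in records])
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        g -= len(subset) / len(records) * entropy(subset)
    return g

def id3(records, attributes, target, theta=1e-9):
    """Grow a tree: a leaf is a class label, an inner node an (attribute, branches) pair."""
    majority = Counter(r[target] for r in records).most_common(1)[0][0]
    # Steps 1-2: make a leaf if no attributes remain or no split is worthwhile.
    if not attributes:
        return majority
    best = max(attributes, key=lambda a: gain(records, a, target))
    if gain(records, best, target) < theta:
        return majority
    # Step 3: branch on the best attribute and recurse on each subset,
    # excluding the chosen attribute from further consideration.
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        branches[value] = id3(subset, rest, target, theta)
    return (best, branches)
```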

If an attribute is a quantitative one—any real number, for example—data sets are split on this attribute using the A ≤ a0 test, and two branches are grown respectively for true and false. The optimal method for selecting a0 here also relies on the information gain concept.
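As a sketch of one common way to select a0 (assuming candidate thresholds at midpoints between consecutive distinct sorted values; the names are my own):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Pick a0 for the test A <= a0 that maximizes information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    total = len(labels)
    best_a0, best_gain = None, -1.0
    for i in range(1, total):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        a0 = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for v, label in pairs if v <= a0]
        right = [label for v, label in pairs if v > a0]
        g = base - len(left) / total * entropy(left) - len(right) / total * entropy(right)
        if g > best_gain:
            best_a0, best_gain = a0, g
    return best_a0, best_gain

print(best_threshold([1, 2, 3, 10, 11, 12], ["No", "No", "No", "Yes", "Yes", "Yes"]))
```

Here the best boundary falls between 3 and 10, splitting the set into two pure halves.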

If the target attribute is quantitative, the trees that result are used for solving regression problems where leaves predict a real number—not a class, which is the case with classification problems. In case of regression, the algorithm looks for splits that minimize the sum of squared deviations.
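The regression case can be sketched like this (illustrative names, assuming a single numeric input attribute and a leaf that predicts the mean):

```python
def sse(ys):
    """Sum of squared deviations from the mean; a leaf predicts the mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_regression_split(xs, ys):
    """Threshold a0 on x minimizing the total SSE of the two resulting leaves."""
    pairs = sorted(zip(xs, ys))
    best_a0, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        a0 = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= a0]
        right = [y for x, y in pairs if x > a0]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_a0, best_cost = a0, cost
    return best_a0, best_cost

print(best_regression_split([1, 2, 3, 4], [10.0, 10.0, 20.0, 20.0]))  # (2.5, 0.0)
```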

The algorithm above was originally created by J. R. Quinlan. You might know it as Iterative Dichotomiser 3, or ID3.

Let's consider how the ID3 algorithm picks attributes, as this is the key point here. The classification example below has nothing to do with political correctness; however, because it's so illustrative, Quinlan himself favored it.

Height  Hair   Eyes   Attractive?
Short   Blond  Brown  No
Tall    Dark   Brown  No
Tall    Blond  Blue   Yes
Tall    Dark   Blue   No
Short   Dark   Blue   No
Tall    Red    Blue   Yes
Tall    Blond  Brown  No
Short   Blond  Blue   Yes

We're going to turn this absolutely politically incorrect data set into a decision tree where "Attractive?" will be the target attribute.
Let's calculate the entropy of a simple binary classification, Yes or No, for our target attribute. Three of the eight records are Yes and five are No:

Entropy(X) = −(3/8)·log2(3/8) − (5/8)·log2(5/8) ≈ 0.954

Now let's calculate the information gain for each attribute:

Gain(X, Height) ≈ 0.003
Gain(X, Hair) ≈ 0.454
Gain(X, Eyes) ≈ 0.348

Consequently, at the first step the maximum information gain belongs to the attribute Hair, the most informative one. This is the optimal attribute for splitting our data set into the Yes and No classes. The data table above clearly demonstrates this: (Hair = Dark) implies (Attractive? = No), and (Hair = Red) implies (Attractive? = Yes), while the value Blond carries no information on its own. The attribute Height is of practically no importance, since it's the least informative one here.
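We can verify the information gains with a short Python snippet that transcribes the formulas above (variable names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

rows = [
    ("Short", "Blond", "Brown", "No"),
    ("Tall",  "Dark",  "Brown", "No"),
    ("Tall",  "Blond", "Blue",  "Yes"),
    ("Tall",  "Dark",  "Blue",  "No"),
    ("Short", "Dark",  "Blue",  "No"),
    ("Tall",  "Red",   "Blue",  "Yes"),
    ("Tall",  "Blond", "Brown", "No"),
    ("Short", "Blond", "Blue",  "Yes"),
]
target = [r[3] for r in rows]
base = entropy(target)  # entropy of the whole set

gains = {}
for name, col in (("Height", 0), ("Hair", 1), ("Eyes", 2)):
    g = base
    for value in {r[col] for r in rows}:
        subset = [r[3] for r in rows if r[col] == value]
        g -= len(subset) / len(rows) * entropy(subset)
    gains[name] = g
    print(f"Gain(X, {name}) = {g:.3f}")
```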
This is sufficient to illustrate how the ID3 algorithm works, but you can continue growing this tree.
The ID3 algorithm is implemented in the data mining module of the Prognoz Platform. With this algorithm, you can construct a decision tree and represent it as sets of if–then rules. Such a tree can be used to classify data by filling in missing values of the target attribute. You can call data mining functions from any tool of the Prognoz Platform, either on your current data table or on a database table.

Advantages and drawbacks of decision trees

Decision trees are beneficial, since they are:

  • Interpretable at a glance
  • Suitable for handling both categorical and quantitative values
  • Universal for solving both classification and regression problems
  • Capable of handling missing values in attributes and filling them in with the most probable value
  • High-performing when classifying new records, because traversing a built tree is efficient even for massive data sets

These advantages need to be tempered with some drawbacks of decision trees, such as:

  • Decision trees can be unstable. Even minor perturbations in a data set can produce a drastically different tree, because of the hierarchical structure of the tree, where any modification at the top levels results in changes further down the tree.
  • It can be difficult to control the size of the tree, which is critical to the quality of the results. With simple stopping criteria, trees often come out too small or too big.
  • In some complex cases, splitting data into classes might not be helpful. Simple trees split data at nodes on a single attribute value, parallel to the coordinate axes: each attribute is a coordinate axis with its own values. This leads to rectangular classification boxes that group data points belonging to a particular class. Such partitioning may not correspond well with the actual distribution of class-specific data points in the decision space of intricate cases.
  • Information gain tends to prefer attributes with a large number of distinct values. In the extreme case, every record has its own attribute value; the second term in Gain(X,A) then equals 0, yielding the maximum information gain.
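That extreme case is easy to demonstrate. Below, a hypothetical ID attribute (my own illustration) takes a unique value for every record, so each subset holds a single record with zero entropy:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

labels = ["Yes", "No", "No", "Yes", "No", "No"]
ids = list(range(len(labels)))  # a unique "ID" value per record

base = entropy(labels)
# Splitting on ID yields single-record subsets, each with zero entropy,
# so the weighted-entropy term vanishes and the gain equals the base entropy.
weighted = sum(1 / len(labels) * entropy([label]) for label in labels)
gain_id = base - weighted
print(gain_id == base)  # the "useless" ID attribute gets the maximum possible gain
```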

As you can see, these disadvantages aren't trivial, but advanced decision tree algorithms can handle and overcome them.

The bottom line

A decision tree is a very typical example of an algorithm for learning from training examples. Here values of source data attributes are used to construct some decision function, evaluate model parameters, and so on. It is not the user but the algorithm itself that automatically builds a prediction model based on source data. The user selects a method—a decision tree, as in our case—and identifies a class of models to be built and trained using this algorithm.

My next article will be devoted to some modifications of a decision tree induction process to enhance its quality, including the super-popular Random Forest algorithm that draws on the ensemble modeling approach.