Machine Learning: Finding the Root
February 26, 2019
Jason M. Pittman, ScD
You may have thought we were done with decision trees. We are done with respect to the general approach, the types of problems addressed, and the algorithms associated with decision trees. You could say that we're moving from a view of the forest to a view of specific trees. However, there is a bit more to explore when it comes to the underlying mathematical functions associated with navigating data to construct our trees. In our last discussion, I introduced the concept of a cost function and gave a specific example in the Gini coefficient. Gini calculations are useful for a specific type of decision tree algorithm referred to as CART, or classification and regression trees. For other algorithms, such as Iterative Dichotomizer 3 (ID3), we need another method.
Iterative Dichotomizer Decisions
Strictly speaking, we're not computing a split cost when using the ID3 algorithm. The approach here is different from the CART algorithm because the algorithms are...well, different. Both are decision tree algorithms, but CART works with binary decisions, whereas ID3 implements a greedy search to construct a decision tree from the top down. Thus, ID3 doesn't price a split with a cost function; it constructs decisions based on entropy and information gain. More specifically, we need to select the best-fitting attribute as we iterate over the set of possible decision trees, top-down, root node first. At the heart of an ID3-based decision is how well a selected attribute represents the classification we're interested in. This is where entropy and information gain enter the picture.
Entropy, a measure of homogeneity within the data, ranges from zero to one for a two-class problem. If the data in a subset all belong to the same class, entropy is zero. If the data are equally divided between the classes, we get an entropy metric of one. Why do those endpoints matter? Remember, we're dealing with decision trees! ID3 creates subsets within a tree structure based on the entropy metric: the closer a candidate subset sits to zero, the more homogeneous it is, and the better the split. Entropy is calculated as follows.
For one attribute: $E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
We can also measure two attributes as such: $E(T,X) = \sum_{c \in X} P(c)\,E(c)$
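To make the two entropy formulas concrete, here is a minimal Python sketch. Everything in it is my own illustration rather than part of any particular library: the data set is a toy list of dictionaries, and the helper names (entropy, entropy_given_attribute) are invented for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S): sum over the c classes of -p_i * log2(p_i)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(labels).values())

def entropy_given_attribute(rows, attribute, target):
    """E(T, X): entropy of the target after partitioning the rows by the
    values of one attribute, weighted by the size of each partition."""
    total = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum((len(group) / total) * entropy(group)
               for group in groups.values())

# A perfectly mixed two-class set has entropy 1; a pure set has entropy 0.
print(entropy(["yes", "yes", "no", "no"]))  # 1.0
print(entropy(["yes", "yes", "yes"]))       # 0.0
```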
In turn, information gain builds on our concept of entropy: the more a decision split reduces entropy, the higher the gain. Specifically, information gain compares the entropy before a potential branch split with the weighted entropy after it, starting with the root node and moving outward toward the lowest leaves.
Whew. That's deep...
The simple equation associated with information gain paints a clearer picture:
$\mathrm{Gain}(T,X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T,X)$
Thus, we can imagine an inner, hidden simulation going on whereby potential decisions are represented by attribute splits and evaluated against one another, with the highest information gain winning the node.
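Reusing the entropy helpers sketched above, that "simulation" boils down to a few lines: compute Gain(T, X) for every candidate attribute and keep the winner. Again, this is a hedged illustration; the attribute names and the tiny weather-style data set below are invented for the example, not taken from any real data.

```python
def information_gain(rows, attribute, target):
    """Gain(T, X) = Entropy(T) - Entropy(T, X)."""
    labels = [row[target] for row in rows]
    return entropy(labels) - entropy_given_attribute(rows, attribute, target)

def best_attribute(rows, attributes, target):
    """ID3's greedy choice: the attribute whose split yields the highest gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Invented toy data: which attribute best predicts "play"?
data = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
]
print(best_attribute(data, ["outlook", "windy"], "play"))  # outlook
```

On this made-up data, outlook produces the larger gain, so ID3 would make it the root node and then recurse on each resulting subset until a branch is pure (entropy zero) or the attributes run out.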
I view this as a type of pathfinding or even (network) routing protocol. The technical aspects of this analogy may be worth exploring in the near future. Until then, we ought to begin thinking about alternatives to decision trees. Perhaps a good place to visit next is with our neighbors.