Machine Learning - Bias and Extreme Edges
March 19, 2019
By: Jason M. Pittman, ScD
We left off last time examining the ID3 algorithm and the equations for determining a decision tree root. So now we have a decent working knowledge of decision trees, along with some idea of the types of problems machine learning is useful for and the accompanying algorithms. Machine learning is more than what we've looked at so far, much more in fact. Thus, based on where we are in the demystifying machine learning continuum, I thought it might be beneficial to discuss biasing and what I think of as extreme edges.
Bias is interesting to me because it tends to go largely unnoticed. Extreme edges are curious phenomena because they are technically errors but mysteriously appear to be valid results. Mainstream media has been very attentive to these types of issues, and politicians have jumped on the bandwagon.
Bias and Extreme Edges
The operationally important question we should ask is: what leads to bias or extreme edges? First, I need to be careful to distinguish properly between the two conditions. Definition-wise, bias is a systematic inaccuracy in our assumptions, while extreme edges are features or labels that behave more like outliers.
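To make that distinction concrete, here is a minimal Python sketch. The sensor readings and the assumed center are entirely made up for illustration: the extreme edge shows up as a single value far from the rest, while the bias lives in an assumption the data never supported.

```python
import statistics

# Hypothetical sensor readings (synthetic values for illustration only).
sensor_readings = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 11.3]

mean = statistics.mean(sensor_readings)
stdev = statistics.stdev(sensor_readings)

# Extreme edges behave like outliers: individual values far from the rest.
extreme_edges = [x for x in sensor_readings if abs(x - mean) > 2 * stdev]
print("Suspected extreme edges:", extreme_edges)

# Bias, by contrast, lives in our assumptions. If we assume readings
# center on 4.0 when the observed process centers near 5.0, every
# downstream decision inherits that systematic inaccuracy.
assumed_center = 4.0
print("Systematic offset from our assumption:", round(mean - assumed_center, 2))
```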
I think an obvious answer is that not everything is learnable. Yes, that's right: machine learning is not applicable to every problem. We've loosely talked about this answer already, so I'll leave it be. For the time being, I want to concentrate on learnable situations.
In learnable situations, a less obvious answer is that our sense of what the answer must be, formed before we assemble the training dataset and make feature and decision selections, directly biases the processed information. Here, we can overlook or outright exclude data that would clarify the problem. A critical element to understand is that the algorithm itself is often not the source of the bias. No, the algorithm doesn't introduce bias, but it will amplify it.
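Here is a hedged sketch of that kind of selection bias. The applicant records and field names (years_experience, referred, hired) are invented for the example; the point is that the skew is created by our filtering choice before any algorithm ever runs, and a learner trained on the filtered set will simply reproduce it.

```python
# Synthetic applicant records: we "already know" what a good applicant
# looks like, so we filter the training set accordingly.
applicants = [
    {"years_experience": 1, "referred": True,  "hired": True},
    {"years_experience": 7, "referred": False, "hired": True},
    {"years_experience": 2, "referred": True,  "hired": False},
    {"years_experience": 9, "referred": False, "hired": True},
    {"years_experience": 5, "referred": False, "hired": False},
]

# Our preconception: only referred applicants are worth modeling.
# The exclusion happens before any learning takes place.
biased_training_set = [a for a in applicants if a["referred"]]

hire_rate_full = sum(a["hired"] for a in applicants) / len(applicants)
hire_rate_biased = sum(a["hired"] for a in biased_training_set) / len(biased_training_set)

print(f"Hire rate in the full data:   {hire_rate_full:.2f}")
print(f"Hire rate the model will see: {hire_rate_biased:.2f}")
# Any learner trained on biased_training_set will faithfully reproduce,
# and in effect amplify, the skew we baked in before training began.
```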
Perhaps an even less obvious answer is that our decision tree can be too shallow or too deep. In the former case, we can inadvertently lock into early decisions without ever finding later, perhaps more optimal ones. Meanwhile, if our tree is too deep, we can either run into a computational performance boundary or over-select features that appear desirable but are really extremes unrepresentative of the general population.
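A small sketch of that trade-off, assuming scikit-learn is available and using a synthetic dataset: a depth of one locks in the single earliest split, while an unlimited depth chases extremes in the training data.

```python
# Sketch of the shallow-versus-deep trade-off (synthetic data, scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")

# A depth of 1 locks in the first split (under-fitting), while unlimited depth
# tends to fit training-set extremes: training accuracy reaches 1.00 while
# test accuracy typically stalls or drops.
```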
Lastly, we can think about noise in the data. Noise can manifest as errors in the data, incomplete data, or even issues with the substrate containing the source data. The effects can be subtle or profound, but the causes are often simple: think of typos, assigning a value of 1 when you intended a 5, or a misaligned sensor that generates skewed readings. Any of these could be present in the training data or in the sample.
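A toy illustration of how even trivial noise shifts what a learner sees; the ratings, the typo, and the sensor offset below are all invented for the example.

```python
# "True" ratings are synthetic; the corruption mimics a typo (a 5 keyed in
# as a 1) and a misaligned sensor adding a constant offset.
true_ratings = [5, 4, 5, 3, 5, 4, 5, 5, 4, 5]

typo_ratings = true_ratings.copy()
typo_ratings[0] = 1                                 # intended 5, entered as 1

offset_readings = [r + 0.7 for r in true_ratings]   # skewed sensor

print("clean mean: ", sum(true_ratings) / len(true_ratings))
print("typo mean:  ", sum(typo_ratings) / len(typo_ratings))
print("sensor mean:", sum(offset_readings) / len(offset_readings))
# Whether the corruption lands in the training data or in a new sample,
# the learner cannot tell noise from signal on its own.
```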
What to Do About It
There are more types of biases and extreme edges. However, we can pause and pose a solid follow-up question: what can we do to prevent these issues? First, we should always ensure that our training data is clean. That means checking and rechecking the source (a small sketch of what such checks might look like follows below). Second, we should never, under any circumstances, actually look at the training data. Well, how do we design the algorithm then? That's a great question to explore next time!
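As a taste of what "checking and rechecking the source" might look like in practice, here is a minimal sketch. The records and field names are hypothetical, and it only flags duplicates, out-of-range values, and missing entries; a real pipeline would go much further.

```python
# Simple sanity checks on source records before any training happens.
records = [
    {"id": 1, "rating": 5, "age": 34},
    {"id": 2, "rating": 1, "age": 34},
    {"id": 2, "rating": 1, "age": 34},    # duplicate row
    {"id": 3, "rating": 7, "age": -2},    # out-of-range values
    {"id": 4, "rating": None, "age": 51}, # missing value
]

problems = []
seen_ids = set()
for r in records:
    if r["id"] in seen_ids:
        problems.append((r["id"], "duplicate id"))
    seen_ids.add(r["id"])
    if r["rating"] is None or not 1 <= r["rating"] <= 5:
        problems.append((r["id"], f"rating out of range: {r['rating']}"))
    if r["age"] is not None and r["age"] < 0:
        problems.append((r["id"], f"implausible age: {r['age']}"))

for record_id, issue in problems:
    print(f"record {record_id}: {issue}")
```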