=== finding best splits in decision trees ===
Notation: We let TTTF stand for a population of four,
three instances of which are classified as T and one as F, etc.
-- Examples of Gini purity --
Population Gini purity
----------------------------------
TTT 0
TF 1 - (1/4+1/4) = 1/2
TTF 1 - (1/9+4/9) = 4/9
TTTF 1 - (1/16+9/16) = 3/8
TTTTF 1 - (1/25+16/25)= 8/25
--- Example of choosing best split (wrt. Gini purity)
Let us have a population
1: T, 2: T, 3: F, 4: T
which we can split in two ways:
A: (1: T, 2: T) and (3: F, 4: T)
B: (1: T, 2: T, 3: F) and (4: T)
B might seem best, in that one component is totally pure,
and the other (TTF) is "more pure" than the FT component of A.
However, A has the highest information gain, as demonstrated by the
below calculation. Intuitively, this is because the totally pure
component of A (TT) is larger than the totally pure component of B.
---------------------------------------------------------------------------
A B
--------------------------------------------------
Purity 1/2 x 0 + 1/2 x 1/2 = 1/4 x 0 + 3/4 x 4/9
1/4 1/3
Information-gain 3/8 - 1/4 = 3/8 - 1/3 =
1/8 > 1/24
=== ====
Information-content 1/2 log(2) + 1/2 log(2) = 1/4 log(4) + 3/4 log(4/3) =
1 2 - 3/4 log(3) =
0.811
Information-gain-ratio 0.125 > 0.051
===== =====
---------------------------------------------------------------------------
-- motivating need to take information content into account
Let us have a continuous population where
N items are less than a given threshold W
N items are greater than W
3/4 of the items less than W are classified as T
1/4 of the items less than W are classified as T
We can split in two ways:
A binary, along W
B fine-grained, making all classes pure
---------------------------------------------------------------------------
A B
--------------------------------------------------
Purity 1/2 x 3/8 + 1/2 x 3/8 =
3/8 0
Information-gain 1/2 - 3/8 = 1/2 - 0 =
1/8 < 1/2
Information-content 1/2 log(2) + 1/2 log(2) = log(2n)
1
Information-gain-ratio 1/8 1/(2 log(2n))
=== =============
-------------------------------------------------------------------
So for sufficiently large n, option A is to be preferred,
as the extra purity in B is not worth the more fine-grained split.