Decision Tree: Statistical Terminology
Analysis Count: The number of records in the analysis selection.

Non-Analysis Count: The number of records in the base selection but not in the analysis selection.

Base Count: The number of records in the base selection.
Analysis %: The proportion of analysis selection records within a particular node or dimension value, i.e. the Analysis Count for the node or dimension value as a percentage of the Base Count for that node or dimension value.

Non-Analysis %: The proportion of non-analysis selection records within a particular node or dimension value, i.e. the Non-Analysis Count for the node or dimension value as a percentage of the Base Count for that node or dimension value.
Dimension Value: In the case of a selector variable, this is one of the variable categories (e.g. YOB = 1970).
% of [all] Analysis: The proportion of the whole analysis selection within a particular node or dimension value.

% of [all] Non-Analysis: The proportion of the whole non-analysis selection within a particular node or dimension value.

% of [all] Base: The proportion of the whole base selection within a particular node or dimension value.
Index: The Index measures the extent to which the node has homed in on the analysis selection, compared to the root node. The Index is the ratio of "% of Analysis" to "% of Base" for a node or dimension value, expressed around 100.
Gain: The Gain measures the extent to which the node has homed in on the analysis selection, compared to the root node. The Gain of a node is the Analysis % for the node divided by the Analysis % in the root node.
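A short sketch of the Index and Gain definitions, using assumed node and root counts (not from any real dataset):

```python
# Assumed counts for the node and the root of the tree.
node_analysis, node_base = 300, 1000
root_analysis, root_base = 2000, 10000

pct_of_all_analysis = 100 * node_analysis / root_analysis   # 15.0
pct_of_all_base = 100 * node_base / root_base               # 10.0

# Index: "% of Analysis" over "% of Base", expressed around 100.
index = 100 * pct_of_all_analysis / pct_of_all_base         # 150.0: node is enriched

# Gain: the node's Analysis % divided by the root's Analysis %.
gain = (node_analysis / node_base) / (root_analysis / root_base)   # ~1.5
```

Note that the two measures agree: an Index of 150 corresponds to a Gain of 1.5.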
Gini: The Gini Impurity measure for the node (range = 0 to 1), calculated as 1 minus the sum of the squared probabilities of membership of each group.
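The Gini calculation is easy to sketch; the two-group counts here are assumed figures:

```python
def gini_impurity(counts):
    """Gini impurity: 1 minus the sum of squared group-membership probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Binary case: (analysis, non-analysis) counts in a node.
mixed = gini_impurity([300, 700])   # 1 - (0.3**2 + 0.7**2) ≈ 0.42
even = gini_impurity([500, 500])    # maximum impurity for two groups: 0.5
pure = gini_impurity([1000, 0])     # a pure node scores 0.0
```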
Chi Square: This provides a measure of association between the dimension being examined and the probability that someone is in the analysis selection.
Cramer’s V: This provides a measure of association between the dimension being examined and the probability that someone is in the analysis selection. It is a commonly used measure for assessing the association in a two-way contingency table, based on the Chi-square value but adjusting for the number of records and the shape of the table. A modified version of the traditional formula is used, to allow for the fact that all statistics are based on a binary response (analysis vs non-analysis): because the number of columns c is always 2, the divisor is always (r - 1) rather than the usual minimum of (r - 1) and (c - 1). The divisor is essentially the number of categories minus one.
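A sketch of the Chi-square statistic and the modified Cramer's V for an r x 2 table. The three-band dimension and its counts are invented for illustration:

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for an r x c contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v_modified(table):
    """Modified Cramer's V: the divisor is always (rows - 1), since c is always 2."""
    n = sum(sum(row) for row in table)
    r = len(table)
    return math.sqrt(chi_square(table) / (n * (r - 1)))

# Assumed 3-category dimension: (analysis, non-analysis) counts per category.
table = [[30, 70], [50, 50], [20, 80]]
chi2 = chi_square(table)        # 21.0
v = cramers_v_modified(table)   # sqrt(21 / (300 * 2)) ≈ 0.187
```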
Z-Score: The Z-Score of a node measures the significance of any difference in Analysis % between this node and its parent, or between a dimension value and the dimension overall. Z-Score = (Difference in Analysis %) / Standard Error, where Standard Error = Sqrt(Parent Analysis % * (1 - Parent Analysis %) / Node Base Count).
Minimum Z Score (after splits): This is a measure of how different the least distinct child node is compared to the parent node. When a split is made, a Z-Score is calculated for each of the child nodes, and this measure is the minimum absolute value of these scores.
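The Z-Score and Minimum Z Score follow directly from the formula above. The parent rate and the child figures below are assumed; percentages are expressed as proportions in [0, 1]:

```python
import math

def z_score(node_analysis_pct, node_base_count, parent_analysis_pct):
    """Z-Score of a node against its parent (proportions in [0, 1])."""
    se = math.sqrt(parent_analysis_pct * (1 - parent_analysis_pct) / node_base_count)
    return (node_analysis_pct - parent_analysis_pct) / se

# Parent node has Analysis % = 20%. Two hypothetical child nodes from a split,
# given as (Analysis %, Base Count) pairs.
parent_pct = 0.20
children = [(0.30, 400), (0.15, 600)]

z_scores = [z_score(pct, n, parent_pct) for pct, n in children]
min_z = min(abs(z) for z in z_scores)   # Minimum Z Score (after splits), ≈ 3.06
```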
Power: The power of a Decision Tree model measures how good the Decision Tree is at identifying records in the Analysis selection. It ranges between 0 (worst) and 1 (best). The power is calculated from the Gains Chart curves and is based on the distance between the Decision Tree line and the Random and Hindsight lines: it is the ratio of the area between the Decision Tree line and the Random line to the area between the Hindsight line and the Random line.
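The power calculation can be sketched with the trapezium rule over the gains curves. The gains-curve points and the 20% analysis rate below are made-up figures:

```python
# A gains chart plots cumulative fraction of base (x) against cumulative
# fraction of analysis records found (y).

def area(xs, ys):
    """Area under a piecewise-linear curve, by the trapezium rule."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))

xs = [0.0, 0.2, 0.5, 1.0]       # cumulative fraction of base
tree = [0.0, 0.5, 0.8, 1.0]     # decision-tree gains curve (assumed)
random_line = xs                # random targeting: y = x

# Hindsight line: if the analysis selection is 20% of the base, perfect
# targeting finds all analysis records in the first 20% of the base.
overall_rate = 0.2
hindsight_xs = [0.0, overall_rate, 1.0]
hindsight_ys = [0.0, 1.0, 1.0]

random_area = area(xs, random_line)                        # 0.5: area under y = x
tree_area = area(xs, tree) - random_area
hindsight_area = area(hindsight_xs, hindsight_ys) - random_area
power = tree_area / hindsight_area                         # 0.195 / 0.4 = 0.4875
```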
Significance: The significance of the split is assessed using a Chi Square test to measure the association between the child nodes and the analysis selection. The Significance figure is calculated as -Log10(P value), where the P value is taken directly from the Chi Square test. The smaller the P value, the more significant the child nodes. Typically the P values are so small that they display as 0.0000, so the Significance value may be more useful than the raw P value.
Bonferroni Adjustment: The Bonferroni Adjustment reduces the P value threshold to allow for the fact that multiple dimensions are being used. All it does is divide the chosen P value by the number of splits being considered at a node. The P value is the probability of making a mistake when deciding to use a split (i.e. the probability that you make a split when actually there are no real differences in the data that warrant dividing a node further). Setting a P value of 0.05 means you want the chance of making a mistake to be 5%. This scope for error is divided out between the candidate splits, so that with 5 candidate splits the scope for error is limited to 1% for each split. This is a very conservative approach, but it does mean that your probability of making a mistake is no more than 5% (the P value) overall.
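The Significance figure and the Bonferroni Adjustment reduce to a couple of lines each; the P value and the number of candidate splits below are assumed:

```python
import math

# Significance = -Log10(P value) from the split's Chi Square test.
p_value = 3.2e-12                      # assumed P value from a Chi Square test
significance = -math.log10(p_value)    # ≈ 11.49: larger = more significant

# Bonferroni adjustment: divide the chosen P value threshold among the
# candidate splits being considered at the node.
alpha = 0.05
candidate_splits = 5
adjusted_alpha = alpha / candidate_splits   # each split is tested at 0.01

split_is_significant = p_value < adjusted_alpha
```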
Measures used in the Next Splits
The Next Splits panel displays statistics both before and after a split is created.
The "Before Splits" statistics are based on all the individual categories in the dimension. For example, the Chi Square (Before Splits) statistic for the Income variable will be based on an 11 x 2 contingency table, since there are 11 income bands.
The "After Splits" statistics are based on the branches created during the split. For example, the Chi Square (After Splits) statistic for the Income variable will be based on a 2 x 2 contingency table, since there are only 2 branches, despite there being 11 income bands.
The "Minimum Z Score (after splits)" statistic is also based on the branches, and is set to the lowest absolute Z Score among the branches.
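To illustrate the difference, here is a sketch of the "before" (11 x 2) and "after" (2 x 2) tables for a hypothetical 11-band Income variable; all counts are invented:

```python
# (analysis, non-analysis) counts per income band: the 11 x 2 "before" table.
before = [(12, 88), (15, 85), (20, 80), (22, 78), (25, 75), (30, 70),
          (35, 65), (40, 60), (45, 55), (50, 50), (55, 45)]

# Suppose the split sends the first 5 bands down one branch and the rest
# down the other: pooling the band counts gives the 2 x 2 "after" table.
branch_low = [sum(col) for col in zip(*before[:5])]    # [94, 406]
branch_high = [sum(col) for col in zip(*before[5:])]   # [255, 345]
after = [branch_low, branch_high]
```

The "Before Splits" Chi Square would be computed from `before`, and the "After Splits" Chi Square from `after`.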