Decision Tree: Statistical Terminology

Each of the measures used in the Decision Tree statistics is described below.

Analysis Count

The number of records in the analysis selection

Non-Analysis Count

The number of records in the base selection but not in the analysis selection

Base Count

The number of records in the base selection

Analysis %

The proportion of analysis selection records within a particular node or dimension value, i.e. the Analysis Count for the node or dimension value, as a percentage of the Base Count for that node or dimension value.

Non-Analysis %

The proportion of non-analysis selection records within a particular node or dimension value, i.e. the Non-Analysis Count for the node or dimension value, as a percentage of the Base Count for that node or dimension value.

Dimension Value

In the case of a selector variable, this is one of the variable categories (e.g. YOB = 1970)

% of [all] Analysis

The proportion of the whole analysis selection within a particular node or dimension value.

  •  In the case of a node this is the Analysis Count for the particular node, as a percentage of the Analysis Count for the root node.
  •  In the case of a dimension value this is the Analysis Count for the particular dimension value, as a percentage of the Analysis Count for the variable (all dimension values).

% of [all] Non-Analysis

The proportion of the whole non-analysis selection within a particular node or dimension value.

  • In the case of a node this is the Non-Analysis Count for the particular node, as a percentage of the Non-Analysis Count for the root node.
  • In the case of a dimension value this is the Non-Analysis Count for the particular dimension value, as a percentage of the Non-Analysis Count for all dimension values.

% of [all] Base

The proportion of the whole base selection within a particular node or dimension value.

  • In the case of a node this is the Base Count for the particular node, as a percentage of the Base Count for the root node.
  • In the case of a dimension value this is the Base Count for the particular dimension value, as a percentage of the Base Count for all dimension values.
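
To show how these counts and percentages fit together, here is a minimal Python sketch that computes each of the measures above for a single node, assuming illustrative counts for the node and for the root node (the figures and variable names are not taken from the product).

```python
# Illustrative counts for a hypothetical node and the root node.
node_analysis_count = 300      # records in the analysis selection, within this node
node_base_count = 1_000        # records in the base selection, within this node
root_analysis_count = 2_000    # analysis selection records in the root node
root_base_count = 10_000       # base selection records in the root node

node_non_analysis_count = node_base_count - node_analysis_count
root_non_analysis_count = root_base_count - root_analysis_count

# Proportions within the node.
analysis_pct = 100.0 * node_analysis_count / node_base_count          # Analysis %
non_analysis_pct = 100.0 * node_non_analysis_count / node_base_count  # Non-Analysis %

# Proportions of the whole selections that fall within this node.
pct_of_all_analysis = 100.0 * node_analysis_count / root_analysis_count
pct_of_all_non_analysis = 100.0 * node_non_analysis_count / root_non_analysis_count
pct_of_all_base = 100.0 * node_base_count / root_base_count

print(f"Analysis %              = {analysis_pct:.2f}")              # 30.00
print(f"Non-Analysis %          = {non_analysis_pct:.2f}")          # 70.00
print(f"% of [all] Analysis     = {pct_of_all_analysis:.2f}")       # 15.00
print(f"% of [all] Non-Analysis = {pct_of_all_non_analysis:.2f}")   # 8.75
print(f"% of [all] Base         = {pct_of_all_base:.2f}")           # 10.00
```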

Index

The Index measures the extent to which the node has homed in on the analysis selection, compared to the root node

  • An index of 100: % of Analysis = % of Base
  • An index of greater than 100: % of Analysis greater than % of Base
  • An index of less than 100: % of Analysis less than % of Base

The index is the ratio of "% of Analysis" to "% of Base" for a node or dimension value, expressed around 100.
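
A minimal sketch of the Index, assuming the definition above (the ratio of "% of Analysis" to "% of Base", expressed around 100); the counts are the same illustrative figures as in the earlier sketch.

```python
def index(node_analysis, node_base, root_analysis, root_base):
    """Index = ("% of Analysis" / "% of Base") * 100, per the definition above."""
    pct_of_all_analysis = node_analysis / root_analysis
    pct_of_all_base = node_base / root_base
    return 100.0 * pct_of_all_analysis / pct_of_all_base

# Using the illustrative counts from the earlier sketch:
print(index(300, 1_000, 2_000, 10_000))  # 150.0 -> node is richer in analysis records than the base
```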

Gain

The Gain measures the extent to which the node has homed in on the analysis selection, compared to the root node.

  • A gain of 1.0 = this node has the same Analysis % as the root node.
  • A gain of more than 1.0 = this node has a higher Analysis % than the root node.
  • A gain of less than 1.0 = this node has a lower Analysis % than the root node.

The Gain of a node is the Analysis % for the node divided by the Analysis % in the root node.
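
A corresponding sketch of the Gain, assuming it is simply the Analysis % for the node divided by the Analysis % for the root node, as described above; the counts remain illustrative.

```python
def gain(node_analysis, node_base, root_analysis, root_base):
    """Gain = Analysis % for the node / Analysis % for the root node."""
    node_analysis_pct = node_analysis / node_base
    root_analysis_pct = root_analysis / root_base
    return node_analysis_pct / root_analysis_pct

print(gain(300, 1_000, 2_000, 10_000))  # 1.5 -> equivalent to an Index of 150
```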

Gini

The Gini Impurity measure for the node (range = 0 to 1)

  • Min = 0, when all records in the node belong to the same target group (Yes or No)

Calculated from “1 - Sum of Squared Probabilities of membership to each group”
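
A minimal sketch of the Gini Impurity calculation for a binary (analysis vs non-analysis) node, following the "1 - Sum of Squared Probabilities" description above; the counts are illustrative.

```python
def gini_impurity(analysis_count, base_count):
    """Gini impurity = 1 - sum of squared group-membership probabilities (binary case)."""
    p_analysis = analysis_count / base_count
    p_non_analysis = 1.0 - p_analysis
    return 1.0 - (p_analysis ** 2 + p_non_analysis ** 2)

print(gini_impurity(300, 1_000))    # 0.42 -> mixed node
print(gini_impurity(1_000, 1_000))  # 0.0  -> pure node, all records in the analysis selection
```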

Chi Square

This provides a measure of association between the dimension being examined and the probability that someone is in the analysis selection.

Cramer’s V

This provides a measure of association between the dimension being examined and the probability that someone is in the analysis selection.  It is a commonly used measure for assessing the association in a 2-way contingency table, based on the Chi-square value, but adjusting for the number of records and the shape of the table.

  • Cramer’s V values range from 0.0 (no association) to 1.0 (maximum association).

A modified version of the traditional formula is used, to allow for the fact that all statistics are based on a binary response (analysis vs non-analysis).  The standard formula is Cramer's V = Sqrt (Chi Square / (N * min(r-1, c-1))), where N is the number of records, r the number of rows and c the number of columns; here the divisor is always (r-1) rather than the minimum, since c is always 2.  The divisor is essentially the number of categories minus one.
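
The sketch below illustrates the modified Cramer's V described above, assuming an r x 2 contingency table of illustrative counts and the availability of scipy; the (r - 1) divisor replaces the min(r - 1, c - 1) term of the standard formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative r x 2 contingency table: rows = dimension categories,
# columns = (analysis, non-analysis) counts.
table = np.array([
    [120, 380],
    [200, 300],
    [ 80, 420],
])

chi2, p_value, dof, expected = chi2_contingency(table)

n = table.sum()      # total number of records
r = table.shape[0]   # number of categories in the dimension
cramers_v = np.sqrt(chi2 / (n * (r - 1)))   # modified divisor: always (r - 1), since c = 2

print(f"Chi Square = {chi2:.2f}, P value = {p_value:.4g}, Cramer's V = {cramers_v:.3f}")
```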

Z-Score

The Z-Score of a node measures the significance of any difference in terms of Analysis %, between this node and its parent, or between a dimension value and the dimension overall.

Z-Score = (Difference in Analysis %) / Standard Error

Standard Error = Sqrt (Parent Analysis % * (1-Parent Analysis %) / Node Base Count)

  • A large Z-Score (positive or negative) indicates that the child node is significantly different to its parent.
  • A small Z-Score (close to zero) indicates that any difference in Analysis % is not very significant.  This is more likely when some or all of the following apply:  
    • a small difference in Analysis %
    • a small node base count
    • a very high (e.g. >95%) or very low (e.g. <5%) Analysis %.
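
A minimal sketch of the Z-Score calculation for a child node against its parent, following the two formulas above; proportions are used in place of percentages, and the counts are illustrative.

```python
from math import sqrt

def z_score(child_analysis, child_base, parent_analysis, parent_base):
    """Z-Score = (difference in Analysis %) / standard error, per the formulas above."""
    child_pct = child_analysis / child_base
    parent_pct = parent_analysis / parent_base
    standard_error = sqrt(parent_pct * (1.0 - parent_pct) / child_base)
    return (child_pct - parent_pct) / standard_error

# Child node with a noticeably higher Analysis % than its parent.
print(round(z_score(300, 1_000, 2_000, 10_000), 2))  # ~7.91 -> a significant difference
```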

Minimum Z Score (after splits)

This is a measure of how different the least distinct child node is compared to the parent node.  When a split is made, a Z-Score is calculated for each of the child nodes and this measure is the minimum (absolute) value of these scores.
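
Building on the same formula, the sketch below takes the minimum absolute Z-Score across a set of illustrative child nodes, as described above.

```python
from math import sqrt

def z_score(child_analysis, child_base, parent_pct):
    """Z-Score of a child node against a parent with the given Analysis proportion."""
    se = sqrt(parent_pct * (1.0 - parent_pct) / child_base)
    return (child_analysis / child_base - parent_pct) / se

# Parent node with a 20% Analysis %, split into three illustrative child nodes.
parent_pct = 0.20
children = [(300, 1_000), (150, 900), (50, 600)]  # (analysis count, base count) per child

min_abs_z = min(abs(z_score(a, b, parent_pct)) for a, b in children)
print(round(min_abs_z, 2))  # 2.5 -> driven by the least distinct child node
```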

Power

The power of a Decision Tree model measures how good the Decision Tree is at identifying records in the Analysis selection.  It ranges from 0 to 1 (best).

  • Power = 0: the Decision Tree is no better than random.
  • Power = 1: the Decision Tree is as good as hindsight.  Selecting the best nodes from the Decision Tree will enable you to select all of the Analysis selection without picking up any of the rest of the base selection.

The power is in fact calculated from the Gains Chart curves and is based on the distance between the Decision Tree line and the Random and Hindsight lines.  The power is the ratio of (the area between the Decision Tree and Random Line) to (the area between the Hindsight Line and Random Line).
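
The sketch below illustrates the area-ratio calculation described above using a simple trapezium rule on an invented gains curve; the cumulative percentages, the overall analysis rate and the helper names are assumptions, not the product's internal implementation.

```python
def trapezium_area(xs, ys):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))
    )

# Illustrative gains curve: cumulative proportion of base selected (x)
# vs cumulative proportion of analysis captured (y), best nodes first.
xs = [0.0, 0.1, 0.3, 0.6, 1.0]
ys = [0.0, 0.3, 0.6, 0.85, 1.0]

overall_analysis_rate = 0.2   # analysis count / base count in the root node

tree_area = trapezium_area(xs, ys)
random_area = 0.5                                   # area under the diagonal (random) line
hindsight_area = 1.0 - overall_analysis_rate / 2.0  # hindsight reaches 100% at x = analysis rate

power = (tree_area - random_area) / (hindsight_area - random_area)
print(round(power, 3))  # ~0.481 for these illustrative figures
```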

Significance (P value)

The significance of the split is assessed using a Chi Square test to measure the association between the child nodes and the analysis selection.

  • The higher the Significance figure, the more significant the child nodes.  These figures are capped at 10 (equivalent to a P value of 0.0000000001, i.e. 10 decimal places).
  • A Significance of 0 indicates that the child nodes are not significant.

The Significance figure is calculated as -Log10 (P value), where the P value is taken directly from the Chi Square test.  The smaller the P value, the more significant the child nodes.  Typically the P values are so small that they appear as 0.0000; the Significance value may therefore be more useful.
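
A minimal sketch of turning a Chi Square P value into the Significance figure, assuming scipy is available and an illustrative 2 x 2 table for the child nodes; the cap at 10 follows the description above.

```python
import math
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2 x 2 contingency table for the child nodes of a split.
table = np.array([
    [300, 700],   # child 1: (analysis, non-analysis)
    [150, 850],   # child 2
])

_, p_value, _, _ = chi2_contingency(table)

# Significance = -Log10(P value), capped at 10 (i.e. P values below 1e-10).
significance = min(10.0, -math.log10(p_value)) if p_value > 0 else 10.0
print(f"P value = {p_value:.4g}, Significance = {significance:.2f}")
```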

Bonferroni Adjustment

The Bonferroni Adjustment reduces the P value to allow for the fact that multiple dimensions are being used.  All it does is divide the P value by the number of splits being considered at a node.

The P value is the probability of making a mistake when deciding to use a split (i.e. the probability that you make a split when actually there are no real differences in the data that warrant dividing a node further).  Setting a P value of 0.05 means you want the chance of making a mistake to be 5%.  This scope for error is divided out between the candidate splits, so that if you had 5 splits, you would limit your scope for error to 1% for each split.  This is a very conservative approach, but does mean that your probability of making a mistake is no more than 5% (the P value) overall.
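
A minimal sketch of the adjustment as described above: the chosen P value (the overall scope for error) is divided equally among the candidate splits; the figures are illustrative.

```python
p_value_threshold = 0.05      # overall scope for error chosen by the user
candidate_splits = 5          # number of splits being considered at the node

adjusted_threshold = p_value_threshold / candidate_splits   # 0.01 per split

# A candidate split is only accepted if its own P value beats the adjusted threshold,
# keeping the overall chance of a mistaken split at no more than 5%.
split_p_value = 0.004
print(split_p_value < adjusted_threshold)  # True -> the split is considered significant
```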

Measures used in the Next Splits

The Next Splits panel displays statistics both before and after a split is created.

The "Before Splits" statistics are based on all the individual categories in the dimension.  For example, the Chi Square (Before Splits) statistic for the Income variable, will be based on an 11 x 2 contingency table, since there are 11 income bands.

The "After Splits" statistics are based on the branches created during the split.  For example, the Chi Square (After Splits) statistic for the Income variable, will be based on an 2 x 2 contingency table, since there are only 2 branches, despite there being 11 income bands.

The "Minimum Z Score (after splits)" statistic is also based on the branches, and is set to be the lowest Z Score associated with each of the branches.