Decision Tree: What can I learn from the Next Splits panel?

The Next Splits panel allows you to investigate the decisions made at any particular split.  As each node is split, each of the dimensions is examined and, in effect, a profile-like report is generated showing the analysis and base counts within each value of the dimension (e.g. each income band).  The nodes produced at a split try to isolate customers in the analysis selection from those not in the analysis selection.

The Algorithm Options dictate which dimension is used to split a node, and how the values of the dimension are grouped to create the child nodes.  The default PWE algorithm does this by placing values which are above the Mean PWE for the variable in one node and values which are below it in another.  This essentially isolates values for which the Analysis % is high (e.g. those containing more Swedish holiday makers) from those where it is low.  For more details on setting these see the section How do I set the Algorithm Options?
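The grouping step described above can be sketched as follows. This is a minimal illustration only, not the product's implementation: the PWE figures are made-up inputs, the function name `mean_split` is hypothetical, the mean is assumed to be a simple unweighted mean across values, and sending a value that sits exactly on the mean to the lower branch is also an assumption.

```python
from statistics import mean


def mean_split(pwe_by_value):
    """Group dimension values into two branches around the mean PWE.

    pwe_by_value maps each dimension value (e.g. an income band) to its PWE.
    Assumption: the mean is unweighted, and ties go to the lower branch.
    """
    threshold = mean(pwe_by_value.values())
    above = [value for value, pwe in pwe_by_value.items() if pwe > threshold]
    below = [value for value, pwe in pwe_by_value.items() if pwe <= threshold]
    return threshold, above, below


# Hypothetical PWE values for four income bands (illustrative numbers only).
pwe = {"<10k": -0.40, "10-20k": -0.10, "20-30k": 0.20, "30k+": 0.50}

threshold, above, below = mean_split(pwe)
print("Mean PWE:", round(threshold, 4))
print("Branch above the mean:", above)
print("Branch below the mean:", below)
```

With these inputs the two higher income bands end up in one branch and the two lower bands in the other.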

The Next Splits panel summarises the result of this process.

For more details see below.  For an example of using the information in the Next Splits panel, see the Example using the Next Splits panel.

Colours Used

The same colouring system, based on analysis %, is used within the Next Splits panel as elsewhere in the Decision Tree.  The default colours are:

  • Red - Analysis % higher than in the root node

  • Blue - Analysis % lower than in the root node

  • Grey - Analysis % in line with the root node

The colour used for the majority of the row is based on the analysis % of this individual category.

The colour used for the Branch column is based on the combined analysis % of all categories that are assigned to this branch.

Looking at the example above, it is apparent that the unclassified category has an analysis % lower than the root node (the row is coloured blue), but it has been assigned to branch (1), which on average has an analysis % higher than the root node (the branch cell is coloured red).  You might want to try manually altering this split to group the unclassified category in the other branch.  In this case, doing so actually produces a split that is not quite as good (Z score of 80.34 instead of 80.46).
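The difference between the row colour and the Branch colour can be sketched as below. This is an illustration under stated assumptions, not the product's logic: the counts are invented, the function names are hypothetical, and the `tolerance` that decides what counts as "in line with" the root node is an assumed parameter, since its real value is not stated here.

```python
def analysis_pct(analysis_count, base_count):
    """Analysis % of a category or branch: analysis records as a % of base records."""
    return 100.0 * analysis_count / base_count


def colour(pct, root_pct, tolerance=0.5):
    """Default colour scheme. `tolerance` (in percentage points) is an assumption."""
    if pct > root_pct + tolerance:
        return "red"
    if pct < root_pct - tolerance:
        return "blue"
    return "grey"


# Hypothetical (analysis count, base count) for the categories in one branch.
branch_categories = {
    "Unclassified": (40, 1000),
    "Band A": (900, 5000),
    "Band B": (600, 4000),
}
root_pct = 10.0  # hypothetical analysis % of the root node

# Row colour: based on each individual category.
row_colours = {
    name: colour(analysis_pct(a, b), root_pct)
    for name, (a, b) in branch_categories.items()
}

# Branch colour: based on all categories in the branch combined.
branch_analysis = sum(a for a, _ in branch_categories.values())
branch_base = sum(b for _, b in branch_categories.values())
branch_colour = colour(analysis_pct(branch_analysis, branch_base), root_pct)

print(row_colours)
print("Branch:", branch_colour)
```

Here the Unclassified row is blue (4% against a root of 10%), while the branch it belongs to is red because the combined analysis % of the branch is 15.4%.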

Dimension Level statistics

The key statistics at a dimension level are:

  • Status: One of the following values.

    Used: The valid split that has been used to create the child nodes.  In automatic mode this will be the split with a rank of 1.

    Forced: The invalid split that has been used to create the child nodes.  In automatic mode, an invalid split will never be used.

    Valid: A split which has passed all the split validation checks, but has not been used.

    Invalid: A split which has failed one or more of the split validation checks, and has not been used.

    Excluded By User: A dimension which has not been included in the Dimensions Panel.  In automatic mode an excluded dimension will never be used to split a node.

    No Splits: No splits could be created for this dimension.  The most common reason for this is that every person in this node was in the same category (e.g. all have the same Income band).

  • Reason if invalid: One of the following values.

    None: The status is Used or Valid.

    Resulting nodes are too small: At least one branch node created by this split has a base count lower than the Minimum Size (new nodes) stopping reason.

    Split is not significant: The Significance value of this split is lower than the Significance validation condition.

    Categories are too different to merge: The Multi-Way PWE strategy or CHAID strategy has resulted in more branches than the maximum allowed.

    All categories are allocated to the same branch: The Multi-Way PWE strategy or CHAID strategy has resulted in a single branch.

    All categories are 100% pure: All the categories contain either all or no records from the analysis selection.

    Insufficient categories to create minimum number of nodes: It is not possible to create a split with the minimum number of nodes, as there are not enough populated categories in this node for this dimension.

    All non-zero flags have same counts: For Flag Array variables, all populated categories have the same counts, e.g. the number of people in the analysis and base selection is the same for all the Newspapers.

  • Rank: This indicates how good the dimension is for use in splitting the node

    1 = the dimension is the best for splitting (based on the Split Selection Measure in the Advanced Algorithm Options)

    2+ = the order of this dimension amongst the other valid dimensions.

    0 = the dimension is Invalid

  • Description: Description of the dimension (e.g. Income)

  • Name: System name of the variable (e.g. peIncom)

  • Cardinality: The number of categories of this variable which are populated within this node

  • Number of Branches : The number of child nodes created by this split.

    Will be set to 0 if the split has not been created successfully.

  • Mean Index: The mean Index across the values in this dimension

  • Minimum Index: The minimum Index across the values in this dimension

  • Maximum Index: The maximum Index across the values in this dimension

  • Cramer’s V (before split) : Statistical measure based on the individual categories

    The higher the value the better the dimension is for creating splits, since the individual categories show differences in their analysis %.

    Since it is based on the categories before splitting, this measure is unaffected by the splits created.

    With the Split Selection Measure set to Cramer’s V, this is the measure used to rank the dimensions and choose which to split on.

  • Chi Square (before split) : Statistical measure based on the individual categories.

    The higher the value the better the dimension is for creating splits, since the individual categories show differences in their analysis %.

    Since it is based on the categories before splitting, this measure is unaffected by the splits created.

    With the Split Selection Measure set to Chi Square, this is the measure used to rank the dimensions and choose which to split on.

  • Mean PWE: The mean PWE across the values in this dimension

    With the Split Creation Strategy set to Mean Split, values (e.g. Income bands) are separated into the 2 branches based on whether their PWE is higher or lower than this.

  • Significance : Statistical measure based on the splits created.

    The higher the value the better the splits created are.

    Since it is based on the splits created, this measure changes as the splits change.

  • Cramer’s V (after split) : Statistical measure based on the splits created.

    The higher the value the better the splits created are.

    Since it is based on the splits created, this measure changes as the splits change.

  • Chi Square (after split): Statistical measure based on the splits created.

    The higher the value the better the splits created are.

    Since it is based on the splits created, this measure changes as the splits change.

  • Minimum Z Score (after split): Statistical measure based on the splits created.

    The higher the value the better the splits created are.

    Since it is based on the splits created, this measure changes as the splits change.

  • Last Merge : An indication of how easily the CHAID or PWE algorithm created the splits.

    For CHAID, this is the Chi Square p value between categories taken from the last successful merge (lower = more different).

    A merge is not allowed when the value is lower than "p value to merge".  Lower values of "p value to merge" lead to fewer branches.

See the section on Statistical Terminology for more details.
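As a rough illustration of the "before split" measures and the Rank column, the sketch below computes the textbook Pearson Chi Square and Cramér's V from the analysis and non-analysis counts of each category, then ranks the valid dimensions. It assumes the standard formulas, which may differ in detail from the product's calculation; the counts, the validity flags and the function names are all hypothetical.

```python
from math import sqrt


def chi_square(counts):
    """Pearson chi-square for a list of (analysis, non_analysis) counts per category.

    Assumes every category is populated and both column totals are non-zero.
    """
    total_a = sum(a for a, _ in counts)
    total_n = sum(n for _, n in counts)
    total = total_a + total_n
    chi2 = 0.0
    for a, n in counts:
        row_total = a + n
        for observed, column_total in ((a, total_a), (n, total_n)):
            expected = row_total * column_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(counts):
    """Cramer's V for a categories x 2 table, where min(rows - 1, cols - 1) = 1."""
    total = sum(a + n for a, n in counts)
    return sqrt(chi_square(counts) / total)


def rank_dimensions(dimensions):
    """Rank valid dimensions by Cramer's V (1 = best); invalid dimensions get 0."""
    valid = [name for name, (_, is_valid) in dimensions.items() if is_valid]
    valid.sort(key=lambda name: cramers_v(dimensions[name][0]), reverse=True)
    ranks = {name: 0 for name in dimensions}
    for position, name in enumerate(valid, start=1):
        ranks[name] = position
    return ranks


# Hypothetical dimensions: (per-category counts, passed validation checks?)
dimensions = {
    "Income": ([(10, 90), (30, 70)], True),
    "Gender": ([(15, 85), (25, 75)], True),
    "Region": ([(20, 80), (20, 80)], False),  # e.g. no usable split
}

ranks = rank_dimensions(dimensions)
for name, (counts, _) in dimensions.items():
    print(name, round(chi_square(counts), 4), round(cramers_v(counts), 4), ranks[name])
```

In this example Income has the largest differences in analysis % between its categories, so it has the highest Cramér's V and is ranked 1; Region is invalid and is given a rank of 0.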

Value level statistics

The key statistics for each value of the dimension are:

  • Branch: Which child node the value was potentially allocated to.

    For the dimension that was used, this will be the node number

    For dimensions that were investigated but not used, this will be (1) or (2)

  • Code: Value code within the dimension (e.g. [income] 01)

  • Description: Description of the dimension value (e.g. [income] “<£10k”)

  • Analysis Count

  • Non-Analysis Count

  • Base Count

  • Analysis %

  • Non-Analysis %

  • Base %

  • % of Analysis

  • % of Non-Analysis

  • % of Base

  • Index

  • Gain

  • Z-Score

See the section on Statistical Terminology for more details.
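Several of the value level columns can be derived from the Analysis Count and Base Count alone. The sketch below shows one plausible reading of them and is not taken from the product: it assumes the Base Count includes the analysis records, that Analysis % is the analysis count as a percentage of the value's base count, and that Index is the conventional profile index (% of Analysis divided by % of Base, times 100). Gain and Z-Score are left out because their definitions are not given here. The function name and the counts are hypothetical.

```python
def value_stats(rows):
    """Derive value level statistics from {code: (analysis_count, base_count)}.

    Assumptions: base counts include the analysis records, and Index is
    100 * (% of Analysis) / (% of Base).
    """
    total_analysis = sum(a for a, _ in rows.values())
    total_base = sum(b for _, b in rows.values())
    stats = {}
    for code, (analysis, base) in rows.items():
        pct_of_analysis = 100.0 * analysis / total_analysis
        pct_of_base = 100.0 * base / total_base
        stats[code] = {
            "non_analysis_count": base - analysis,
            "analysis_pct": 100.0 * analysis / base,
            "pct_of_analysis": pct_of_analysis,
            "pct_of_base": pct_of_base,
            "index": 100.0 * pct_of_analysis / pct_of_base,
        }
    return stats


# Hypothetical counts for two income bands within one node.
node_rows = {"01": (10, 100), "02": (30, 100)}

stats = value_stats(node_rows)
for code, values in stats.items():
    print(code, values)
```

Under these assumptions an Index above 100 marks a value that is over-represented in the analysis selection relative to the base, and an Index below 100 marks one that is under-represented.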