Decision Tree: How do I set the Split Validation Options?

The Split Validation Options are set on the Advanced Options tab of the Build Options menu.

The minimum and maximum allowed number of branches also affect split validation.  See Updating Split Validation Options based on Stopping Reasons.

Candidate splits are proposed for each variable according to the Algorithm Options.  See the How do I set the Algorithm Options? section.  If a candidate split fails any of the following conditions, it is said to be invalid.  A node is stopped from splitting if the candidate splits for all dimensions included in the analysis are invalid.

The Reason for Status column of the Next Splits panel provides information on why splits have been flagged as invalid.
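The stopping rule described above can be illustrated with a short Python sketch. This is not the product's code: the function name and the dictionary that maps each dimension to its Reason for Status are hypothetical, chosen only to show that a node stops when no dimension has a valid candidate split.

```python
def node_is_stopped(reason_by_dimension):
    """Return True when every dimension's candidate split is invalid.

    reason_by_dimension maps a dimension name to the reason its candidate
    split is invalid, or to None when the split is valid. This layout is
    an assumption made for illustration.
    """
    return all(reason is not None for reason in reason_by_dimension.values())


reasons = {
    "Occupation": "Insufficient categories to create minimum number of nodes",
    "Income": "Split is not significant",
    "Age": None,  # a valid candidate split, so the node can still be split
}
print(node_is_stopped(reasons))  # False
```

As soon as one dimension has a valid candidate split, the node remains eligible for splitting.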

Split Validation Conditions

Split Validation Condition

Description

Minimum Size
(new Nodes)

A split is declared invalid if any of the resulting child nodes has a base count smaller than the value specified by this validation condition.

 

Significance
(P value)

The significance of the split is assessed using a Chi Square test to measure the association between the child nodes and the analysis selection.

 

The P value shown here is taken directly from this Chi Square test.  The smaller the P value, the more significant the split.  Typically the values are so small that they appear as 0.0000, so the Significance value may be more useful.

 

The Significance figure is calculated as -log10(P value).  The higher the Significance figure, the more significant the split.  These figures are capped at 10 (equivalent to a P value of 0.0000000001, i.e. 10 decimal places).

 

Bonferroni Adjustment

The Bonferroni Adjustment reduces the P value threshold to allow for the fact that multiple dimensions are being used.  It divides the P value threshold by the number of splits being considered at a node.

 

The P value threshold is the probability of making a mistake that you are prepared to accept when deciding to use a split (i.e. the probability that you make a split when there are no real differences in the data that warrant dividing a node further).  Setting a P value of 0.05 means you want the chance of making a mistake to be no more than 5%.  This scope for error is shared out between the candidate splits, so that with 5 splits you would limit your scope for error to 1% for each split.  This is a very conservative approach, but it does mean that your overall probability of making a mistake is no more than 5% (the P value).
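The Significance and Bonferroni calculations in the table above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: all function names are hypothetical, the Chi Square test is a plain Pearson test restricted to a two-branch split (1 degree of freedom), and the text does not say whether the product applies any continuity correction.

```python
import math

SIGNIFICANCE_CAP = 10.0  # the text states that Significance is capped at 10


def chi_square_p_value_two_branches(branch_counts):
    """P value of a Pearson Chi Square test for a two-branch split.

    branch_counts holds one (analysis_count, base_count) pair per branch.
    Assumes both branches are populated and that the node contains records
    both inside and outside the analysis selection.
    """
    if len(branch_counts) != 2:
        raise ValueError("this sketch only handles two branches")
    total_base = sum(base for _, base in branch_counts)
    total_in = sum(inside for inside, _ in branch_counts)
    total_out = total_base - total_in
    statistic = 0.0
    for inside, base in branch_counts:
        outside = base - inside
        expected_in = base * total_in / total_base
        expected_out = base * total_out / total_base
        statistic += (inside - expected_in) ** 2 / expected_in
        statistic += (outside - expected_out) ** 2 / expected_out
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def significance(p_value):
    """Significance figure: -log10(P value), capped at 10."""
    if p_value <= 0.0:
        return SIGNIFICANCE_CAP
    return min(-math.log10(p_value), SIGNIFICANCE_CAP)


def bonferroni_threshold(p_threshold, candidate_splits):
    """Share the P value threshold equally between the candidate splits."""
    return p_threshold / candidate_splits


# Two branches of 100 records each, with 30 and 10 records in the analysis selection.
p_value = chi_square_p_value_two_branches([(30, 100), (10, 100)])
threshold = bonferroni_threshold(0.05, 5)  # 5% shared between 5 candidate splits
print(round(significance(p_value), 2), p_value < threshold)
```

Comparing the P value with the adjusted threshold is equivalent to comparing the Significance figure with -log10 of that threshold, which is 2 for the 1% per-split allowance used here.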

 

Updating Split Validation Options based on Stopping Reasons

The Reason for Status column of the Next Splits panel provides information on why splits have been flagged as invalid.  For each dimension, investigate the reason why the splits were invalid and follow the action suggested in the table below.

Reason Invalid

Explanation and Necessary Action

Resulting nodes are too small

At least one branch node created by this split has a base count lower than the Minimum Size (new nodes) validation condition.

 

Decrease the Minimum Size (new nodes) validation condition.

 

Split is not significant

The Significance value of this split is lower than the Significance validation condition.

 

Decrease the Significance validation condition or deactivate the use of this condition by un-ticking the box next to it.

 

Categories are too different to merge

The Multi-Way PWE strategy or Chaid strategy has resulted in more branches than the maximum allowed.  

 

Either increase the maximum number of branches or adjust the merge criteria within the PWE or Chaid Advanced Options.

 

All categories are allocated to the same branch

The Multi-Way PWE strategy or Chaid strategy has resulted in a single branch.

 

Adjust the merge criteria within the PWE or Chaid Advanced Options.

All categories are 100% pure

All the categories contain either all or no records from the analysis selection.

 

You have found a rule which perfectly identifies people who are in or not in your analysis selection.  Check whether you have used a dimension that is in fact a consequence (not a cause) of being in the analysis selection.  See the How do I choose my Dimensions? section.

Insufficient categories to create minimum number of nodes

It is not possible to create a split with the minimum number of nodes, as there are not enough populated categories in this node for this dimension.

 

This could be the result of previous splits using this dimension.  For example, if a previous split created a node based on the rule "Occupation = Student", then it will not be possible to use the Occupation dimension again in that node, because only one Occupation category is populated and a split must create at least 2 nodes.

 

The same situation could arise due to the use of a different dimension.  For example, a rule combining "Income = Nothing" and "Age < 18" could result in all people in a node being students.

All non-zero flags have same counts

For Flag Array variables, all populated categories have the same counts.  For example, the number of people in the analysis selection and in the base selection is the same for every newspaper.

 

For example, the node may contain people who read just The Times or The Guardian.  Potentially these 2 groups could be split into separate nodes.  However, if both newspapers have, say, 10% of their readers in the analysis selection, there is no basis on which to separate them.
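As a rough illustration of the first two rows of the Reason Invalid table, the Python sketch below checks a candidate split against the Minimum Size (new nodes) and Significance conditions. The function and parameter names are hypothetical, and the order in which the checks run is an assumption, since this page does not document it. Passing None for the Significance condition stands in for unticking its box.

```python
def reason_invalid(child_base_counts, split_significance, min_size,
                   min_significance=None):
    """Return the reason a candidate split is invalid, or None if it passes.

    child_base_counts: base count of each child node the split would create.
    min_significance=None models a deactivated Significance condition.
    Only two of the conditions in the table are covered by this sketch.
    """
    if any(count < min_size for count in child_base_counts):
        return "Resulting nodes are too small"
    if min_significance is not None and split_significance < min_significance:
        return "Split is not significant"
    return None


print(reason_invalid([400, 35], 4.2, min_size=50, min_significance=2.0))
print(reason_invalid([400, 35], 4.2, min_size=30, min_significance=2.0))
```

In this example, lowering the Minimum Size (new nodes) condition from 50 to 30 is enough to make the split valid, which mirrors the action suggested in the table.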