Decision Tree: How do I configure the dimensions?

Once you have chosen the dimensions, a number of options are available to influence how they are used.

These options are explained in the following sections.  For some examples of how to use the options, see Suggested Uses of Dimension Settings.

Creating and Using Dimensions

All dimensions dragged on to the dimensions tab are processed for each node, and the results are displayed in the Next Splits panel.

Dimensions can be processed to varying degrees within the Decision Tree.  Counts are always created; the "Create Split" and "Use Split" check boxes determine what processing is carried out over and above the creation of counts.

 

Description

Neither Create Split nor Use Split

Counts will be created for all dimensions that are dragged on to the dimensions tab.  Categories will not be combined to make splits.

  • You can manually group categories into branches and then select to use the split.
  • Using this option saves processing time.  Creating CHAID splits for high-cardinality variables can take considerable time.
 

Create Split

Categories are combined to make splits, but the split will not be selected automatically for use in creating child nodes.

  • You can manually select a split that has been created by right-clicking on the split.

Create Split and Use Split

Categories are combined to make splits, and the split can be selected automatically for use in creating child nodes.
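
The three combinations described above can be summarised in a small sketch. This is only an illustration of the table, not part of the product: the function name and the dictionary keys are hypothetical, and the combination of Use Split without Create Split is not covered by the table, so the sketch refuses to guess its behaviour.

```python
def describe_dimension_processing(create_split: bool, use_split: bool) -> dict:
    """Hypothetical helper: summarise the processing for a check-box combination."""
    if use_split and not create_split:
        # Not described in the table above, so no behaviour is assumed here.
        raise ValueError("Use Split without Create Split is not covered by the table")
    return {
        # Counts are created for every dimension on the dimensions tab.
        "counts_created": True,
        # Categories are only combined into splits when Create Split is ticked.
        "categories_combined": create_split,
        # A split is only eligible for automatic use when both boxes are ticked.
        "eligible_for_automatic_selection": create_split and use_split,
    }


if __name__ == "__main__":
    for create, use in [(False, False), (True, False), (True, True)]:
        print(create, use, describe_dimension_processing(create, use))
```

In every documented case a split can still be chosen by hand; the settings only control how much work is done automatically.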

Unclassifieds

There are a number of ways of handling missing values in the data:

Unclassified Handling

Description

Free-Floating

This is the default option.  

 

People with unclassified values are included in the Decision Tree.  At each split the rules created are allowed to combine the unclassified category with any of the other categories. 

 

  • This can be useful to get an initial understanding of your customers, but can be misleading if there are a lot of unclassified customers who are particularly extreme in their behaviour. 
  • A node could look very good in terms of containing a high proportion of responders, but these could all be people who are unclassified on that dimension (e.g. “Occupation = Unknown”).  This is not very helpful in describing behaviour.  Using Keep Separate would ensure that these people are in a separate branch.

Keep Separate

People with unclassified values are still included in the Decision Tree.  However, if the variable chosen for use at a split contains people with unclassified values, then a split is created that forces these people into a separate node.

 

  • It is likely that the same variable will then be chosen at the subsequent split, this time creating a rule based on the classified categories.
  • In the case where it was only the people with an unclassified value that were driving the selection of that variable, it is likely that another variable will be used for splitting once these people have been removed.
  • Note that, once people with an unclassified value in one variable (say, Town) have been isolated, they can still be split further using other variables (for example, TV region).  This is not the case for the Omit Unclassified option.

 

Omit Unclassified

All people are included in the root node, but as each variable is used in a split, people with an unclassified value for that variable are omitted from the tree.

 

  • This is similar to the Keep Separate option, except that you don’t see the unclassified side shoot, and the people in it cannot be used in further splits.

Low

People with unclassified values can only be grouped with the lowest category (e.g. the lowest income band), i.e. the category with the first code.

 

It is still possible for unclassified values to form a branch on their own if this creates a better split.

High

People with unclassified values can only be grouped with the highest category (e.g. the highest income band), i.e. the category with the last code.

 

It is still possible for unclassified values to form a branch on their own if this creates a better split.

Either End

People with unclassified values can be grouped with either the lowest or the highest category (e.g. the extreme income bands), i.e. the category with the first or last code.

 

It is still possible for unclassified values to form a branch on their own if this creates a better split.
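
As a rough illustration of the options above, the sketch below lists which classified categories the unclassified category is allowed to be grouped with under each setting. The function name and the example income bands are hypothetical; the categories are assumed to be supplied in code order, and the sketch does not reproduce how the product actually evaluates splits.

```python
def merge_partners(categories, handling):
    """Return the classified categories that 'Unclassified' may be grouped with.

    categories: classified categories in code order (first code first).
    handling:   one of the Unclassified Handling option names.
    """
    if not categories:
        raise ValueError("at least one classified category is required")
    first, last = categories[0], categories[-1]
    if handling == "Free-Floating":
        return list(categories)            # may join any category
    if handling in ("Keep Separate", "Omit Unclassified"):
        return []                          # never grouped with classified categories
    if handling == "Low":
        return [first]                     # only the category with the first code
    if handling == "High":
        return [last]                      # only the category with the last code
    if handling == "Either End":
        # dict.fromkeys removes the duplicate when there is a single category
        return list(dict.fromkeys([first, last]))
    raise ValueError(f"unknown handling option: {handling!r}")


if __name__ == "__main__":
    bands = ["Under 20k", "20k-40k", "40k-60k", "Over 60k"]  # hypothetical bands
    for option in ("Free-Floating", "Keep Separate", "Omit Unclassified",
                   "Low", "High", "Either End"):
        print(option, "->", merge_partners(bands, option))
```

An empty list means the unclassified people are never merged with a classified category: under Keep Separate they form their own branch, and under Omit Unclassified they are dropped from the tree at that split. For Low, High and Either End the unclassified category can still stand alone if that gives a better split.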

Selector Branches

There are a number of ways of creating branches based on the categories within a selector.  The use of these options typically depends on whether the selector is ordinal or nominal.

  • A nominal variable is one where there is no meaningful order to the categories as they appear in the selector (e.g. Occupation).  A variable such as Region is nominal, despite the fact that there is a “spatial order”, since there is no significance in the order within the selector variable (i.e. based on their codes).

  • An ordinal variable is one where there is some order to the categories as they are presented in the selector (e.g. Income bands).

The options for handling selectors are as follows:

Selector Branches

Description

Mixed Categories

This is the default option.

 

Branches can contain any combination of categories. 

 

  • This is the most sensible option for use with nominal variables.
  • It is helpful to use this option initially for ordinal variables.  This allows the grouping of categories within branches to be driven statistically.  You can then see what the natural groupings are, and apply one of the other options to enforce the sequence of categories.

Ordered

This ensures that each node always contains consecutive categories.

 

For Binary (Mean PWE Split) this is done by forming the two branches from categories either side of the mean PWE value.  This was previously known as the "Single Cut" option.

Cyclic

This again imposes a restriction on the split to keep consecutive categories together, but allows the lowest and highest categories within the data to be joined together.

 

For example, the highest and lowest income bands across the whole data could be grouped together.
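
The difference between the three options can be sketched as a check on whether a proposed branch is allowed. This is an illustration only, under the assumption that categories are identified by their zero-based position in code order; the function name is hypothetical and the sketch says nothing about how the product scores or chooses splits.

```python
def is_allowed_branch(positions, n_categories, mode):
    """Check whether a branch made of the given category positions is permitted.

    positions:    zero-based positions of the categories in the branch.
    n_categories: total number of categories in the selector.
    mode:         "Mixed Categories", "Ordered" or "Cyclic".
    """
    pos = sorted(set(positions))
    if not pos or pos[0] < 0 or pos[-1] >= n_categories:
        raise ValueError("positions must be non-empty and within range")
    if mode == "Mixed Categories":
        return True                        # any combination of categories
    # Count the breaks between neighbouring positions in the branch.
    gaps = sum(1 for a, b in zip(pos, pos[1:]) if b - a > 1)
    if mode == "Ordered":
        return gaps == 0                   # one unbroken run of categories
    if mode == "Cyclic":
        # One break is acceptable when the branch contains both the lowest
        # and the highest category, because the run wraps around the ends.
        wraps = pos[0] == 0 and pos[-1] == n_categories - 1
        return gaps == 0 or (gaps == 1 and wraps)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Five income bands, positions 0 (lowest) to 4 (highest).
    for branch in ([1, 2, 3], [0, 4], [0, 1, 4], [0, 2]):
        print(branch,
              is_allowed_branch(branch, 5, "Ordered"),
              is_allowed_branch(branch, 5, "Cyclic"))
```

A branch holding only the lowest and highest bands is rejected under Ordered but accepted under Cyclic, which matches the income band example above.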