Decision Tree: Suggested Uses of Dimension Settings
Various options control how Dimensions are used. See How do I configure the dimensions? for details on how to set them. Below are some suggestions on how you could use these options.
Create and Use Split
Calculating CHAID splits for high cardinality variables is time-consuming. You can turn off the creation of splits for these variables to save time. If such a variable looks promising, you can either create a CHAID split for a particular node or define the branches manually. See How do I manually control the build process? for more details.
Build the initial stages of a tree using high-level variables and the later stages with low-level variables.
-
For example, build the initial stages with "Use Split" switched on for high-level (low cardinality) variables such as Region or 2-digit SIC code, and with "Use Split" set to False for low-level variables such as Town or 4-digit SIC. Then grow particular sections of the tree further, using the low-level variables instead. This gives you a high-level understanding of all areas and a detailed understanding of particular areas.
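As an illustration only, the choice of which variables to leave switched on for the first pass can be sketched as a cardinality check. This is not part of the product: the function name `plan_use_split`, the `max_categories` threshold and the sample records are all hypothetical, and the result is simply a list of suggested "Use Split" values to enter by hand.

```python
def plan_use_split(rows, max_categories=50):
    """Suggest a "Use Split" value for each variable (hypothetical helper).

    rows is a list of dicts, one per record. A variable is suggested for
    the first pass (True) when it has at most max_categories distinct values.
    """
    distinct = {}
    for row in rows:
        for name, value in row.items():
            distinct.setdefault(name, set()).add(value)
    return {name: len(values) <= max_categories
            for name, values in distinct.items()}


# Made-up sample: Region is low cardinality, Town is high cardinality.
records = [
    {"Region": "North", "Town": "Leeds"},
    {"Region": "North", "Town": "York"},
    {"Region": "South", "Town": "Bath"},
    {"Region": "South", "Town": "Dover"},
]
settings = plan_use_split(records, max_categories=2)
print(settings)
```

Variables suggested as False here are the ones to bring back in when you grow particular sections of the tree further.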
Unclassifieds
Build the tree initially with Keep Separate and examine the significance of the unclassified branches using the Organic Tree.
-
You will get lots of side shoot branches on the Organic Tree, where unclassified values for each variable are separated off. The width of these will show you the volumes of unclassified data. The angle of the unclassified side shoot will show you how significantly the unclassified people differ from the rest.
-
If the unclassified branches are relatively small and of low significance, you may decide it is acceptable to include them with the bulk of the data. In this case, re-run the Decision Tree using the Free Floating option. People with, say, unknown income (but perhaps known occupation) are then considered alongside people with known income bands, so that they can contribute to the decision for a further split based on occupation. There may be a business reason why you wish to restrict the unclassifieds to be considered as low / high / either.
-
If the unclassified branches are large or of high significance, it would be dangerous to use the Free Floating option: these people are clearly different, and combining them with the rest of the customers would confuse any decisions (splits). Instead, continue with the Keep Separate option. Further splits can perhaps be made within a group of, say, unknown income, based on data that you do have for these people.
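The rule of thumb above can be written down as a small check. The sketch below is an assumption throughout: the product shows size and significance visually on the Organic Tree, whereas here size is taken as the unclassified share of records and significance is stood in for by a two-proportion z statistic. The function name `unclassified_advice` and both thresholds are hypothetical.

```python
import math


def unclassified_advice(n_unclassified, hits_unclassified, n_rest, hits_rest,
                        max_share=0.05, max_z=1.96):
    """Suggest "Free Floating" or "Keep Separate" (hypothetical helper).

    hits_* are the counts of records in the analysis selection.
    max_share and max_z are illustrative thresholds, not product defaults.
    """
    if n_unclassified <= 0 or n_rest <= 0:
        raise ValueError("both groups must contain at least one record")
    total = n_unclassified + n_rest
    share = n_unclassified / total  # how large the unclassified branch is
    rate_unclassified = hits_unclassified / n_unclassified
    rate_rest = hits_rest / n_rest
    pooled = (hits_unclassified + hits_rest) / total
    std_err = math.sqrt(pooled * (1 - pooled)
                        * (1 / n_unclassified + 1 / n_rest))
    # Stand-in for "significance": how far the two rates differ.
    z = abs(rate_unclassified - rate_rest) / std_err if std_err > 0 else 0.0
    if share <= max_share and z <= max_z:
        return "Free Floating"  # small and similar to the rest
    return "Keep Separate"      # large, or clearly different


small_similar = unclassified_advice(30, 3, 970, 100)
large_different = unclassified_advice(400, 120, 600, 60)
print(small_similar, large_different)
```

A business reason for treating unclassifieds in a particular way would override any suggestion of this kind.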
Selector Branches
For nominal variables, use the Mixed Categories option. The other options are available in case an ordinal variable has been set up as nominal.
For ordinal variables, use the Mixed Categories option to first explore which variables contain any information relating to the analysis selection.
-
You can then examine the splits proposed for the variables that are used in the Decision Tree. If the categories within the nodes almost conform to contiguous groups, you can re-run the tree imposing the Ordered or Cyclic options.
-
For example, if a node is formed using Mixed Categories with “YOB = 1931, 1932, 1933, 1935, 1936…” (i.e. 1934 is not in this node), you could conclude that the data confirms the natural assumption that people of a similar age behave in a similar way, and that the absence of 1934 from this group is perhaps just an anomaly. In this case you would re-run the Decision Tree using, say, Mid-Point Split.
-
On the other hand, you might have expected consecutive income bands to be grouped contiguously, but find that the natural grouping using Mixed Categories is so mixed that you decide not to enforce this preconception.
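One way to judge how close a Mixed Categories node is to a contiguous group is to list the categories missing between its lowest and highest members. The sketch below is illustrative only: `gaps_in_group` is a hypothetical helper, the year range and income bands are made up, and only the five years listed in the example above are used. It treats the variable as ordered and does not handle the cyclic case.

```python
def gaps_in_group(ordered_categories, group):
    """Return the categories missing from inside a node's range.

    ordered_categories lists every category in its natural order;
    group lists the categories that ended up in one node.
    """
    positions = sorted(ordered_categories.index(c) for c in group)
    first, last = positions[0], positions[-1]
    members = set(group)
    return [c for c in ordered_categories[first:last + 1]
            if c not in members]


# Year of birth: one missing year, so the node is nearly contiguous.
years = list(range(1930, 1941))
yob_gaps = gaps_in_group(years, [1931, 1932, 1933, 1935, 1936])

# Hypothetical income bands: the node skips half of its own range.
bands = ["A", "B", "C", "D", "E", "F"]
income_gaps = gaps_in_group(bands, ["A", "C", "F"])

print(yob_gaps, income_gaps)
```

A short list of gaps relative to the size of the node suggests that imposing an order is reasonable; a long one suggests leaving the variable as Mixed Categories.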