Decision Tree: How are Flag Arrays used?

A Flag Array variable contains a list of related Yes/No categories, any number of which can be set to Yes or No. For example, the Newspapers flag array in the holidays data contains a list of newspapers that a person might read. Any number of the newspaper category flags could be set to "Yes", to indicate which newspapers a person reads.
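As a rough illustration only (this is not the product's internal storage format), a flag array can be thought of as a set of "Yes" categories held against each person. The category list, the person identifiers and the `flags` helper below are all hypothetical names invented for this sketch.

```python
# Hypothetical categories of a Newspapers flag array (illustrative only).
NEWSPAPERS = ["Record", "Daily Mirror", "The Times"]

# Each person holds the set of categories flagged "Yes".
# An empty set means every flag is "No".
people = {
    "person_a": {"Record", "Daily Mirror"},
    "person_b": {"The Times"},
    "person_c": set(),
}


def flags(person_id):
    """Return the full Yes/No flag list for one person as a dict."""
    yes = people[person_id]
    return {paper: paper in yes for paper in NEWSPAPERS}


print(flags("person_a"))
```

Any number of flags can be "Yes" for the same person, including none at all, which is what distinguishes a flag array from a single-choice category variable.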

When a flag-array variable is used in a Decision Tree, each split uses only one category at a time. In the example below, the first split is based on the "Record" category and creates the following nodes:

Node 1 Includes: Record = people who do read the Record.

Node 2 Does not include: Record = people who do not read the Record.

In the same example, the second split is based on the "Daily Mirror" category and creates the following nodes:

Node 3 Includes: Daily Mirror = people who do read the Daily Record and the Daily Mirror.

Node 4 Does not include: Daily Mirror = people who read the Daily Record but do not read the Daily Mirror.

The combined rule for node 4 is displayed in the Focus Node panel. This node contains people who do read the Daily Record but do not read the Daily Mirror. They could of course read any number of other newspapers. Later splits on the same variable may well refine this rule.
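The combined rule for node 4 can be read as a simple logical test on the set of newspapers a person reads. The sketch below is only an illustration of that rule; the function name `in_node_4` and the category label "The Times" are hypothetical.

```python
def in_node_4(papers_read):
    """Combined rule: reads the Record AND does not read the Daily Mirror."""
    return "Record" in papers_read and "Daily Mirror" not in papers_read


# Other newspapers do not affect the rule.
print(in_node_4({"Record", "The Times"}))
print(in_node_4({"Record", "Daily Mirror"}))
```

Each further split on the same variable would add one more "in" or "not in" condition to this test.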

How are the Flag Array splits created?

The Flag Array splits are always created in the same way, regardless of whether PWE or CHAID is selected. This is because at present only binary Flag Array splits are supported.

The split simply finds the category which is most strongly related to the analysis selection (has the highest or lowest PWE). An "Include" or "Exclude" split is then created.
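The idea of picking the single most strongly related category can be sketched as follows. This is a minimal illustration under stated assumptions: the score used here is a stand-in (the difference in selection rate between people with and without the flag), not the product's PWE calculation, and the function name and sample data are hypothetical.

```python
def best_flag_category(records, categories):
    """Pick the category whose flag is most strongly associated with the selection.

    records: list of (set_of_yes_categories, in_selection) pairs.
    The score is a stand-in (difference in selection rates), NOT the product's PWE.
    """
    best = None
    for cat in categories:
        with_flag = [sel for yes, sel in records if cat in yes]
        without_flag = [sel for yes, sel in records if cat not in yes]
        if not with_flag or not without_flag:
            continue  # a binary split needs people on both sides
        score = sum(with_flag) / len(with_flag) - sum(without_flag) / len(without_flag)
        # Keep the strongest association, whether positive or negative.
        if best is None or abs(score) > abs(best[1]):
            best = (cat, score)
    return best


records = [
    ({"Record"}, True),
    ({"Record", "Daily Mirror"}, True),
    ({"Daily Mirror"}, False),
    (set(), False),
]
print(best_flag_category(records, ["Record", "Daily Mirror"]))
```

Comparing absolute scores mirrors the "highest or lowest" wording above: a category whose readers are strongly under-represented in the selection separates the two groups just as well as one whose readers are strongly over-represented.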

Why is my flag array variable not used?

The flag-array variable is considered along with the other dimensions, and may not be used if the other dimensions create better splits. How this decision is made is explained in "How do I set the Algorithm Options?" and "How do I set the Stopping Conditions?". If the flag array is used in the split, then the category which best identifies the analysis selection is used.

In the case of the Newspaper flag-array variable, the splits created are of very low significance and have a low Z-score. In other words, what newspaper you read has little bearing on whether you have been to Sweden. Building a tree with the default stopping conditions actually prevents the Newspaper variable from being used.
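To give a feel for why a weak split is rejected, the sketch below computes a standard two-proportion Z-score and compares it with a threshold. The exact test and default threshold used by the product are not stated here, so both the formula choice and the value 1.96 are assumptions made for illustration, and the counts are invented.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Standard pooled two-proportion Z-score for the two sides of a binary split."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se if se > 0 else 0.0


def split_is_valid(z, min_abs_z=1.96):
    """Hypothetical validation rule: keep the split only if |Z| reaches the threshold."""
    return abs(z) >= min_abs_z


# Invented counts: 52 of 1000 readers and 50 of 1000 non-readers are in the selection.
z_weak = two_proportion_z(52, 1000, 50, 1000)
print(round(z_weak, 2), split_is_valid(z_weak))
```

With selection rates this close on both sides of the split, the Z-score stays well below the threshold, so a rule of this kind would discard the split unless the threshold is lowered or the check is switched off.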

To get the Decision Tree to use the Newspaper variable, you will have to turn off or relax the split validation conditions. Note that a tree just using the Newspaper variable is very poor at predicting people who go to Sweden. This is shown below by the fact that the Organic Tree is very straight and grey.