7. Controlling selection structure and tables

In this part of the tutorial we’ll learn how to control the way a selection is structured and joined together, and how to change which table the selection counts records from.

Joining more than two selections together

We can use the &, | and ~ operators we saw in the previous part to build more complex selections made of multiple parts:

>>> low_price_deals_audience = eligible_for_discount & ~high_earners

We could have even written the eligible_for_discount variable directly using its constituent parts:

>>> low_price_deals_audience = (student | under_21) & ~high_earners

Here we’ve had to use parentheses so that student | under_21 gets combined first, before the resulting selection is combined with ~high_earners. Without parentheses, Python’s operator precedence rules mean this would be calculated as:

>>> not_what_we_meant = student | (under_21 & ~high_earners)  # since & takes precedence over |

Even if operator precedence means that your selection would resolve as intended without parentheses, it’s probably sensible to include them to be explicit and improve readability:

>>> either_of_two_pairs = (student & smiths) | (high_earners & under_21)

The & operator ‘binds’ more tightly than the | operator, so this selection would resolve in the same way even if the parentheses were omitted. But including them makes the logic easier to read and enables you to communicate your intent to anyone else reading your code.
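If we want to check how Python will group an expression before running it, one option is to compare syntax trees with the standard library’s ast module. The sketch below is independent of py-apteco: the helper name same_structure is our own, and the variable names inside the strings are only parsed, never evaluated.

```python
import ast


def same_structure(expr_a, expr_b):
    """Return True if two expressions parse to the same syntax tree.

    Parentheses are not stored in the tree, so two expressions compare
    equal exactly when Python groups their operators in the same way.
    """
    tree_a = ast.dump(ast.parse(expr_a, mode="eval"))
    tree_b = ast.dump(ast.parse(expr_b, mode="eval"))
    return tree_a == tree_b


# & binds more tightly than |, so these two are grouped identically
print(same_structure(
    "student | under_21 & ~high_earners",
    "student | (under_21 & ~high_earners)",
))  # True

# ...which is not the grouping we wanted for low_price_deals_audience
print(same_structure(
    "student | under_21 & ~high_earners",
    "(student | under_21) & ~high_earners",
))  # False
```

This only inspects the structure of the expression, so it works without a FastStats session.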

Determining the resolve table

The resolve table simply refers to the FastStats system table that the selection is set to count records from. As mentioned in the previous part, this is determined automatically according to the following rules:

for a selection consisting of a single ‘clause’, it is the table that the variable in this clause belongs to
for a selection made from a combination of several ‘clauses’, it is the table of the first (i.e. left-most) clause
normal Python operator precedence applies, including expressions in parentheses being evaluated first

The following code demonstrates this:

>>> student = people["Occupation"] == "4"
>>> usa = bookings["Destination"] == "38"
>>> student.table_name
'People'
>>> usa.table_name
'Bookings'
>>> (student & usa).table_name
'People'
>>> (usa & student).table_name
'Bookings'

So if your selection uses elements from different tables, make sure you begin it with an element from the table you want to use for the overall count.
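To see how the left-most rule interacts with parentheses, here is a toy model in plain Python. ToyClause is a hypothetical stand-in written for this illustration, not a py-apteco class, and it assumes nothing about how the library is implemented internally. It only mimics the rule that a combined selection takes the table of its left-hand operand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyClause:
    """Hypothetical stand-in for a selection clause (not part of py-apteco)."""

    label: str
    table_name: str

    def __and__(self, other):
        # The combined selection takes the table of the left-hand operand
        return ToyClause(f"({self.label} AND {other.label})", self.table_name)

    def __or__(self, other):
        # Same rule for OR: the left-hand operand decides the table
        return ToyClause(f"({self.label} OR {other.label})", self.table_name)


student = ToyClause("student", "People")
usa = ToyClause("usa", "Bookings")
at_least_2k = ToyClause("at_least_2k", "Bookings")

print((student & usa).table_name)                  # People
print((usa & student).table_name)                  # Bookings
# Parentheses are evaluated first, but the left-most clause still decides
print((usa & (student | at_least_2k)).table_name)  # Bookings
print(((student | usa) & at_least_2k).table_name)  # People
```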

Changing the resolve table

We can also manually change the resolve table of a selection using the multiplication operator * with the table:

>>> been_to_usa = people * usa
>>> been_to_usa.count()
273879
>>> been_to_usa.table_name
'People'

Note

The table that we ‘multiply by’ needs to be a Table object. Using the string of the table name will not work.

Again, we can use parentheses to group different parts of the selection to control how it is structured:

>>> audience_1 = people * (usa & at_least_2k)
>>> audience_1.count()
12746
>>> audience_2 = (people * usa) & at_least_2k
>>> audience_2.count()
20098

audience_1 selects people who have any Booking to the USA costing at least £2000 — the usa and at_least_2k clauses are grouped together with parentheses, so a person must have a single Booking matching both criteria to be selected.

It is equivalent to this selection in FastStats:

audience_2 selects people who have any Booking to the USA, and have any Booking costing at least £2000. The difference is that the conditions don’t have to apply to the same booking — the person’s Booking to the USA could cost less than £2000, as long as they have another Booking that does cost at least that much.
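The difference between the two audiences can be reproduced on a small made-up dataset in plain Python. The names and figures below are invented for illustration and have no connection to the FastStats system used in this tutorial. The point is only to contrast "one booking matches both criteria" with "each criterion is matched by some booking".

```python
# Invented sample data: each person has a list of bookings
bookings_by_person = {
    "Alice": [{"destination": "USA", "cost": 2500}],
    "Bob": [
        {"destination": "USA", "cost": 800},
        {"destination": "France", "cost": 3000},
    ],
    "Carol": [{"destination": "USA", "cost": 900}],
}


def is_usa(booking):
    return booking["destination"] == "USA"


def is_at_least_2k(booking):
    return booking["cost"] >= 2000


# Like people * (usa & at_least_2k): a single booking must satisfy both
audience_1 = {
    person
    for person, bookings in bookings_by_person.items()
    if any(is_usa(b) and is_at_least_2k(b) for b in bookings)
}

# Like (people * usa) & at_least_2k: the criteria may be met by different bookings
audience_2 = {
    person
    for person, bookings in bookings_by_person.items()
    if any(is_usa(b) for b in bookings) and any(is_at_least_2k(b) for b in bookings)
}

print(sorted(audience_1))  # ['Alice']
print(sorted(audience_2))  # ['Alice', 'Bob']
```

Bob appears only in the second audience: his USA booking cost less than £2000, but he has another booking that cost more. Everyone in the first audience is necessarily in the second, which is consistent with audience_2 having the larger count.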

Here’s the equivalent selection in FastStats:

A worked example

Let’s just remind ourselves what audience_2 looked like and work through step-by-step how it’s evaluated, according to the rules above.

>>> audience_2 = (people * usa) & at_least_2k

(people * usa) is evaluated first because it’s in parentheses. usa is a condition on the Bookings table, but using the * operator on it with the People table manually changes it to resolve to the People table.

We could re-write this part as a new variable:

>>> people_to_usa = people * usa
>>> audience_2 = people_to_usa & at_least_2k

Working left-to-right, people_to_usa is clearly a selection on the People table so at_least_2k is automatically adjusted to resolve to the People table to match. We could re-write this behaviour explicitly as:

>>> audience_2 = people_to_usa & (people * at_least_2k)

If we ‘unzip’ people_to_usa to its original form, we get:

>>> audience_2 = (people * usa) & (people * at_least_2k)

which mirrors the structure of the equivalent selection in FastStats shown above.
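The steps of this worked example can be mimicked with a toy model. ToyTable and ToyClause below are hypothetical classes written for this illustration only. They are not the py-apteco classes and make no claim about how the library works internally. They simply encode the two rules described above: * re-resolves a clause to the given table, and & adjusts its right-hand operand to match the table of the left-hand one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyClause:
    """Hypothetical stand-in for a selection (not part of py-apteco)."""

    label: str
    table_name: str

    def resolve_to(self, table_name):
        # Wrap the clause so that it counts records on another table
        if table_name == self.table_name:
            return self
        return ToyClause(f"{table_name}[{self.label}]", table_name)

    def __and__(self, other):
        # The right-hand side is adjusted to the left-hand side's table
        other = other.resolve_to(self.table_name)
        return ToyClause(f"({self.label} AND {other.label})", self.table_name)


@dataclass(frozen=True)
class ToyTable:
    """Hypothetical stand-in for a table (not part of py-apteco)."""

    name: str

    def __mul__(self, clause):
        # table * clause changes the table the clause resolves to
        return clause.resolve_to(self.name)


people = ToyTable("People")
usa = ToyClause("usa", "Bookings")
at_least_2k = ToyClause("at_least_2k", "Bookings")

implicit = (people * usa) & at_least_2k
explicit = (people * usa) & (people * at_least_2k)
grouped = people * (usa & at_least_2k)

print(implicit.label)        # (People[usa] AND People[at_least_2k])
print(implicit == explicit)  # True
print(grouped.label)         # People[(usa AND at_least_2k)]
```

In this model the implicit and explicit forms of audience_2 come out identical, while the grouped form (the shape of audience_1) keeps both Bookings conditions inside a single change of table.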

So far we’ve only interacted with our data by counting selections, but in the next part we’ll learn how we can use data grids to create an export of our data, which we can analyse further.