9. Creating Cubes

A cube is a numeric analysis showing data broken down by one or more dimensions. In this part of the tutorial we’ll learn how to create cubes, and how to control what data is included in them.

Creating a cube

We can use the cube() method on a table to create a cube analysing records from that table. We pass it a list of the FastStats variables we want to use as cube dimensions:

>>> occupation = people["Occupation"]
>>> income = people["Income"]
>>> gender = people["Gender"]
>>> cube = people.cube([occupation, income, gender])

If we don’t specify any measures, the count measure is used by default.

To work with the data we can convert it to a Pandas DataFrame using the to_df() method:

>>> df = cube.to_df()
>>> df
                                People
Occupation    Income   Gender
Manual Worker <£10k    Female    15624
                       Male       5321
                       Unknown       5
              £10-20k  Female    43051
                       Male       5992
                                ...
Retired       £90-100k Male          2
                       Unknown       0
              £100k+   Female        0
                       Male          1
                       Unknown       0

[330 rows x 1 columns]

The DataFrame uses a MultiIndex, with each dimension as a different level in the index:

>>> df.index
MultiIndex([('Manual Worker',    '<£10k',  'Female'),
        ('Manual Worker',    '<£10k',    'Male'),
        ('Manual Worker',    '<£10k', 'Unknown'),
        ('Manual Worker',  '£10-20k',  'Female'),
        ('Manual Worker',  '£10-20k',    'Male'),
        ('Manual Worker',  '£10-20k', 'Unknown'),
        ('Manual Worker',  '£20-30k',  'Female'),
        ('Manual Worker',  '£20-30k',    'Male'),
        ('Manual Worker',  '£20-30k', 'Unknown'),
        ('Manual Worker',  '£30-40k',  'Female'),
        ...
        (      'Retired',  '£70-80k', 'Unknown'),
        (      'Retired',  '£80-90k',  'Female'),
        (      'Retired',  '£80-90k',    'Male'),
        (      'Retired',  '£80-90k', 'Unknown'),
        (      'Retired', '£90-100k',  'Female'),
        (      'Retired', '£90-100k',    'Male'),
        (      'Retired', '£90-100k', 'Unknown'),
        (      'Retired',   '£100k+',  'Female'),
        (      'Retired',   '£100k+',    'Male'),
        (      'Retired',   '£100k+', 'Unknown')],
       names=['Occupation', 'Income', 'Gender'], length=330)

The DataFrame has one column which is the default count measure, named after the entity it is counting. Since the data values represent a count, they are all integers:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 330 entries, ('Manual Worker', '<£10k', 'Female') to ('Retired', '£100k+', 'Unknown')
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   People  330 non-null    int64
dtypes: int64(1)
memory usage: 4.6+ KB

More generally, the DataFrame will have one column for each measure included on the cube.

Tip

This initial structure of the DataFrame returned by the to_df() method is very similar to a Tree in the FastStats application.

Pivoting the Gender dimension returns a DataFrame with Gender as a new level of column labels. You may find this arrangement easier to read and more similar to how a cube would be presented in FastStats:

>>> df.unstack(level=2)
                    People
Gender              Female  Male Unknown
Occupation Income
Director   <£10k      1279   832       4
           £10-20k    4116  2139       9
           £100k+        2     1       1
           £20-30k    1267  1050       5
           £30-40k    1591  1940      46
                    ...   ...     ...
Unemployed £50-60k     277    69      89
           £60-70k      58     8      17
           £70-80k      27     5       5
           £80-90k       0     0       0
           £90-100k      1     0       0

[110 rows x 3 columns]

As well as Selector variables, you can also use banded Date or Datetime variables as cube dimensions. These bandings are accessible as attributes on the variable:

>>> dest = bookings["Destination"]
>>> booking_date = bookings["Booking Date"]
>>> booking_year_by_dest = bookings.cube([booking_date.year, dest])
>>> year_dest_df = booking_year_by_dest.to_df()
>>> year_dest_df.unstack(level=0)
                    Bookings
Booking Date (Year)     2016   2017    2018   2019    2020    2021
Destination
Australia              28747  32766   68271  46576   54519   51840
Denmark                  158    169     355    244     299     285
France                 51950  60253  125838  85903  102543   97987
Germany                47031  53691  109053  75120   89974   86391
Greece                 11220  12999   27145  18558   21797   20373
Italy                   3557   4272    8704   6065    7200    7083
Jamaica                   19     14      42     32      38      34
Kuwait                  2373   2765    5546   3907    4550    4407
Latvia                    69     78     168    107     102     121
Mali                     104    122     215    136     209     188
Mongolia                  24     17      38     15      32      28
Namibia                  250    254     553    385     476     431
New Zealand               93    109     227    113     178     159
Portugal                2222   2514    5124   3560    4184    4028
Senegal                  111    109     238    183     159     181
Sierra Leone             466    613    1081    787     938     902
South Africa              88    119     208    164     199     178
Sweden                  2460   2915    6120   4112    4929    4671
United States          58691  64546  134846  92003  108080  103676

The supported bandings are: year, quarter, month, day

Banded Date variables use a PeriodIndex in the outputted Dataframe:

>>> travel_date = bookings["Travel Date"]
>>> travel_month_df = bookings.cube([travel_date.month]).to_df()
>>> travel_month_df
                     Bookings
Travel Date (Month)
2016-01                   410
2016-02                  4059
2016-03                 12233
2016-04                 20175
2016-05                 25601
                       ...
2022-08                     0
2022-09                     0
2022-10                     0
2022-11                     0
2022-12                     0

[84 rows x 1 columns]
>>> travel_month_df.index
PeriodIndex(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06',
             '2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12',
             '2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
             '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
             '2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',
             '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12',
             '2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06',
             '2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12',
             '2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
             '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12',
             '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06',
             '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12',
             '2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12'],
            dtype='period[M]', name='Travel Date (Month)', freq='M')

As in FastStats, you can also use variables from ‘related’ tables, that is, ancestor or descendant tables (including the direct parent and children):

>>> region = households["Region"]
>>> continent = bookings["Continent"]
>>> mixed_tables_cube = people.cube([region, occupation, continent])
>>> mixed_tables_cube.to_df()
                                           People
Region          Occupation    Continent
North           Manual Worker Australasia     101
                              Europe         4158
                              Americas         96
                              Asia              2
                              Africa            2
                                           ...
Channel Islands Retired       Australasia       7
                              Europe           18
                              Americas         10
                              Asia              2
                              Africa            2

[700 rows x 1 columns]

Note

This is the table structure for the tables from the demo Holidays system being used here:

Households
 └─ People
     └─ Bookings

Our cube is showing a count of People records, but the dimensions include the Region variable from the parent Households table, as well as the Continent variable from the child Bookings table.

Controlling DataFrame output

By default, the DataFrame returned by to_df() doesn’t include unclassified values or totals. But these can be included by setting the unclassified and totals arguments (respectively) for to_df() to True:

>>> gender_cube = people.cube([gender])
>>> gender_cube.to_df()
         People
Gender
Female   764796
Male     378567
Unknown   13190
>>> gender_cube.to_df(unclassified=True, totals=True)
               People
Gender
Unclassified        0
Female         764796
Male           378567
Unknown         13190
TOTAL         1156553

If any of these extra values are included for a cube with a Banded Date dimension, the corresponding DataFrame index cannot be converted to a PeriodIndex so will be left as a plain Index:

>>> booking_year_cube = bookings.cube([booking_date.year])
>>> booking_year_cube.to_df().index
PeriodIndex(['2016', '2017', '2018', '2019', '2020', '2021'], dtype='period[A-DEC]', name='Booking Date (Year)', freq='A-DEC')
>>> booking_year_cube.to_df(unclassified=True).index
Index(['Unclassified', '2016', '2017', '2018', '2019', '2020', '2021'], dtype='object', name='Booking Date (Year)')

To stop the index being automatically converted to a PeriodIndex even when these extra values aren’t included, set the convert_index argument to False:

>>> booking_year_cube.to_df(convert_index=False).index
Index(['2016', '2017', '2018', '2019', '2020', '2021'], dtype='object', name='Booking Date (Year)')

Applying a selection

You can filter the records used in the analysis for the cube by applying a selection to it using the selection parameter:

>>> student = people["Occupation"] == "4"
>>> student_cube = people.cube([occupation, region, continent], selection=student)
>>> student_cube.to_df()
                                           People
Occupation    Region          Continent
Manual Worker North           Australasia       0
                              Europe            0
                              Americas          0
                              Asia              0
                              Africa            0
                                           ...
Retired       Channel Islands Australasia       0
                              Europe            0
                              Americas          0
                              Asia              0
                              Africa            0

[700 rows x 1 columns]

You can also build the cube directly from the selection using the cube() method:

>>> student_cube = student.cube([occupation, region, continent])
>>> student_cube.to_df()
                                           People
Occupation    Region          Continent
Manual Worker North           Australasia       0
                              Europe            0
                              Americas          0
                              Asia              0
                              Africa            0
                                           ...
Retired       Channel Islands Australasia       0
                              Europe            0
                              Americas          0
                              Asia              0
                              Africa            0

[700 rows x 1 columns]

(this cube is identical to the previous one)

When creating a data grid from a selection, the table for the data grid is automatically set as the selection’s table.

Applying a selection from a different table

Just as in FastStats, you can apply a selection based on a table different from the one used in your cube:

>>> scotland = region == "10"
>>> scotland_cube = people.cube([occupation, region, continent], selection=scotland)
>>> scotland_cube.to_df()
                                           People
Occupation    Region          Continent
Manual Worker North           Australasia       0
                              Europe            0
                              Americas          0
                              Asia              0
                              Africa            0
                                           ...
Retired       Channel Islands Australasia       0
                              Europe            0
                              Americas          0
                              Asia              0
                              Africa            0

[700 rows x 1 columns]

Here, scotland is a selection on the Households table, but we are applying it to a People cube.

Note

The selection’s table must be the same as the cube, or a ‘related’ table – either an ancestor or descendant.

Again, you can build the cube from the selection itself, but this time you will need to use the table parameter to set the cube to the desired table:

>>> scotland_cube = scotland.cube([occupation, region, continent], table=people)
>>> scotland_cube.to_df()
                                           People
Occupation    Region          Continent
Manual Worker North           Australasia       0
                              Europe            0
                              Americas          0
                              Asia              0
                              Africa            0
                                           ...
Retired       Channel Islands Australasia       0
                              Europe            0
                              Americas          0
                              Asia              0
                              Africa            0

[700 rows x 1 columns]

(this cube is identical to the previous one)

Adding different measures

The cube() method has a measures argument, which you can use to specify which measures appear on the cube. If this is not set, the default count measure is used automatically.

To display a count for another table pass its Table object in the measures list:

>>> dest_cube = bookings.cube([dest], measures=[bookings, people, households])
>>> dest_cube.to_df()
               Bookings  People  Households
Destination
Australia        282719  274857      194412
New Zealand         879     870         862
Denmark            1510    1497        1483
France           524474  522932      359194
Germany          461260  214163      212004
Greece           112092  109564       93208
Italy             36881   25958       25612
Latvia              645     551         542
Portugal          21632   21230       20723
Sweden            25207   25175       24055
Jamaica             179     178         175
United States    561842  273879      195757
Kuwait            23548   23388       22349
Mongolia            154     154         151
Mali                974     968         961
Namibia            2349    2276        2232
Senegal             981     777         766
Sierra Leone       4787    4778        4677
South Africa        956     954         945

(if you want the original count as well, you will now need to include that explicitly in the measures list)

Each measure is returned as a separate column on the DataFrame.

You can also add variable statistics which can be imported from the apteco.statistics module:

>>> from apteco.statistics import Sum, Mean
>>> cost = bookings["Cost"]
>>> profit = bookings["Profit"]
>>> finance_cube = bookings.cube([dest, gender], measures=[Mean(cost), Sum(profit)])
>>> finance_df = finance_cube.to_df()
>>> finance_df.unstack(1)
              Mean(Cost)                  Sum(Profit)
Gender            Female    Male Unknown       Female         Male     Unknown
Destination
Australia         641.33  642.80  641.63   4969609.27   9289372.33   338911.11
Denmark           636.56  552.11  673.20     77696.43      7172.64    11138.75
France            644.34  645.12     NaN  28028207.53  12743135.05        0.00
Germany           643.13  688.66  739.48  41693688.99     49751.52     2396.78
Greece            644.01  643.99  652.30   3969380.08   7381942.89   259677.61
Italy             638.27  634.36  658.72   3362435.25    502193.53   402007.87
Jamaica           597.36  468.56  807.87      1770.04       290.77       15.51
Kuwait            650.15  645.65  659.84    298712.85    554525.03    67014.61
Latvia            693.92  618.84  712.12     11177.66     19882.72     2592.66
Mali              596.55  661.02  771.80     18392.77     40416.47     4736.45
Mongolia          579.34  636.74  677.02      4193.65      7141.89     2213.89
Namibia           704.16  638.48  542.54     31981.71     57459.89     3439.89
New Zealand       633.99  641.36  625.06     17982.15     33741.93     4577.55
Portugal          636.00  650.95  647.02    890531.92    554791.39   229951.78
Senegal           656.55  658.48  518.11     60939.12     28611.82      946.95
Sierra Leone      614.09  652.47  597.09    165479.74    353473.04    33155.92
South Africa      682.05  694.72  748.89     41997.60     84264.98     6911.85
Sweden            641.35  644.49  652.47   1232007.22   2296749.68    57618.93
United States     638.56  640.62  632.76  25279636.01  46492373.17  7632493.15

See also

For more information on working with DataFrames with a MultiIndex, see the user guide in the official Pandas documentation.

That’s the end of the tutorial! Hopefully you’re now equipped with the knowledge you need to get started on building your own selections, data grids and cubes. Check out the rest of the documentation for more guidance, and if you have any questions don’t hesitate to get in touch with Apteco Support (support@apteco.com) who will be happy to help.