9. Creating Cubes¶
A cube is a numeric analysis showing data broken down by one or more dimensions. In this part of the tutorial we’ll learn how to create cubes, and how to control what data is included in them.
Creating a cube¶
We can use the cube()
method on a table
to create a cube analysing records from that table.
We pass it a list of the FastStats variables we want to use as cube dimensions:
>>> occupation = people["Occupation"]
>>> income = people["Income"]
>>> gender = people["Gender"]
>>> cube = people.cube([occupation, income, gender])
If we don’t specify any measures, the count measure is used by default.
To work with the data we can convert it to a Pandas DataFrame
using the to_df()
method:
>>> df = cube.to_df()
>>> df
People
Occupation Income Gender
Manual Worker <£10k Female 15624
Male 5321
Unknown 5
£10-20k Female 43051
Male 5992
...
Retired £90-100k Male 2
Unknown 0
£100k+ Female 0
Male 1
Unknown 0
[330 rows x 1 columns]
The DataFrame uses a MultiIndex, with each dimension as a different level in the index:
>>> df.index
MultiIndex([('Manual Worker', '<£10k', 'Female'),
('Manual Worker', '<£10k', 'Male'),
('Manual Worker', '<£10k', 'Unknown'),
('Manual Worker', '£10-20k', 'Female'),
('Manual Worker', '£10-20k', 'Male'),
('Manual Worker', '£10-20k', 'Unknown'),
('Manual Worker', '£20-30k', 'Female'),
('Manual Worker', '£20-30k', 'Male'),
('Manual Worker', '£20-30k', 'Unknown'),
('Manual Worker', '£30-40k', 'Female'),
...
( 'Retired', '£70-80k', 'Unknown'),
( 'Retired', '£80-90k', 'Female'),
( 'Retired', '£80-90k', 'Male'),
( 'Retired', '£80-90k', 'Unknown'),
( 'Retired', '£90-100k', 'Female'),
( 'Retired', '£90-100k', 'Male'),
( 'Retired', '£90-100k', 'Unknown'),
( 'Retired', '£100k+', 'Female'),
( 'Retired', '£100k+', 'Male'),
( 'Retired', '£100k+', 'Unknown')],
names=['Occupation', 'Income', 'Gender'], length=330)
The DataFrame has one column which is the default count measure, named after the entity it is counting. Since the data values represent a count, they are all integers:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 330 entries, ('Manual Worker', '<£10k', 'Female') to ('Retired', '£100k+', 'Unknown')
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 People 330 non-null int64
dtypes: int64(1)
memory usage: 4.6+ KB
More generally, the DataFrame will have one column for each measure included on the cube.
Tip
This initial structure of the DataFrame returned by the to_df()
method
is very similar to a Tree in the FastStats application.
Pivoting the Gender
dimension returns a DataFrame
with Gender
as a new level of column labels.
You may find this arrangement easier to read
and more similar to how a cube would be presented in FastStats:
>>> df.unstack(level=2)
People
Gender Female Male Unknown
Occupation Income
Director <£10k 1279 832 4
£10-20k 4116 2139 9
£100k+ 2 1 1
£20-30k 1267 1050 5
£30-40k 1591 1940 46
... ... ...
Unemployed £50-60k 277 69 89
£60-70k 58 8 17
£70-80k 27 5 5
£80-90k 0 0 0
£90-100k 1 0 0
[110 rows x 3 columns]
As well as Selector variables, you can also use banded Date or Datetime variables as cube dimensions. These bandings are accessible as attributes on the variable:
>>> dest = bookings["Destination"]
>>> booking_date = bookings["Booking Date"]
>>> booking_year_by_dest = bookings.cube([booking_date.year, dest])
>>> year_dest_df = booking_year_by_dest.to_df()
>>> year_dest_df.unstack(level=0)
Bookings
Booking Date (Year) 2016 2017 2018 2019 2020 2021
Destination
Australia 28747 32766 68271 46576 54519 51840
Denmark 158 169 355 244 299 285
France 51950 60253 125838 85903 102543 97987
Germany 47031 53691 109053 75120 89974 86391
Greece 11220 12999 27145 18558 21797 20373
Italy 3557 4272 8704 6065 7200 7083
Jamaica 19 14 42 32 38 34
Kuwait 2373 2765 5546 3907 4550 4407
Latvia 69 78 168 107 102 121
Mali 104 122 215 136 209 188
Mongolia 24 17 38 15 32 28
Namibia 250 254 553 385 476 431
New Zealand 93 109 227 113 178 159
Portugal 2222 2514 5124 3560 4184 4028
Senegal 111 109 238 183 159 181
Sierra Leone 466 613 1081 787 938 902
South Africa 88 119 208 164 199 178
Sweden 2460 2915 6120 4112 4929 4671
United States 58691 64546 134846 92003 108080 103676
The supported bandings are: year
, quarter
, month
, day
Banded Date variables use a PeriodIndex in the outputted Dataframe:
>>> travel_date = bookings["Travel Date"]
>>> travel_month_df = bookings.cube([travel_date.month]).to_df()
>>> travel_month_df
Bookings
Travel Date (Month)
2016-01 410
2016-02 4059
2016-03 12233
2016-04 20175
2016-05 25601
...
2022-08 0
2022-09 0
2022-10 0
2022-11 0
2022-12 0
[84 rows x 1 columns]
>>> travel_month_df.index
PeriodIndex(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06',
'2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12',
'2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
'2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
'2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',
'2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12',
'2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06',
'2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12',
'2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
'2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12',
'2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06',
'2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12',
'2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
'2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12'],
dtype='period[M]', name='Travel Date (Month)', freq='M')
As in FastStats, you can also use variables from ‘related’ tables, that is, ancestor or descendant tables (including the direct parent and children):
>>> region = households["Region"]
>>> continent = bookings["Continent"]
>>> mixed_tables_cube = people.cube([region, occupation, continent])
>>> mixed_tables_cube.to_df()
People
Region Occupation Continent
North Manual Worker Australasia 101
Europe 4158
Americas 96
Asia 2
Africa 2
...
Channel Islands Retired Australasia 7
Europe 18
Americas 10
Asia 2
Africa 2
[700 rows x 1 columns]
Note
This is the table structure for the tables from the demo Holidays system being used here:
Households
└─ People
└─ Bookings
Our cube is showing a count of People records, but the dimensions include the Region variable from the parent Households table, as well as the Continent variable from the child Bookings table.
Controlling DataFrame output¶
By default, the DataFrame returned by to_df()
doesn’t include
unclassified values or totals.
But these can be included by setting the unclassified
and totals
arguments
(respectively) for to_df()
to True
:
>>> gender_cube = people.cube([gender])
>>> gender_cube.to_df()
People
Gender
Female 764796
Male 378567
Unknown 13190
>>> gender_cube.to_df(unclassified=True, totals=True)
People
Gender
Unclassified 0
Female 764796
Male 378567
Unknown 13190
TOTAL 1156553
If any of these extra values are included for a cube with a Banded Date dimension,
the corresponding DataFrame index cannot be converted to a PeriodIndex
so will be left as a plain Index
:
>>> booking_year_cube = bookings.cube([booking_date.year])
>>> booking_year_cube.to_df().index
PeriodIndex(['2016', '2017', '2018', '2019', '2020', '2021'], dtype='period[A-DEC]', name='Booking Date (Year)', freq='A-DEC')
>>> booking_year_cube.to_df(unclassified=True).index
Index(['Unclassified', '2016', '2017', '2018', '2019', '2020', '2021'], dtype='object', name='Booking Date (Year)')
To stop the index being automatically converted to a PeriodIndex
even when these extra values aren’t included,
set the convert_index
argument to False
:
>>> booking_year_cube.to_df(convert_index=False).index
Index(['2016', '2017', '2018', '2019', '2020', '2021'], dtype='object', name='Booking Date (Year)')
Applying a selection¶
You can filter the records used in the analysis for the cube by applying a selection to it using the selection parameter:
>>> student = people["Occupation"] == "4"
>>> student_cube = people.cube([occupation, region, continent], selection=student)
>>> student_cube.to_df()
People
Occupation Region Continent
Manual Worker North Australasia 0
Europe 0
Americas 0
Asia 0
Africa 0
...
Retired Channel Islands Australasia 0
Europe 0
Americas 0
Asia 0
Africa 0
[700 rows x 1 columns]
You can also build the cube directly
from the selection using the cube()
method:
>>> student_cube = student.cube([occupation, region, continent])
>>> student_cube.to_df()
People
Occupation Region Continent
Manual Worker North Australasia 0
Europe 0
Americas 0
Asia 0
Africa 0
...
Retired Channel Islands Australasia 0
Europe 0
Americas 0
Asia 0
Africa 0
[700 rows x 1 columns]
(this cube is identical to the previous one)
When creating a data grid from a selection, the table for the data grid is automatically set as the selection’s table.
Applying a selection from a different table¶
Just as in FastStats, you can apply a selection based on a table different from the one used in your cube:
>>> scotland = region == "10"
>>> scotland_cube = people.cube([occupation, region, continent], selection=scotland)
>>> scotland_cube.to_df()
People
Occupation Region Continent
Manual Worker North Australasia 0
Europe 0
Americas 0
Asia 0
Africa 0
...
Retired Channel Islands Australasia 0
Europe 0
Americas 0
Asia 0
Africa 0
[700 rows x 1 columns]
Here, scotland
is a selection on the Households table,
but we are applying it to a People cube.
Note
The selection’s table must be the same as the cube, or a ‘related’ table – either an ancestor or descendant.
Again, you can build the cube from the selection itself, but this time you will need to use the table parameter to set the cube to the desired table:
>>> scotland_cube = scotland.cube([occupation, region, continent], table=people)
>>> scotland_cube.to_df()
People
Occupation Region Continent
Manual Worker North Australasia 0
Europe 0
Americas 0
Asia 0
Africa 0
...
Retired Channel Islands Australasia 0
Europe 0
Americas 0
Asia 0
Africa 0
[700 rows x 1 columns]
(this cube is identical to the previous one)
Adding different measures¶
The cube()
method has a measures argument,
which you can use to specify which measures appear on the cube.
If this is not set, the default count measure is used automatically.
To display a count for another table pass its Table
object
in the measures list:
>>> dest_cube = bookings.cube([dest], measures=[bookings, people, households])
>>> dest_cube.to_df()
Bookings People Households
Destination
Australia 282719 274857 194412
New Zealand 879 870 862
Denmark 1510 1497 1483
France 524474 522932 359194
Germany 461260 214163 212004
Greece 112092 109564 93208
Italy 36881 25958 25612
Latvia 645 551 542
Portugal 21632 21230 20723
Sweden 25207 25175 24055
Jamaica 179 178 175
United States 561842 273879 195757
Kuwait 23548 23388 22349
Mongolia 154 154 151
Mali 974 968 961
Namibia 2349 2276 2232
Senegal 981 777 766
Sierra Leone 4787 4778 4677
South Africa 956 954 945
(if you want the original count as well, you will now need to include that explicitly in the measures list)
Each measure is returned as a separate column on the DataFrame.
You can also add variable statistics
which can be imported from the apteco.statistics
module:
>>> from apteco.statistics import Sum, Mean
>>> cost = bookings["Cost"]
>>> profit = bookings["Profit"]
>>> finance_cube = bookings.cube([dest, gender], measures=[Mean(cost), Sum(profit)])
>>> finance_df = finance_cube.to_df()
>>> finance_df.unstack(1)
Mean(Cost) Sum(Profit)
Gender Female Male Unknown Female Male Unknown
Destination
Australia 641.33 642.80 641.63 4969609.27 9289372.33 338911.11
Denmark 636.56 552.11 673.20 77696.43 7172.64 11138.75
France 644.34 645.12 NaN 28028207.53 12743135.05 0.00
Germany 643.13 688.66 739.48 41693688.99 49751.52 2396.78
Greece 644.01 643.99 652.30 3969380.08 7381942.89 259677.61
Italy 638.27 634.36 658.72 3362435.25 502193.53 402007.87
Jamaica 597.36 468.56 807.87 1770.04 290.77 15.51
Kuwait 650.15 645.65 659.84 298712.85 554525.03 67014.61
Latvia 693.92 618.84 712.12 11177.66 19882.72 2592.66
Mali 596.55 661.02 771.80 18392.77 40416.47 4736.45
Mongolia 579.34 636.74 677.02 4193.65 7141.89 2213.89
Namibia 704.16 638.48 542.54 31981.71 57459.89 3439.89
New Zealand 633.99 641.36 625.06 17982.15 33741.93 4577.55
Portugal 636.00 650.95 647.02 890531.92 554791.39 229951.78
Senegal 656.55 658.48 518.11 60939.12 28611.82 946.95
Sierra Leone 614.09 652.47 597.09 165479.74 353473.04 33155.92
South Africa 682.05 694.72 748.89 41997.60 84264.98 6911.85
Sweden 641.35 644.49 652.47 1232007.22 2296749.68 57618.93
United States 638.56 640.62 632.76 25279636.01 46492373.17 7632493.15
See also
For more information on working with DataFrames with a MultiIndex, see the user guide in the official Pandas documentation.
That’s the end of the tutorial! Hopefully you’re now equipped with the knowledge you need to get started on building your own selections, data grids and cubes. Check out the rest of the documentation for more guidance, and if you have any questions don’t hesitate to get in touch with Apteco Support (support@apteco.com) who will be happy to help.