Cube¶
Introduction¶
The Cube
object corresponds to a FastStats Cube
and is a multi-dimensional numeric analysis of FastStats data.
The cube is specified by:
a resolve table – statistics on the cube are measures of records in this table
a set of dimensions – these are the fields over which the data is being analysed
a set of measures – these define the statistics displayed on the cube to analyse the data
a base selection (optional) – this filters records in the table to only include in the analysis ones that match the given criteria
Note
Currently, the cube dimensions must be Selector variables or banded Date variables. Other variable types and more complex types like expressions or selections are not yet supported.
Basic use¶
Setting up variables:
>>> people = my_session.tables["People"]
>>> occupation = people["Occupation"]
>>> income = people["Income"]
>>> gender = people["Gender"]
Creating a cube:
>>> cube = people.cube([occupation, income, gender])
Converting to a Pandas DataFrame
:
>>> df = cube.to_df()
>>> df.head(10)
People
Occupation Income Gender
Manual Worker <£10k Female 15624
Male 5321
Unknown 5
£10-20k Female 43051
Male 5992
Unknown 25
£20-30k Female 1498
Male 649
Unknown 14
£30-40k Female 675
Pivoting the Occupation
dimension to make it easier to read:
>>> df.unstack(level=0)
People ...
Occupation Director Manager ... Student Unemployed
Income Gender ...
<£10k Female 1279 4649 ... 28002 21385
Male 832 2926 ... 14296 8386
Unknown 4 16 ... 10 155
£10-20k Female 4116 16665 ... 39462 17230
Male 2139 9123 ... 17917 4532
Unknown 9 47 ... 25 368
£100k+ Female 2 1 ... 2 0
Male 1 0 ... 3 0
Unknown 1 0 ... 1 0
£20-30k Female 1267 6238 ... 6669 5747
Male 1050 5315 ... 5274 1345
Unknown 5 45 ... 22 236
£30-40k Female 1591 6621 ... 5690 3117
Male 1940 9713 ... 6345 1049
Unknown 46 140 ... 63 519
£40-50k Female 265 965 ... 587 262
Male 518 1800 ... 943 115
Unknown 22 58 ... 29 110
£50-60k Female 336 806 ... 425 277
Male 607 1677 ... 692 69
Unknown 47 88 ... 64 89
£60-70k Female 40 112 ... 54 58
Male 96 220 ... 95 8
Unknown 11 16 ... 17 17
£70-80k Female 44 96 ... 42 27
Male 102 179 ... 63 5
Unknown 12 22 ... 15 5
£80-90k Female 11 11 ... 3 0
Male 14 13 ... 16 0
Unknown 4 3 ... 5 0
£90-100k Female 1 0 ... 1 1
Male 11 7 ... 4 0
Unknown 3 6 ... 9 0
[33 rows x 10 columns]
Using banded Date variables as dimensions:
>>> bookings = my_session.tables["Bookings"]
>>> booking_date = bookings["Booking Date"]
>>> travel_date = bookings["Travel Date"]
>>> booking_travel_cube = bookings.cube([booking_date.month, travel_date.year])
>>> booking_travel_df = booking_travel_cube.to_df()
>>> booking_travel_df.unstack(level=1)
Bookings
Travel Date (Year) 2016 2017 2018 2019 2020 2021 2022
Booking Date (Month)
2016-01 2668 0 0 0 0 0 0
2016-02 24482 0 0 0 0 0 0
2016-03 28435 0 0 0 0 0 0
2016-04 24360 0 0 0 0 0 0
2016-05 22757 0 0 0 0 0 0
... ... ... ... ... ... ...
2021-08 0 0 0 0 0 39880 0
2021-09 0 0 0 0 0 34380 407
2021-10 0 0 0 0 0 21537 6824
2021-11 0 0 0 0 0 12078 14308
2021-12 0 0 0 0 0 2938 16737
[72 rows x 7 columns]
Using a base selection to filter the records:
>>> student = occupation == "4"
>>> student_cube = student.cube([occupation, income, gender])
>>> student_df = student_cube.to_df()
>>> student_df.head()
People
Occupation Income Gender
Manual Worker <£10k Female 0
Male 0
Unknown 0
£10-20k Female 0
Male 0
Selecting only cells where Occupation
is Student,
and pivoting Gender
dimension:
>>> student_df.loc["Student"].unstack(level=1)
People
Gender Female Male Unknown
Income
<£10k 28002 14296 10
£10-20k 39462 17917 25
£100k+ 2 3 1
£20-30k 6669 5274 22
£30-40k 5690 6345 63
£40-50k 587 943 29
£50-60k 425 692 64
£60-70k 54 95 17
£70-80k 42 63 15
£80-90k 3 16 5
£90-100k 1 4 9
Using a base selection from a different table:
>>> households = my_session.tables["Households"]
>>> region = households["hoRegion"]
>>> manchester = region == "13"
>>> manc_cube = manchester.cube([occupation, region, gender], table=people)
>>> manc_df = manc_cube.to_df()
>>> manc_df.loc["Manager"].unstack(level=1)
People
Gender Female Male Unknown
Region
Channel Islands 0 0 0
East Anglia 0 0 0
East Midlands 0 0 0
Greater Manchester 1554 1539 26
North 0 0 0
North West (Excluding Gtr Manchester) 0 0 0
Northern Ireland 0 0 0
Scotland 0 0 0
South East (Inside M25 ) 0 0 0
South East (Outside M25 ) 0 0 0
South West 0 0 0
Wales 0 0 0
West Midlands 0 0 0
Yorkshire and Humber 0 0 0
Adding different measures:
>>> from apteco.statistics import Sum, Mean
>>> cost = bookings["Cost"]
>>> profit = bookings["Profit"]
>>> finance_cube = bookings.cube([dest, gender], measures=[Mean(cost), Sum(profit)])
>>> finance_df = finance_cube.to_df()
>>> finance_df.unstack(1)
Mean(Cost) Sum(Profit)
Gender Female Male Unknown Female Male Unknown
Destination
Australia 641.33 642.80 641.63 4969609.27 9289372.33 338911.11
Denmark 636.56 552.11 673.20 77696.43 7172.64 11138.75
France 644.34 645.12 NaN 28028207.53 12743135.05 0.00
Germany 643.13 688.66 739.48 41693688.99 49751.52 2396.78
Greece 644.01 643.99 652.30 3969380.08 7381942.89 259677.61
Italy 638.27 634.36 658.72 3362435.25 502193.53 402007.87
Jamaica 597.36 468.56 807.87 1770.04 290.77 15.51
Kuwait 650.15 645.65 659.84 298712.85 554525.03 67014.61
Latvia 693.92 618.84 712.12 11177.66 19882.72 2592.66
Mali 596.55 661.02 771.80 18392.77 40416.47 4736.45
Mongolia 579.34 636.74 677.02 4193.65 7141.89 2213.89
Namibia 704.16 638.48 542.54 31981.71 57459.89 3439.89
New Zealand 633.99 641.36 625.06 17982.15 33741.93 4577.55
Portugal 636.00 650.95 647.02 890531.92 554791.39 229951.78
Senegal 656.55 658.48 518.11 60939.12 28611.82 946.95
Sierra Leone 614.09 652.47 597.09 165479.74 353473.04 33155.92
South Africa 682.05 694.72 748.89 41997.60 84264.98 6911.85
Sweden 641.35 644.49 652.47 1232007.22 2296749.68 57618.93
United States 638.56 640.62 632.76 25279636.01 46492373.17 7632493.15
API reference¶
Cube creation and conversion¶
-
class
apteco.cube.
Cube
(dimensions, measures=None, selection=None, table=None, *, session=None)¶ Create a cube.
Tip
The
cube()
methods on tables and selections are wrappers around this class. It is recommended to prefer those over instantiating this class directly, as they generally provide a simpler interface.- Parameters
dimensions (list[Variable]) – Variables to use as dimensions in the cube. These must be from table or from a ‘related’ table – either an ancestor or descendant.
measures (list) – Statistics to display in the cube. These must be from table or from a ‘related’ table – either an ancestor or descendant. If measures is None, the count measure of the cube’s resolve table will be used by default.
selection (Clause) – Optional base selection to apply to the cube. The table of this selection must be table or a ‘related’ table – either an ancestor or descendant.
table (Table) – Resolve table of the cube. This table’s records are used in the analysis for the cube, e.g. the default count measure is a count of records from this table.
session (Session) – Current Apteco API session.
As well as being related to table, the following restrictions apply to dimensions and measures:
All dimensions must be from tables related to each other, except in the case of a ‘cross cube’ when dimensions may be from unrelated tables, as long as these are all descendants of table.
Each measure’s table must be related to each dimension’s table. In the case of a ‘cross cube’, all measures must be from table or one of its ancestors.
At least one of selection or table must be given:
If only selection is given, then table will be set to the resolve table of the selection.
If both are given and the resolve table of selection isn’t table, then the records used in the cube are determined by mapping the selection to the required table by applying ANY/THE logic as necessary. This matches the behaviour when applying an underlying selection to a cube in the FastStats application. The mapping described here happens in the FastStats data engine and does not change the selection on the
Cube
.
Tip
The following two cubes are equivalent:
>>> cube1 = Cube( ... dimensions, ... selection=manchester, ... table=people, ... session=my_session, ... ) >>> cube2 = Cube( ... dimensions, ... selection=(people * manchester), ... session=my_session, ... )
They both return a cube counting people from households in the Greater Manchester region.
Note
The raw cube data is fetched from the Apteco API when the
Cube
object is initialised. It is held on the object in the_data
attribute as a Numpyarray
but this is not considered public, and so to work with the data you should convert it to your desired output format. The format currently supported is a PandasDataFrame
, via theto_df()
method.-
to_df
(unclassified=False, totals=False, no_trans=False, convert_index=None)¶ Return the cube as a Pandas
DataFrame
. This is configured such that:the dimensions form the index. If multiple dimensions are given, this is a
MultiIndex
with each level corresponding to a dimension.there is one column for each measure.
- Parameters
unclassified (bool) – Whether to include unclassified rows in the DataFrame. Default is False.
totals (bool) – Whether to include totals rows in the DataFrame. Default is False.
no_trans (bool) – Whether to include rows counting records with no transactions; applicable when at least one dimension belongs to a child table. Included for forwards-compatibility, but not currently implemented. Must be left as False.
convert_index (bool) – Whether to convert the index to the corresponding ‘natural’ Pandas index type, or leave as a plain
Index
with strings as labels. Conversion isn’t possible if unclassified or totals is True. Default behaviour is to convert if possible.
Tip
The structure of the DataFrame returned by the
to_df()
method is very similar to a Tree in the FastStats application.See also
For more details on working with a Pandas DataFrame with a MultiIndex, see the user guide in the official Pandas documentation.
Dimensions¶
This section lists the various objects that can be applied as dimensions on a cube.
It also details their behaviour when the cube is transformed into a pandas DataFrame
under the to_df()
method.
Selector variables¶
Selector variables can be used directly as cube dimensions, though this doesn’t include any selector sub-types, such as Array or Flag Array variables. (The exception to this is Date and DateTime variables, which can be used as dimensions when a banding is applied – see the section below.)
Conversion to a pandas DataFrame:
The index is left as a standard pandas
Index
.The index labels are the dimension category descriptions.
The index name is the variable description.
Banded Date variables¶
Date and DateTime variables can be used as cube dimensions
when banded up to a particular time period.
These bandings are accessed via attributes on the DateVariable
or DateTimeVariable
object.
The following bandings are currently supported:
DateVariable.day
DateVariable.month
DateVariable.quarter
DateVariable.year
Conversion to a pandas DataFrame:
The default index conversion is to a pandas
PeriodIndex
with the corresponding frequency (see table below).If not converted, the index labels are datetime strings (see table table below).
The index name is of the form ‘<Variable description> (<banding>)’.
Banding |
Frequency string |
Examples (unconverted string labels) |
|
---|---|---|---|
|
Day |
|
|
|
Month |
|
|
|
Quarter |
|
|
|
Year |
|
|
Statistics¶
Two types of statistics are currently supported as cube measures: table counts and variable statistics.
Table counts¶
These can be specified by passing a Table object in the measures list, and will return a count of the records from that Table. If measures is None, the count of records from the resolve table will be used by default.
Variable statistics¶
These are summary statistics over a given variable and can be specified
using the classes available in the apteco.statistics
module.
The statistics all have the same signature:
-
class
apteco.statistics.
Statistic
(operand, *, label=None)¶ Create a variable statistic.
- Parameters
operand (Variable) – Variable over which to apply the statistic. Most statistics can only be calculated over numeric variables, but some also accept selector variables. See details below for specific restrictions.
label (str, optional) – Descriptive name for this statistic. Used as the column label for this statistic on the DataFrame returned by
to_df()
.
Selector or Numeric variable¶
These statistics accept either a selector or numeric variable as the operand:
-
class
apteco.statistics.
CountMode
¶ The number of records which take the modal value of the variable.
-
class
apteco.statistics.
CountDistinct
¶ The number of distinct values of the variable.
Numeric variable¶
These statistics accept a numeric variable as the operand:
-
class
apteco.statistics.
Sum
¶ The sum of values of the variable.
-
class
apteco.statistics.
Min
¶ The minimum value of the variable.
-
class
apteco.statistics.
Max
¶ The maximum value of the variable.
-
class
apteco.statistics.
Populated
¶ The number of records for which the variable has a (non-missing) value.
-
class
apteco.statistics.
Mode
¶ The mode (most common) value of the variable.
-
class
apteco.statistics.
Mean
¶ The mean value of the variable.
-
class
apteco.statistics.
StdDev
¶ The standard deviation of the variable.
-
class
apteco.statistics.
Variance
¶ The variance of the variable.
-
class
apteco.statistics.
Median
¶ The median value of the variable.
-
class
apteco.statistics.
LowerQuartile
¶ The lower quartile of the variable.
-
class
apteco.statistics.
UpperQuartile
¶ The upper quartile of the variable.
-
class
apteco.statistics.
InterQuartileRange
¶ The inter-quartile range of the variable.