**************
Creating Cubes
**************

A cube is a numeric analysis showing data broken down by one or more dimensions.
In this part of the tutorial we'll learn how to create cubes,
and how to control what data is included in them.

Creating a cube
===============

We can use the :meth:`cube` method on a table to create a cube
analysing records from that table.
We pass it a list of the FastStats variables we want to use as cube dimensions::

    >>> occupation = people["Occupation"]
    >>> income = people["Income"]
    >>> gender = people["Gender"]
    >>> cube = people.cube([occupation, income, gender])

If we don't specify any measures, the `count` measure is used by default.

To work with the data we can convert it to a Pandas :class:`DataFrame`
using the :meth:`to_df()` method::

    >>> df = cube.to_df()
    >>> df
                                     People
    Occupation    Income   Gender
    Manual Worker <£10k    Female     15624
                           Male        5321
                           Unknown        5
                  £10-20k  Female     43051
                           Male        5992
    ...                                 ...
    Retired       £90-100k Male           2
                           Unknown        0
                  £100k+   Female         0
                           Male           1
                           Unknown        0

    [330 rows x 1 columns]

The DataFrame uses a
`MultiIndex <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html>`_,
with each dimension as a different level in the index::

    >>> df.index
    MultiIndex([('Manual Worker',    '<£10k',  'Female'),
                ('Manual Worker',    '<£10k',    'Male'),
                ('Manual Worker',    '<£10k', 'Unknown'),
                ('Manual Worker',  '£10-20k',  'Female'),
                ('Manual Worker',  '£10-20k',    'Male'),
                ('Manual Worker',  '£10-20k', 'Unknown'),
                ('Manual Worker',  '£20-30k',  'Female'),
                ('Manual Worker',  '£20-30k',    'Male'),
                ('Manual Worker',  '£20-30k', 'Unknown'),
                ('Manual Worker',  '£30-40k',  'Female'),
                ...
                (      'Retired',  '£70-80k', 'Unknown'),
                (      'Retired',  '£80-90k',  'Female'),
                (      'Retired',  '£80-90k',    'Male'),
                (      'Retired',  '£80-90k', 'Unknown'),
                (      'Retired', '£90-100k',  'Female'),
                (      'Retired', '£90-100k',    'Male'),
                (      'Retired', '£90-100k', 'Unknown'),
                (      'Retired',   '£100k+',  'Female'),
                (      'Retired',   '£100k+',    'Male'),
                (      'Retired',   '£100k+', 'Unknown')],
               names=['Occupation', 'Income', 'Gender'], length=330)

The DataFrame has a single column for the default count measure,
named after the entity it is counting.
Since the data values represent a count, they are all integers::

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 330 entries, ('Manual Worker', '<£10k', 'Female') to ('Retired', '£100k+', 'Unknown')
    Data columns (total 1 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   People  330 non-null    int64
    dtypes: int64(1)
    memory usage: 4.6+ KB

More generally, the DataFrame will have one column for each measure included on the cube.

.. tip::
    The initial structure of the DataFrame returned by the :meth:`to_df()` method
    is very similar to a *Tree* in the FastStats application.

Pivoting the ``Gender`` dimension using :meth:`unstack` returns a :class:`DataFrame`
with ``Gender`` as a new level of column labels.
You may find this arrangement easier to read,
and more similar to how a cube would be presented in FastStats::

    >>> df.unstack(level=2)
                        People
    Gender              Female  Male Unknown
    Occupation Income
    Director   <£10k      1279   832       4
               £10-20k    4116  2139       9
               £100k+        2     1       1
               £20-30k    1267  1050       5
               £30-40k    1591  1940      46
    ...                    ...   ...     ...
    Unemployed £50-60k     277    69      89
               £60-70k      58     8      17
               £70-80k      27     5       5
               £80-90k       0     0       0
               £90-100k      1     0       0

    [110 rows x 3 columns]
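Since the result is an ordinary pandas DataFrame, you can also look up individual figures
directly from the MultiIndex. As a small pandas-only sketch (nothing here is specific to the
Apteco API), picking out a single cell of the cube above::

    >>> # plain pandas: a full index tuple plus the column name returns one value
    >>> df.loc[("Manual Worker", "<£10k", "Female"), "People"]
    15624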
As well as Selector variables,
you can also use banded Date or Datetime variables as cube dimensions.
These bandings are accessible as attributes on the variable::

    >>> dest = bookings["Destination"]
    >>> booking_date = bookings["Booking Date"]
    >>> booking_year_by_dest = bookings.cube([booking_date.year, dest])
    >>> year_dest_df = booking_year_by_dest.to_df()
    >>> year_dest_df.unstack(level=0)
                        Bookings
    Booking Date (Year)     2016   2017    2018   2019    2020    2021
    Destination
    Australia              28747  32766   68271  46576   54519   51840
    Denmark                  158    169     355    244     299     285
    France                 51950  60253  125838  85903  102543   97987
    Germany                47031  53691  109053  75120   89974   86391
    Greece                 11220  12999   27145  18558   21797   20373
    Italy                   3557   4272    8704   6065    7200    7083
    Jamaica                   19     14      42     32      38      34
    Kuwait                  2373   2765    5546   3907    4550    4407
    Latvia                    69     78     168    107     102     121
    Mali                     104    122     215    136     209     188
    Mongolia                  24     17      38     15      32      28
    Namibia                  250    254     553    385     476     431
    New Zealand               93    109     227    113     178     159
    Portugal                2222   2514    5124   3560    4184    4028
    Senegal                  111    109     238    183     159     181
    Sierra Leone             466    613    1081    787     938     902
    South Africa              88    119     208    164     199     178
    Sweden                  2460   2915    6120   4112    4929    4671
    United States          58691  64546  134846  92003  108080  103676

The supported bandings are: ``year``, ``quarter``, ``month``, ``day``.

Banded Date variables use a
`PeriodIndex <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#periodindex-and-period-range>`_
in the output DataFrame::

    >>> travel_date = bookings["Travel Date"]
    >>> travel_month_df = bookings.cube([travel_date.month]).to_df()
    >>> travel_month_df
                         Bookings
    Travel Date (Month)
    2016-01                   410
    2016-02                  4059
    2016-03                 12233
    2016-04                 20175
    2016-05                 25601
    ...                       ...
    2022-08                     0
    2022-09                     0
    2022-10                     0
    2022-11                     0
    2022-12                     0

    [84 rows x 1 columns]

    >>> travel_month_df.index
    PeriodIndex(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06',
                 '2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12',
                 '2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
                 '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
                 '2018-01', '2018-02', '2018-03', '2018-04', '2018-05', '2018-06',
                 '2018-07', '2018-08', '2018-09', '2018-10', '2018-11', '2018-12',
                 '2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06',
                 '2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12',
                 '2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
                 '2020-07', '2020-08', '2020-09', '2020-10', '2020-11', '2020-12',
                 '2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06',
                 '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12',
                 '2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
                 '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12'],
                dtype='period[M]', name='Travel Date (Month)', freq='M')
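Because the index is a genuine :class:`PeriodIndex`, ordinary pandas date handling works on
the result. As a small pandas-only sketch (the ``bookings_2016`` name is just illustrative),
you could pick out the bookings for a single year from ``travel_month_df``::

    >>> # plain pandas: filter the monthly cube down to 2016 using the index's year attribute
    >>> bookings_2016 = travel_month_df[travel_month_df.index.year == 2016]
    >>> bookings_2016.head()
                         Bookings
    Travel Date (Month)
    2016-01                   410
    2016-02                  4059
    2016-03                 12233
    2016-04                 20175
    2016-05                 25601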
As in FastStats, you can also use variables from 'related' tables,
that is, ancestor or descendant tables (including the direct parent and children)::

    >>> region = households["Region"]
    >>> continent = bookings["Continent"]
    >>> mixed_tables_cube = people.cube([region, occupation, continent])
    >>> mixed_tables_cube.to_df()
                                              People
    Region          Occupation    Continent
    North           Manual Worker Australasia    101
                                  Europe        4158
                                  Americas        96
                                  Asia             2
                                  Africa           2
    ...                                          ...
    Channel Islands Retired       Australasia      7
                                  Europe          18
                                  Americas        10
                                  Asia             2
                                  Africa           2

    [700 rows x 1 columns]

.. note::
    This is the table structure for the tables
    from the demo *Holidays* system being used here::

        Households
         └─ People
             └─ Bookings

Our cube is showing a count of `People` records,
but the dimensions include the `Region` variable from the parent `Households` table,
as well as the `Continent` variable from the child `Bookings` table.

Controlling DataFrame output
============================

By default, the DataFrame returned by :meth:`to_df()`
doesn't include unclassified values or totals,
but they can be included by setting the ``unclassified`` and ``totals`` arguments
(respectively) of :meth:`to_df()` to :const:`True`::

    >>> gender_cube = people.cube([gender])
    >>> gender_cube.to_df()
              People
    Gender
    Female    764796
    Male      378567
    Unknown    13190

    >>> gender_cube.to_df(unclassified=True, totals=True)
                   People
    Gender
    Unclassified        0
    Female         764796
    Male           378567
    Unknown         13190
    TOTAL         1156553

If any of these extra values are included for a cube with a banded Date dimension,
the corresponding DataFrame index cannot be converted to a :class:`PeriodIndex`,
so it is left as a plain :class:`Index`::

    >>> booking_year_cube = bookings.cube([booking_date.year])
    >>> booking_year_cube.to_df().index
    PeriodIndex(['2016', '2017', '2018', '2019', '2020', '2021'],
                dtype='period[A-DEC]', name='Booking Date (Year)', freq='A-DEC')

    >>> booking_year_cube.to_df(unclassified=True).index
    Index(['Unclassified', '2016', '2017', '2018', '2019', '2020', '2021'],
          dtype='object', name='Booking Date (Year)')

To stop the index being automatically converted to a :class:`PeriodIndex`
even when these extra values *aren't* included,
set the ``convert_index`` argument to :const:`False`::

    >>> booking_year_cube.to_df(convert_index=False).index
    Index(['2016', '2017', '2018', '2019', '2020', '2021'],
          dtype='object', name='Booking Date (Year)')

Applying a selection
====================

You can filter the records used in the analysis for the cube
by applying a selection to it using the ``selection`` parameter::

    >>> student = people["Occupation"] == "4"
    >>> student_cube = people.cube([occupation, region, continent], selection=student)
    >>> student_cube.to_df()
                                                People
    Occupation    Region          Continent
    Manual Worker North           Australasia        0
                                  Europe             0
                                  Americas           0
                                  Asia               0
                                  Africa             0
    ...                                            ...
    Retired       Channel Islands Australasia        0
                                  Europe             0
                                  Americas           0
                                  Asia               0
                                  Africa             0

    [700 rows x 1 columns]

You can also build the cube directly from the selection using its :meth:`cube` method::

    >>> student_cube = student.cube([occupation, region, continent])
    >>> student_cube.to_df()
                                                People
    Occupation    Region          Continent
    Manual Worker North           Australasia        0
                                  Europe             0
                                  Americas           0
                                  Asia               0
                                  Africa             0
    ...                                            ...
    Retired       Channel Islands Australasia        0
                                  Europe             0
                                  Americas           0
                                  Asia               0
                                  Africa             0

    [700 rows x 1 columns]

*(this cube is identical to the previous one)*

When creating a cube from a selection,
the table for the cube is automatically set as the selection's table.
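The :meth:`to_df()` options from the previous section should apply in just the same way to a
cube built from a selection. As a rough sketch (re-using the ``student`` selection above;
the variable names are just illustrative and the output isn't shown here)::

    >>> # illustrative: count students by gender, with an overall total row included
    >>> student_gender_cube = student.cube([gender])
    >>> student_gender_df = student_gender_cube.to_df(totals=True)

``student_gender_df`` should then have the same layout as the ``gender_cube`` DataFrame above,
with a ``TOTAL`` row counting every student regardless of gender.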
Applying a selection from a different table
===========================================

Just as in FastStats, you can apply a selection based on a table
different from the one used in your cube::

    >>> scotland = region == "10"
    >>> scotland_cube = people.cube([occupation, region, continent], selection=scotland)
    >>> scotland_cube.to_df()
                                                People
    Occupation    Region          Continent
    Manual Worker North           Australasia        0
                                  Europe             0
                                  Americas           0
                                  Asia               0
                                  Africa             0
    ...                                            ...
    Retired       Channel Islands Australasia        0
                                  Europe             0
                                  Americas           0
                                  Asia               0
                                  Africa             0

    [700 rows x 1 columns]

Here, ``scotland`` is a selection on the `Households` table,
but we are applying it to a `People` cube.

.. note::
    The selection's table must be the same as the cube's table,
    or a 'related' table – either an ancestor or descendant.

Again, you can build the cube from the selection itself,
but this time you will need to use the ``table`` parameter
to set the cube to the desired table::

    >>> scotland_cube = scotland.cube([occupation, region, continent], table=people)
    >>> scotland_cube.to_df()
                                                People
    Occupation    Region          Continent
    Manual Worker North           Australasia        0
                                  Europe             0
                                  Americas           0
                                  Asia               0
                                  Africa             0
    ...                                            ...
    Retired       Channel Islands Australasia        0
                                  Europe             0
                                  Americas           0
                                  Asia               0
                                  Africa             0

    [700 rows x 1 columns]

*(this cube is identical to the previous one)*

Adding different measures
=========================

The :meth:`cube` method has a ``measures`` argument,
which you can use to specify which measures appear on the cube.
If this is not set, the default `count` measure is used automatically.

To display a count for another table, pass its :class:`Table` object in the ``measures`` list::

    >>> dest_cube = bookings.cube([dest], measures=[bookings, people, households])
    >>> dest_cube.to_df()
                   Bookings  People  Households
    Destination
    Australia        282719  274857      194412
    New Zealand         879     870         862
    Denmark            1510    1497        1483
    France           524474  522932      359194
    Germany          461260  214163      212004
    Greece           112092  109564       93208
    Italy             36881   25958       25612
    Latvia              645     551         542
    Portugal          21632   21230       20723
    Sweden            25207   25175       24055
    Jamaica             179     178         175
    United States    561842  273879      195757
    Kuwait            23548   23388       22349
    Mongolia            154     154         151
    Mali                974     968         961
    Namibia            2349    2276        2232
    Senegal             981     777         766
    Sierra Leone       4787    4778        4677
    South Africa        956     954         945

*(if you want the original count as well, you will now need to include that explicitly in the measures list)*

Each measure is returned as a separate column on the DataFrame.

You can also add variable statistics, which can be imported from the :mod:`apteco.statistics` module::

    >>> from apteco.statistics import Sum, Mean
    >>> cost = bookings["Cost"]
    >>> profit = bookings["Profit"]
    >>> finance_cube = bookings.cube([dest, gender], measures=[Mean(cost), Sum(profit)])
    >>> finance_df = finance_cube.to_df()
    >>> finance_df.unstack(1)
                  Mean(Cost)                  Sum(Profit)
    Gender            Female    Male Unknown       Female         Male     Unknown
    Destination
    Australia         641.33  642.80  641.63   4969609.27   9289372.33   338911.11
    Denmark           636.56  552.11  673.20     77696.43      7172.64    11138.75
    France            644.34  645.12     NaN  28028207.53  12743135.05        0.00
    Germany           643.13  688.66  739.48  41693688.99     49751.52      2396.78
    Greece            644.01  643.99  652.30   3969380.08   7381942.89   259677.61
    Italy             638.27  634.36  658.72   3362435.25    502193.53   402007.87
    Jamaica           597.36  468.56  807.87      1770.04       290.77       15.51
    Kuwait            650.15  645.65  659.84    298712.85    554525.03    67014.61
    Latvia            693.92  618.84  712.12     11177.66     19882.72      2592.66
    Mali              596.55  661.02  771.80     18392.77     40416.47      4736.45
    Mongolia          579.34  636.74  677.02      4193.65      7141.89      2213.89
    Namibia           704.16  638.48  542.54     31981.71     57459.89      3439.89
    New Zealand       633.99  641.36  625.06     17982.15     33741.93      4577.55
    Portugal          636.00  650.95  647.02    890531.92    554791.39   229951.78
    Senegal           656.55  658.48  518.11     60939.12     28611.82       946.95
    Sierra Leone      614.09  652.47  597.09    165479.74    353473.04    33155.92
    South Africa      682.05  694.72  748.89     41997.60     84264.98      6911.85
    Sweden            641.35  644.49  652.47   1232007.22   2296749.68     57618.93
    United States     638.56  640.62  632.76  25279636.01  46492373.17   7632493.15
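Because each measure is simply a column in the DataFrame, ordinary pandas operations work on
them too. As a small pandas-only sketch (the ``dest_df`` name is just illustrative), ranking
the destinations from ``dest_cube`` above by their number of bookings::

    >>> dest_df = dest_cube.to_df()
    >>> # plain pandas: sort by the Bookings measure and show the top five destinations
    >>> dest_df.sort_values("Bookings", ascending=False).head()
                   Bookings  People  Households
    Destination
    United States    561842  273879      195757
    France           524474  522932      359194
    Germany          461260  214163      212004
    Australia        282719  274857      194412
    Greece           112092  109564       93208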
.. seealso::
    For more information on working with DataFrames with a MultiIndex,
    see the `user guide <https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html>`_
    in the official Pandas documentation.

That's the end of the tutorial!
Hopefully you're now equipped with the knowledge you need to get started
on building your own selections, data grids and cubes.
Check out the rest of the documentation for more guidance,
and if you have any questions don't hesitate to get in touch with
Apteco Support (support@apteco.com), who will be happy to help.