8. Creating Data Grids

Exporting data into a data grid makes it available in a format that lets us easily carry out further analysis on it. In this part of the tutorial we’ll learn how to create data grids, and how to control what data is exported into them.

Creating a data grid

We can use the datagrid() method on a table to create a data grid showing records from that table. We pass it a list of the FastStats variables we want to use as data grid columns:

>>> pol_num = policies["Policy Number"]
>>> pol_date = policies["Policy Date"]
>>> premium = policies["Premium"]
>>> cover = policies["Cover"]
>>> days_to_travel = policies["Days Until Travel"]
>>> datagrid = policies.datagrid([pol_num, pol_date, premium, cover, days_to_travel])

To work with the data we can convert it to a Pandas DataFrame using the to_df() method:

>>> df = datagrid.to_df()
>>> df
    Policy Number Policy Date  Premium        Cover  Days Until Travel
0        10001265  2019-05-21    87.02   Individual                 25
1        10036397  2017-04-09   123.30   Individual                 91
2        10078565  2018-09-05   143.20   Multi Trip                 30
3        10078566  2018-10-07    29.23       Family                 60
4        10078567  2018-10-03    14.65  Single Trip                 10
..            ...         ...      ...          ...                ...
995      11192414  2022-01-06    17.83  Single Trip                 26
996      11205561  2016-07-09    10.43   Individual                 35
997      11242733  2021-02-23    33.56   Individual                 10
998      11252163  2021-03-23    13.57   Individual                 41
999      11262841  2020-07-28    11.19   Individual                 86

[1000 rows x 5 columns]

Where possible, the to_df() method will convert the raw data to the corresponding native data types:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Policy Number      1000 non-null   object
 1   Policy Date        1000 non-null   object
 2   Premium            1000 non-null   float64
 3   Cover              1000 non-null   object
 4   Days Until Travel  1000 non-null   int64
dtypes: float64(1), int64(1), object(3)
memory usage: 39.2+ KB

Pandas doesn’t currently have a column type for strings or dates (though it does have a datetime column type), so these are stored as ordinary Python objects. However if we select an individual string or date from the DataFrame we can see that it’s a native str or datetime.date:

>>> df.loc[2, "Cover"]
'Multi Trip'
>>> df.loc[3, "Policy Date"]
datetime.date(2018, 10, 7)

You can use any variable type as a data grid column, except for Array and Flag Array variables. As in FastStats, you can also use variables from parent or ancestor tables, but not child or descendant tables:

>>> occu = people["Occupation"]
>>> town = households["Town"]
>>> mixed_tables_datagrid = policies.datagrid([pol_num, premium, occu, cover, town])
>>> mixed_tables_datagrid.to_df()
    Policy Number  Premium        Cover       Occupation                 Town
0        10001265    87.02   Individual  Sales Executive             Aberdeen
1        10036397   123.30   Individual     Professional           Birmingham
2        10078565   143.20   Multi Trip  Sales Executive     Sutton Coldfield
3        10078566    29.23       Family  Sales Executive     Sutton Coldfield
4        10078567    14.65  Single Trip    Public Sector     Sutton Coldfield
..            ...      ...          ...              ...                  ...
995      11192414    17.83  Single Trip    Retail Worker            Maidstone
996      11205561    10.43   Individual    Public Sector           Buckingham
997      11242733    33.56   Individual         Director               London
998      11252163    13.57   Individual    Public Sector  Newcastle-upon-Tyne
999      11262841    11.19   Individual    Public Sector  Newcastle-upon-Tyne

[1000 rows x 5 columns]

Note

This is the table structure for the tables from the demo Holidays system being used here:

Households
 └─ People
     └─ Policies

Our data grid is displaying Policies records, but the columns include the Occupation variable from the parent People table, as well as the Town variable from the ‘grandparent’ Households table.

Controlling the number of rows

By default, the data grid will return 1000 rows, but we can adjust this using the max_rows parameter:

>>> columns = [pol_num, premium, occupation, cover, town]
>>> only_10_policies_datagrid = policies.datagrid(columns, max_rows=10)
>>> only_10_policies_datagrid.to_df()
  Policy Number  Premium       Occupation        Cover              Town
0      10001265    87.02  Sales Executive   Individual          Aberdeen
1      10036397   123.30     Professional   Individual        Birmingham
2      10078565   143.20  Sales Executive   Multi Trip  Sutton Coldfield
3      10078566    29.23  Sales Executive       Family  Sutton Coldfield
4      10078567    14.65    Public Sector  Single Trip  Sutton Coldfield
5      10090306    30.50     Professional       Family          Redditch
6      10094721    83.78    Public Sector       Family              Bath
7      10109667    25.48          Manager   Individual            Yeovil
8      10109674    33.87    Public Sector   Multi Trip            Yeovil
9      10123545    50.56  Sales Executive   Individual            Nelson

Note

As its name suggests, the number of rows in the data grid isn’t necessarily equal to max_rows – it just provides an upper limit. The number of rows will be less than this if there are fewer records available.

Applying a selection

You can filter the records shown in the data grid by applying a selection to it using the selection parameter:

>>> multi_trip = policies["Cover"] == "4"
>>> multi_trip_datagrid = policies.datagrid(columns, selection=multi_trip)
>>> multi_trip_datagrid.to_df()
    Policy Number  Premium       Occupation       Cover              Town
0        10078565   143.20  Sales Executive  Multi Trip  Sutton Coldfield
1        10109674    33.87    Public Sector  Multi Trip            Yeovil
2        10278405    56.27  Sales Executive  Multi Trip             Newry
3        10326709    21.99  Sales Executive  Multi Trip         Cambridge
4        10425299    38.50          Student  Multi Trip     South Croydon
..            ...      ...              ...         ...               ...
995      10976366    87.88  Sales Executive  Multi Trip             Esher
996      11131381    14.12       Unemployed  Multi Trip            Pudsey
997      11258013   148.09    Public Sector  Multi Trip         Gateshead
998      11519086    45.61  Sales Executive  Multi Trip      Bognor Regis
999      11783140    12.66     Professional  Multi Trip        Malmesbury

[1000 rows x 5 columns]

You can also build the data grid directly from the selection using the datagrid() method:

>>> multi_trip_datagrid = multi_trip.datagrid(columns)
>>> multi_trip_datagrid.to_df()
    Policy Number  Premium       Occupation       Cover              Town
0        10078565   143.20  Sales Executive  Multi Trip  Sutton Coldfield
1        10109674    33.87    Public Sector  Multi Trip            Yeovil
2        10278405    56.27  Sales Executive  Multi Trip             Newry
3        10326709    21.99  Sales Executive  Multi Trip         Cambridge
4        10425299    38.50          Student  Multi Trip     South Croydon
..            ...      ...              ...         ...               ...
995      10976366    87.88  Sales Executive  Multi Trip             Esher
996      11131381    14.12       Unemployed  Multi Trip            Pudsey
997      11258013   148.09    Public Sector  Multi Trip         Gateshead
998      11519086    45.61  Sales Executive  Multi Trip      Bognor Regis
999      11783140    12.66     Professional  Multi Trip        Malmesbury

[1000 rows x 5 columns]

(this data grid is identical to the previous one)

When creating a data grid from a selection, the table for the data grid is automatically set as the selection’s table.

Applying a selection from a different table

Just as in FastStats, you can apply a selection based on a table different from the one used in your data grid:

>>> student = people["Occupation"] == "4"
>>> student_policies_datagrid = policies.datagrid(columns, selection=student)
>>> student_policies_datagrid.to_df()
    Policy Number  Premium Occupation        Cover            Town
0        10152036    33.82    Student  Single Trip     Bournemouth
1        10165468    23.22    Student   Individual        Ferndown
2        10173730   117.09    Student       Family          Bolton
3        10415201    10.29    Student   Individual         Sudbury
4        10418882    30.11    Student   Individual  Clacton-on-Sea
..            ...      ...        ...          ...             ...
995      10204828    34.75    Student       Family      Eastbourne
996      10423854    12.87    Student       Family         Croydon
997      10467615    16.68    Student  Single Trip  Leamington Spa
998      10597061    34.80    Student  Single Trip     Stourbridge
999      10629056    17.79    Student   Individual          London

[1000 rows x 5 columns]

Here, student is a selection on the People table, but we are applying it to a Policies data grid.

Note

The selection’s table must be the same as the data grid, or a ‘related’ table – either an ancestor or descendant.

Again, you can build the data grid from the selection itself, but this time you will need to use the table parameter to set the data grid to the desired table:

>>> student_policies_datagrid = student.datagrid(columns, table=policies)
>>> student_policies_datagrid.to_df()
    Policy Number  Premium Occupation        Cover            Town
0        10152036    33.82    Student  Single Trip     Bournemouth
1        10165468    23.22    Student   Individual        Ferndown
2        10173730   117.09    Student       Family          Bolton
3        10415201    10.29    Student   Individual         Sudbury
4        10418882    30.11    Student   Individual  Clacton-on-Sea
..            ...      ...        ...          ...             ...
995      10204828    34.75    Student       Family      Eastbourne
996      10423854    12.87    Student       Family         Croydon
997      10467615    16.68    Student  Single Trip  Leamington Spa
998      10597061    34.80    Student  Single Trip     Stourbridge
999      10629056    17.79    Student   Individual          London

[1000 rows x 5 columns]

(this data grid is identical to the previous one)

As well as exporting data to a data grid to do further analysis on it externally, FastStats has built-in support for many kinds of analysis. In the next part, we’ll learn how to create Cubes for carrying out multi-dimensional tabular analysis on our data.