Data Grid

Introduction

The DataGrid object corresponds to a FastStats Data Grid and is an exported set of data from the FastStats system.

The data to export is specified by:

  • a resolve table – each row in the data grid corresponds to a single record in this table

  • a set of columns – these are the fields of data returned for each record

  • a base selection (optional) – this filters records in the table to only return ones that match the given criteria

There are also other parameters that determine things such as how many rows of data to return.

Note

The data grid functionality is still under active development, and so is currently subject to a couple of limitations:

  • The data grid columns must be FastStats variables (expressions and other more complex column types are not currently supported)

  • Array and FlagArray variables are not currently supported as columns

Basic use

Setting up variables:

>>> bookings = my_session.tables["Bookings"]
>>> urn = bookings["Booking URN"]
>>> dest = bookings["Destination"]
>>> trav = bookings["Travel Date"]
>>> cost = bookings["Cost"]

Creating a data grid:

>>> datagrid = bookings.datagrid([urn, dest, trav, cost])

Converting to a Pandas DataFrame:

>>> df = datagrid.to_df()
>>> df.head()
  Booking URN   Destination Travel Date     Cost
0    10001265        France  2019-05-15  1392.35
1    10001266        France  2019-06-05   780.34
2    10011532       Germany  2021-08-29   181.68
3    10011533       Germany  2021-08-21   300.67
4    10015830  Unclassified  2016-05-02   228.70

Specifying the number of rows to return:

>>> datagrid = bookings.datagrid([urn, dest, trav, cost], max_rows=100)
>>> df = datagrid.to_df()
>>> len(df)
100

Using a base selection to filter the records:

>>> sweden = dest == "29"
>>> sweden_datagrid = sweden.datagrid([urn, dest, trav, cost])
>>> sweden_df = sweden_datagrid.to_df()
>>> sweden_df.head()
  Booking URN Destination Travel Date     Cost
0    10172319      Sweden  2020-05-15  1201.81
1    10384970      Sweden  2018-04-05   344.30
2    10421011      Sweden  2019-07-11   322.89
3    10425298      Sweden  2020-10-10   880.02
4    10479109      Sweden  2020-09-21   172.91

Using a base selection from a different table:

>>> households = my_session.tables["Households"]
>>> town = households["Town"]
>>> postcode = households["Postcode"]
>>> manchester = households["Region"] == "13"
>>> manc_datagrid = manchester.datagrid([urn, dest, trav, cost, town, postcode], table=bookings)
>>> manc_df = manc_datagrid.to_df()
>>> manc_df.head()
  Booking URN    Destination Travel Date     Cost    Town Postcode
0    10172319         Sweden  2020-05-15  1201.81  Bolton  BL1 5XB
1    10172320  United States  2020-04-14  1616.80  Bolton  BL1 5XB
2    10173729         France  2020-08-19   581.71  Bolton  BL1 8JJ
3    10173730         France  2020-08-09  2224.70  Bolton  BL1 8JJ
4    10177047         France  2021-05-07   686.53  Bolton  BL3 5LX

API reference

class apteco.datagrid.DataGrid(columns, selection=None, table=None, *, session=None)

Create a data grid.

Tip

The datagrid() methods on tables and selections are wrappers around this class. It is recommended to prefer those over instantiating this class directly, as they generally provide a simpler interface.

Parameters
  • columns (list[Variable]) – variables to use as columns in the data grid. These must be from table or from one of its ancestor tables.

  • selection (Clause) – base selection to apply to the data grid. The table of this selection must be table or a ‘related’ table – either an ancestor or descendant.

  • table (Table) – resolve table of the data grid. Each row of the data grid will correspond to a record from this table.

  • max_rows (int) – maximum number of records to return (default is 1000).

  • session (Session) – current Apteco API session.

At least one of selection or table must be given:

  • If only selection is given, then table will be set to the resolve table of the selection.

  • If both are given and the resolve table of selection isn’t table, then the records returned in the data grid are determined by mapping the selection to the required table by applying ANY/THE logic as necessary. This matches the behaviour when applying an underlying selection to a data grid in the FastStats application. The mapping described here happens in the FastStats data engine and does not change the selection on the DataGrid.

Tip

The following two data grids are equivalent:

>>> datagrid1 = DataGrid(
...     columns,
...     selection=manchester,
...     table=bookings,
...     session=my_session,
... )
>>> datagrid2 = DataGrid(
...     columns,
...     selection=(bookings * manchester),
...     session=my_session,
... )

They both return a data grid of bookings made by people from households in the Greater Manchester region.

Note

The raw data is fetched from the Apteco API when the DataGrid object is initialised. It is held on the object in the _data attribute as a list of tuples but this is not considered public, and so to work with the data you should convert it to your desired output format. The only format currently supported is a Pandas DataFrame.

to_df()

Return the data as a Pandas DataFrame.

The DataFrame is configured such that:

  • the index is a RangeIndex

  • the columns (headings) are the variable descriptions

  • data is returned as its corresponding Pandas column type or native Python type

  • Selector variable columns contain strings of the category descriptions

>>> pol_num = policies["Policy Number"]
>>> premium = policies["Premium"]
>>> cover = policies["Cover"]
>>> dob = people["DOB"]
>>> postcode = households["Postcode"]
>>> policies_datagrid = policies.datagrid([pol_num, premium, cover, dob, postcode])
>>> policies_datagrid.to_df()
    Policy Number  Premium        Cover         DOB  Postcode
0        10001265    87.02   Individual  1975-02-09  AB10 1XL
1        10036397   123.30   Individual  1972-11-25    B6 4TN
2        10078565   143.20   Multi Trip  1971-10-05   B74 2QX
3        10078566    29.23       Family  1971-10-05   B74 2QX
4        10078567    14.65  Single Trip  1999-11-14   B74 2QX
..            ...      ...          ...         ...       ...
995      11192414    17.83  Single Trip         NaT  ME15 0QB
996      11205561    10.43   Individual  1976-01-30  MK18 7ZT
997      11242733    33.56   Individual  1977-01-22   N16 7NJ
998      11252163    13.57   Individual  1997-09-12   NE2 2DJ
999      11262841    11.19   Individual  1978-08-03  NE20 9QJ

[1000 rows x 5 columns]

See also

For more details on working with a Pandas DataFrame see the official Pandas documentation.