Selections

Introduction

A FastStats Selection is represented in py-apteco by a Clause object, possibly containing other nested or connected Clause objects, which combine to make the rule defining a set of records to be selected. The table from which to select the records is also embedded in the rule.

As well as the fundamental action of counting a selection to see how many records in the table match the conditions defined by the rule, selections form the basis of many other pieces of analysis and can be used in many different contexts.

Basic use

Setting up variables:

>>> from datetime import date
>>> dest = bookings["Destination"]
>>> trav = bookings["Travel Date"]
>>> cost = bookings["Cost"]

Creating a selection:

>>> sweden = dest == "29"
>>> at_least_2k = cost >= 2000
>>> before_2020 = trav <= date(2019, 12, 31)

Counting a selection:

>>> sweden.count()
25207

Combining selections:

>>> sweden_before_2020 = sweden & before_2020
>>> sweden_or_expensive = sweden | at_least_2k

Changing table:

>>> been_to_sweden = people * sweden

Taking sample:

>>> random_3_pct_sweden = sweden.sample(frac=0.03, sample_type="Random")

Applying limit:

>>> top_1000_sweden_by_cost = sweden.limit(1000, by=cost)

API Reference

Core attributes & methods

table: Table

resolve table of this selection

table_name: str

name of the resolve table of this selection

count()

return the number of records in this selection

Sampling and limits

sample(n=None, frac=None, sample_type="Random", skip_first=0, *, label=None)

Take a sample of records from the selection.

Parameters
  • n (int) – Number of records to return from selection. Cannot be used with frac.

  • frac (float) – Proportion of records to return out of whole selection, given as a number between 0 and 1. Cannot be used with n.

  • sample_type ({'Random', 'Stratified', 'First'}) – Type of sampling to use. Default is ‘Random’.

  • skip_first (int) – Number of records to skip from start of selection. Default is 0.

  • label (str) – Optional textual name for this selection clause.

limit(n=None, frac=None, by=None, ascending=None, per=None, *, label=None)

Limit the selection to a subset of records.

Parameters
  • n (int or tuple) – Number of records to return from selection. Cannot be used with frac. If by is given, a tuple of two integers (i, j) may be passed to select from the ith to the jth records.

  • frac (float or tuple) – Proportion of records to return out of whole selection, given as a number between 0 and 1. Cannot be used with n. If by is given, a tuple of two numbers (p, q) may be passed to select the proportion of records between them. For example frac=(0.1, 0.25) with ascending=False would give the top 10–25% of records.

  • by (Variable) – Variable specifying order in which records are selected.

  • ascending (bool, optional) – Whether to order records ascending (True) or descending (False) when selecting limit. Must be used with by. Default is False.

  • per (Table or Variable) – Return n records per this entity. Cannot be used with frac. If per is a Table, it must be a parent or ancestor table of the selection’s table, and for each record on this table n child records are returned from the selection. If per is a Variable, n records are returned for each value of this variable. If per is a selector variable, this means n records for each selector category.

  • label (str) – Optional textual name for this selection clause.

Data Grids and Cubes

datagrid(columns, table=None, max_rows=1000)

Build a data grid with this selection underlying it.

>>> cols = (
        [people[var] for var in ("Initial", "Surname")]
        + [bookings[var] for var in ("boDate", "boCost", "boDest")]
    )
>>> northern = households["Region"] == ["01", "02", "13"]
>>> datagrid = bookings.datagrid(cols, northern, max_rows=100)
>>> datagrid.to_df().head()
  Initial   Surname Booking Date     Cost    Destination
0       A     Allen   2020-08-11   551.81         France
1       W   Livesey   2021-08-02  1167.57   Sierra Leone
2       W   Livesey   2021-08-19   562.56  United States
3       W   Livesey   2021-08-08   960.55      Australia
4       O  Robinson   2021-08-22   455.60  United States

See also

This method is a wrapper around the DataGrid class. Refer to the Data Grid documentation for more details.

cube(dimensions, measures=None, table=None)

Build a cube with this selection underlying it.

>>> cube = bookings.cube(
        [people["Occupation"], bookings["Product"]],
        selection=(bookings["Cost"] > 200),
    )
>>> df = cube.to_df()
>>> df.unstack().rename(columns=lambda x: x.split(" ")[0])
                    Bookings
Product         Accommodation  Flight Package
Occupation
Director                 1714    8477   24585
Manager                  4422   28566  109725
Manual Worker            4039   27104   77547
Professional             1806    9728   40072
Public Sector           18308   82437  249637
Retail Worker            9864   30853  126350
Retired                 12750   47333   86594
Sales Executive         35214  152911  407288
Student                  6553   27665  145156
Unemployed               8999   30648   57211

See also

This method is a wrapper around the Cube class. Refer to the Cube documentation for more details.