Selections
Introduction
A FastStats Selection is represented in py-apteco by a Clause object, possibly containing other nested or connected Clause objects, which combine to make the rule defining a set of records to be selected.
The table from which to select the records is also embedded in the rule.
The most fundamental operation is counting a selection to see how many records in the table match the conditions defined by its rule. Selections also form the basis of other pieces of analysis, such as the data grids and cubes described later on this page.
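To make the idea of nested and connected clauses concrete, here is a minimal sketch of a rule built as a tree of objects and counted against in-memory records. It is not py-apteco's implementation (py-apteco evaluates selections on a FastStats system through a session); the names `ToyClause`, `ToyCriterion`, `ToyBoolean` and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


class ToyClause:
    """Hypothetical stand-in for a selection rule; not py-apteco's Clause."""

    def matches(self, record: dict) -> bool:
        raise NotImplementedError

    def __and__(self, other: "ToyClause") -> "ToyClause":
        # Combining two clauses produces a new clause that nests both.
        return ToyBoolean("AND", self, other)

    def __or__(self, other: "ToyClause") -> "ToyClause":
        return ToyBoolean("OR", self, other)

    def count(self, records: list) -> int:
        # Counting means testing every record against the whole rule.
        return sum(1 for record in records if self.matches(record))


@dataclass
class ToyCriterion(ToyClause):
    """Leaf clause: a test applied to one field of a record."""

    field: str
    test: Callable[[object], bool]

    def matches(self, record: dict) -> bool:
        return self.test(record[self.field])


@dataclass
class ToyBoolean(ToyClause):
    """Branch clause: two nested clauses joined by AND or OR."""

    operator: str
    left: ToyClause
    right: ToyClause

    def matches(self, record: dict) -> bool:
        if self.operator == "AND":
            return self.left.matches(record) and self.right.matches(record)
        return self.left.matches(record) or self.right.matches(record)


# Invented records, standing in for rows of a bookings table.
toy_bookings = [
    {"Destination": "29", "Cost": 2500},
    {"Destination": "29", "Cost": 800},
    {"Destination": "07", "Cost": 3100},
    {"Destination": "07", "Cost": 450},
]

sweden = ToyCriterion("Destination", lambda value: value == "29")
at_least_2k = ToyCriterion("Cost", lambda value: value >= 2000)

print(sweden.count(toy_bookings))                  # 2
print((sweden & at_least_2k).count(toy_bookings))  # 1
print((sweden | at_least_2k).count(toy_bookings))  # 3
```

The operators return new clause objects rather than modifying the operands, so a combined rule can itself be combined again.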
Basic use
Setting up variables:
>>> from datetime import date
>>> dest = bookings["Destination"]
>>> trav = bookings["Travel Date"]
>>> cost = bookings["Cost"]
Creating a selection:
>>> sweden = dest == "29"
>>> at_least_2k = cost >= 2000
>>> before_2020 = trav <= date(2019, 12, 31)
Counting a selection:
>>> sweden.count()
25207
Combining selections:
>>> sweden_before_2020 = sweden & before_2020
>>> sweden_or_expensive = sweden | at_least_2k
Changing the table:
>>> been_to_sweden = people * sweden
Taking a sample:
>>> random_3_pct_sweden = sweden.sample(frac=0.03, sample_type="Random")
Applying a limit:
>>> top_1000_sweden_by_cost = sweden.limit(1000, by=cost)
API Reference
Core attributes & methods
- table : Table
  resolve table of this selection
- table_name : str
  name of the resolve table of this selection
- count()
  return the number of records in this selection
Sampling and limits
- sample(n=None, frac=None, sample_type="Random", skip_first=0, *, label=None)
  Take a sample of records from the selection.
- Parameters
n (int) – Number of records to return from selection. Cannot be used with frac.
frac (float) – Proportion of records to return out of whole selection, given as a number between 0 and 1. Cannot be used with n.
sample_type ({'Random', 'Stratified', 'First'}) – Type of sampling to use. Default is ‘Random’.
skip_first (int) – Number of records to skip from start of selection. Default is 0.
label (str) – Optional textual name for this selection clause.
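The way these parameters interact can be sketched on a plain Python list. This is an illustrative model only, not how py-apteco or FastStats performs sampling: the function `toy_sample` and its `seed` argument are hypothetical, 'Stratified' sampling is not modelled, and the sketch assumes that `frac` is rounded to the nearest whole record and that omitting both `n` and `frac` is an error.

```python
import random


def toy_sample(records, n=None, frac=None, sample_type="Random",
               skip_first=0, seed=None):
    """Illustrative model of n, frac, sample_type and skip_first."""
    # Assumption: exactly one of n and frac must be given.
    if (n is None) == (frac is None):
        raise ValueError("Specify exactly one of n or frac")
    if frac is not None and not 0 <= frac <= 1:
        raise ValueError("frac must be between 0 and 1")

    records = list(records)
    # frac is a proportion of the whole selection (rounding is an assumption).
    size = n if n is not None else round(len(records) * frac)

    # skip_first removes records from the start before sampling.
    pool = records[skip_first:]
    size = min(size, len(pool))

    if sample_type == "First":
        return pool[:size]
    if sample_type == "Random":
        # seed is a hypothetical extra, only here to make runs repeatable.
        return random.Random(seed).sample(pool, size)
    raise NotImplementedError(f"{sample_type!r} is not modelled in this sketch")


first_five = toy_sample(range(100), n=5, sample_type="First", skip_first=10)
print(first_five)  # [10, 11, 12, 13, 14]

three_pct = toy_sample(range(100), frac=0.03, seed=1)
print(len(three_pct))  # 3
```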
- limit(n=None, frac=None, by=None, ascending=None, per=None, *, label=None)
  Limit the selection to a subset of records.
- Parameters
n (int or tuple) – Number of records to return from selection. Cannot be used with frac. If by is given, a tuple of two integers (i, j) may be passed to select from the ith to the jth records.
frac (float or tuple) – Proportion of records to return out of whole selection, given as a number between 0 and 1. Cannot be used with n. If by is given, a tuple of two numbers (p, q) may be passed to select the proportion of records between them. For example frac=(0.1, 0.25) with ascending=False would give the top 10–25% of records.
by (Variable) – Variable specifying order in which records are selected.
ascending (bool, optional) – Whether to order records ascending (True) or descending (False) when selecting limit. Must be used with by. Default is False.
per (Table or Variable) – Return n records per this entity. Cannot be used with frac. If per is a Table, it must be a parent or ancestor table of the selection’s table, and for each record on this table n child records are returned from the selection. If per is a Variable, n records are returned for each value of this variable. If per is a selector variable, this means n records for each selector category.
label (str) – Optional textual name for this selection clause.
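As a rough illustration of how `n`, `by`, `ascending` and `per` combine, the sketch below applies the same ideas to a list of dictionaries. It is a model under stated assumptions rather than py-apteco's behaviour: `toy_limit` is a hypothetical function, `by` and `per` are plain field names instead of Variable objects, `frac` is left out, and a tuple `(i, j)` is assumed to be 1-based and inclusive at both ends.

```python
from collections import defaultdict


def toy_limit(records, n, by=None, ascending=False, per=None):
    """Illustrative model of n, by, ascending and per; frac is omitted."""
    if isinstance(n, tuple) and by is None:
        raise ValueError("A range (i, j) only makes sense with an ordering")

    rows = list(records)
    if by is not None:
        # Descending order by default, so the "top" records come first.
        rows.sort(key=lambda row: row[by], reverse=not ascending)

    # Assumption: (i, j) means the ith to jth records, counting from 1.
    start, stop = (n[0] - 1, n[1]) if isinstance(n, tuple) else (0, n)

    if per is None:
        return rows[start:stop]

    # With per, the limit is applied separately within each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[per]].append(row)
    return [row for group in groups.values() for row in group[start:stop]]


# Invented records, standing in for rows of a bookings table.
toy_bookings = [
    {"Person": "A", "Cost": 500},
    {"Person": "A", "Cost": 1200},
    {"Person": "B", "Cost": 300},
    {"Person": "B", "Cost": 2500},
    {"Person": "B", "Cost": 900},
]

top_2 = toy_limit(toy_bookings, 2, by="Cost")
print([row["Cost"] for row in top_2])  # [2500, 1200]

second_to_third = toy_limit(toy_bookings, (2, 3), by="Cost")
print([row["Cost"] for row in second_to_third])  # [1200, 900]

top_1_each = toy_limit(toy_bookings, 1, by="Cost", per="Person")
print(sorted(row["Cost"] for row in top_1_each))  # [1200, 2500]
```

Grouping after sorting keeps each group in the chosen order, so slicing a group yields its top records by the ordering field.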
Data Grids and Cubes
- datagrid(columns, table=None, max_rows=1000)
  Build a data grid with this selection underlying it.
>>> cols = (
...     [people[var] for var in ("Initial", "Surname")]
...     + [bookings[var] for var in ("boDate", "boCost", "boDest")]
... )
>>> northern = households["Region"] == ["01", "02", "13"]
>>> datagrid = bookings.datagrid(cols, northern, max_rows=100)
>>> datagrid.to_df().head()
  Initial   Surname Booking Date     Cost    Destination
0       A     Allen   2020-08-11   551.81         France
1       W   Livesey   2021-08-02  1167.57   Sierra Leone
2       W   Livesey   2021-08-19   562.56  United States
3       W   Livesey   2021-08-08   960.55      Australia
4       O  Robinson   2021-08-22   455.60  United States
See also
This method is a wrapper around the DataGrid class. Refer to the Data Grid documentation for more details.
- cube(dimensions, measures=None, table=None)
  Build a cube with this selection underlying it.
>>> cube = bookings.cube(
...     [people["Occupation"], bookings["Product"]],
...     selection=(bookings["Cost"] > 200),
... )
>>> df = cube.to_df()
>>> df.unstack().rename(columns=lambda x: x.split(" ")[0])
                 Bookings
Product          Accommodation  Flight Package
Occupation
Director                  1714    8477   24585
Manager                   4422   28566  109725
Manual Worker             4039   27104   77547
Professional              1806    9728   40072
Public Sector            18308   82437  249637
Retail Worker             9864   30853  126350
Retired                  12750   47333   86594
Sales Executive          35214  152911  407288
Student                   6553   27665  145156
Unemployed                8999   30648   57211
See also
This method is a wrapper around the Cube class. Refer to the Cube documentation for more details.
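The cube example above produces a count of bookings for each combination of dimension categories, restricted to the records in the selection. A minimal sketch of that idea in plain Python follows. It is a conceptual model only, not py-apteco's implementation: `toy_cube` and the sample records are hypothetical, and only a simple record count is modelled, not other measures.

```python
from collections import Counter


def toy_cube(records, dimensions, selection=lambda record: True):
    """Count selected records for each combination of dimension values."""
    return Counter(
        tuple(record[dimension] for dimension in dimensions)
        for record in records
        if selection(record)
    )


# Invented records, standing in for bookings joined to their person.
toy_bookings = [
    {"Occupation": "Student", "Product": "Flight", "Cost": 250},
    {"Occupation": "Student", "Product": "Flight", "Cost": 150},
    {"Occupation": "Student", "Product": "Package", "Cost": 900},
    {"Occupation": "Retired", "Product": "Flight", "Cost": 400},
]

counts = toy_cube(
    toy_bookings,
    ["Occupation", "Product"],
    selection=lambda record: record["Cost"] > 200,
)

print(counts[("Student", "Flight")])   # 1
print(counts[("Student", "Package")])  # 1
print(counts[("Retired", "Package")])  # 0
```

A `Counter` returns 0 for combinations that never occur, which mirrors an empty cell in a cross-tabulation.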