Setvis fundamentals and the Membership class¶
- class setvis.membership.Col(name: str)¶
An alias for Set, used when referring to a ‘column’ (hence, Col) rather than a ‘set’ is more natural, for example, when referring to the set of elements with missing data in a particular column.
- class setvis.membership.Membership(intersection_id_to_columns: DataFrame, intersection_id_to_records: DataFrame, set_mode: bool = False, check: bool = True)¶
A storage-efficient representation of a collection of sets and the elements that belong to them (including their various intersections). The representation is optimised for particular queries related to membership.
An interpretation of sets as ‘columns’ and elements as boolean-valued ‘records’ is often useful, so sets/columns and elements/records are used interchangeably.
Using one of the various named constructors (Membership.from_data_frame(), Membership.from_csv(), Membership.from_membership_data_frame() or Membership.from_membership_csv()) is the preferred way to construct a Membership object.
If directly using the default constructor, the internal representation of the data is passed as two pandas dataframes, with these and the other arguments as given below.
- Parameters:
intersection_id_to_columns (pd.DataFrame) – A dataframe with as many rows as there are unique intersections (patterns of membership or missingness), and a column for each set (or column in the original dataframe)
intersection_id_to_records (pd.DataFrame) – A dataframe with as many rows as there are records. The index of this DataFrame must be named “intersection_id” with a foreign key relationship to intersection_id_to_columns. It doesn’t have to be a unique index (and generally won’t be). This dataframe otherwise has a single column
check (bool) – Check that the internal representation satisfies the required invariants.
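To make the two-table representation concrete, here is a minimal sketch in plain Python (it needs neither pandas nor setvis). The variable names mirror the constructor arguments for readability only, and the id numbering (first-seen order) is an assumption made for this sketch; setvis may build and number its tables differently.

```python
# Boolean membership records: record id -> (in A?, in B?, in C?)
columns = ["A", "B", "C"]
records = {
    0: (True, False, True),
    1: (True, False, False),
    2: (True, False, True),
    3: (False, False, False),
}

# Assign one id per distinct membership pattern (assumed: first-seen order).
pattern_to_id = {}
for pattern in records.values():
    pattern_to_id.setdefault(pattern, len(pattern_to_id))

# One row per unique intersection, with one boolean per set.
intersection_id_to_columns = {
    iid: dict(zip(columns, pattern)) for pattern, iid in pattern_to_id.items()
}

# One row per record: (intersection_id, record_id). The intersection id
# repeats when several records share a pattern, so it is not a unique key.
intersection_id_to_records = [
    (pattern_to_id[pattern], rid) for rid, pattern in records.items()
]
```

Four records collapse to three unique patterns here, which is where the storage saving comes from when many records share a pattern.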
- columns() List[Any]¶
The names of the sets included in the Membership object (or equivalently, the ‘columns’).
- count_intersections() DataFrame¶
Distinct set intersections, and the count of each
- Parameters:
count_col_name – The name of the column in the result holding the count data
- Returns:
A dataframe containing the intersections and the number of times each appears in the dataset
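The counting idea can be sketched in plain Python with collections.Counter. This only illustrates what "distinct intersections and the count of each" means; it is not the setvis implementation, and the sample records are made up for the example.

```python
from collections import Counter

columns = ["A", "B", "C"]
records = [
    (True, False, True),
    (True, False, False),
    (True, False, True),
    (False, False, False),
]

# Count how many records share each distinct membership pattern.
pattern_counts = Counter(records)

for pattern, count in pattern_counts.items():
    print(dict(zip(columns, pattern)), count)
```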
- count_matching_records(intersection_spec: SetExpr) int¶
Equivalent to len(self.matching_records(intersection_spec)), but could have a performance benefit
- drop_columns(selection: List | None) Membership¶
Return a Membership object excluding the given column selection
- drop_intersections(selection: Sequence[int] | None) Membership¶
Return a Membership object excluding the given intersection ids
- drop_records(selection: Sequence[int] | None) Membership¶
Return a Membership object excluding the given record ids
- empty_intersection() Series¶
A helper function that returns a boolean-valued pandas Series indicating the ‘empty’ intersection (True for this intersection_id, False for all others).
The ‘empty’ intersection is the unique intersection excluding all of the original sets.
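As a rough illustration in plain Python, the ‘empty’ intersection is the one whose membership flags are all False. The dictionary below is a made-up stand-in for the table of intersections, and a dict of booleans stands in for the returned pandas Series.

```python
# Hypothetical intersections: intersection id -> membership flag per set
intersections = {
    0: {"A": True, "B": False, "C": True},
    1: {"A": True, "B": False, "C": False},
    2: {"A": False, "B": False, "C": False},
}

# True only for the intersection that excludes every one of the sets.
is_empty = {
    iid: not any(membership.values())
    for iid, membership in intersections.items()
}
```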
- classmethod from_csv(filepath_or_buffer, read_csv_args=None, **kwargs) Membership¶
Construct a Membership object from a csv file
The data is first loaded into a pandas dataframe and passed to from_data_frame().
- Parameters:
filepath_or_buffer (file-like) – The file-like object to load with pandas.read_csv()
read_csv_args (Dict) – A dictionary of keyword arguments forwarded to pandas.read_csv()
**kwargs – All other arguments are forwarded to from_data_frame()
See also
from_data_frame – Load data from a data frame
- classmethod from_data_frame(df: DataFrame, is_missing: Callable[[Any], bool] = pandas.isna, set_mode: bool = False)¶
Construct a Membership object from a dataframe
This function produces a Membership object from a dataframe, interpreting the records according to one of two modes, ‘missingness mode’ or ‘set mode’ (see below).
In either case, the sets under consideration are named by the columns of the data frame, and the elements belonging to one or more of them indicated by the records in the dataframe, with each element named after the index in the dataframe.
How a record is interpreted as belonging to a particular set differs between the two modes.
In missingness mode (the default mode, enabled explicitly by passing set_mode=False), a record belongs to the set named after a column if and only if evaluating is_missing on the entry in that column is truthy. The sets in this mode are used to capture the missingness of data by column. The particular intersection a record belongs to corresponds to the pattern of missingness (which columns contain missing data) in that record.
Example
As an example of using missingness mode, consider a dataframe df, given by
(index)  variety   length
0        NA        5.0
1        “green”   NA
2        NA        1.0
3        “orange”  2.5
There are three missing values in this dataframe, indicated by NA.
Membership.from_data_frame(df) is an object representing the missing data. This is equivalent to the data in the following dataframe (although it isn’t represented in quite this way):
(index)  variety missing?  length missing?
0        True              False
1        False             True
2        True              False
3        False             False
The mapping from the values in the dataframe to booleans is controlled by the is_missing argument.
We observe that:
The records 0 and 2 have the same missingness pattern (of (True, False))
Record 3 has no missing elements
and so on. The other methods on this class can be used to answer such queries efficiently.
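The mapping from values to missingness flags can be reproduced in plain Python as a sanity check on the example. In this sketch None stands in for NA and is_missing is a hypothetical predicate written for the illustration; the real default predicate operates on pandas data.

```python
# The example table, with None standing in for NA.
rows = {
    0: {"variety": None, "length": 5.0},
    1: {"variety": "green", "length": None},
    2: {"variety": None, "length": 1.0},
    3: {"variety": "orange", "length": 2.5},
}


def is_missing(value):
    # Hypothetical predicate for this sketch only.
    return value is None


# Missingness pattern per record: (variety missing?, length missing?)
missing = {
    rid: tuple(is_missing(row[col]) for col in ("variety", "length"))
    for rid, row in rows.items()
}
```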
In set mode (enabled by passing set_mode=True), the membership of a record to a set is indicated by the (boolean) value in the corresponding column.
Note
Set mode is in fact equivalent to passing is_missing = lambda r: r.astype(bool) in ‘missingness mode’ on a subset of columns whose names begin with ‘category@’.
Example
As an example of using set mode, consider a dataframe df, given by
(index)  category@A  category@B  category@C
0        True        False       True
1        True        False       False
2        True        False       True
3        False       False       False
We can construct a Membership object from this dataframe as:
Membership.from_data_frame(df, set_mode=True)
In this example, there are four records (labelled 0–3 by the index column), and three sets (A, B and C).
We can observe the following:
Record 0 is an element of both A and C, but not B
Record 0 is therefore an element of the intersection of A, C and the complement of B
Record 2 belongs to the same ‘intersection’ as record 0
Set B doesn’t contain any elements (it is empty)
Record 3 isn’t a member of any of the sets. It is a member of the ‘empty’ intersection (the intersection of the complements of A, B and C)
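These observations can be checked with a few lines of plain Python. The sketch works directly on the boolean table from the example and does not use setvis; the variable names are illustrative.

```python
columns = ["A", "B", "C"]
records = {
    0: (True, False, True),
    1: (True, False, False),
    2: (True, False, True),
    3: (False, False, False),
}

# Members of each set: the records whose flag for that column is True.
members = {
    name: {rid for rid, flags in records.items() if flags[i]}
    for i, name in enumerate(columns)
}

# Records 0 and 2 share a membership pattern, so they share an intersection.
same_intersection = records[0] == records[2]

# Records in the 'empty' intersection belong to none of the sets.
in_empty_intersection = [rid for rid, flags in records.items() if not any(flags)]
```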
- Parameters:
df (pd.DataFrame) – The input dataframe
is_missing (any Callable that returns a boolean) – A predicate used in ‘missingness mode’ to determine if an element of the dataframe is missing. The default is pd.isnull
set_mode (bool) – Whether the dataframe is interpreted in set mode (True) or missingness mode (False)
See also
from_membership_data_frame – for loading membership information from a dataframe with a different format
- classmethod from_membership_csv(filepath_or_buffer, read_csv_args=None, **kwargs) Membership¶
Construct a Membership object from a ‘membership’ csv file
The data is first loaded into a pandas dataframe and passed to from_membership_data_frame().
- Parameters:
filepath_or_buffer (file-like) – The file-like object to load with pandas.read_csv()
read_csv_args (Dict) – A dictionary of keyword arguments forwarded to pandas.read_csv()
**kwargs – All other keyword arguments are forwarded to from_membership_data_frame()
- classmethod from_membership_data_frame(df, membership_column='set_membership', membership_separator='|', **kwargs) Membership¶
Construct a Membership object from a ‘membership’ dataframe
A membership dataframe is any dataframe with a column indicating membership of named sets, in a particular format.
The column is by default named ‘set_membership’ (but this can be specified with membership_column). Other columns are ignored.
An entry in the membership column must be a string, to be interpreted as the name of the sets that a record belongs to, separated by “|” (the default, changed with the membership_separator argument)
Additional keyword arguments are forwarded to the class constructor.
Example
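    import pandas as pd
    from setvis.membership import Membership

    # Each entry lists the sets a record belongs to, separated by ","
    df = pd.DataFrame({
        "set_membership": ["A", "A,B,C", "B,C", ""]
    })

    df_membership = Membership.from_membership_data_frame(
        df, membership_separator=","
    )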
After running the above, df is
(index)  set_membership
0        “A”
1        “A,B,C”
2        “B,C”
3        “”
and df_membership is an object representing three sets, A (containing records 0 and 1), B (containing 1 and 2) and C (also containing 1 and 2). Record 3 is not a member of any of the sets.
Note
This constructor is careful never to produce a ‘dense’ dataframe, so can be useful when there are many sparse sets to avoid constructing a large intermediate dataframe.
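The parsing of a membership column can be sketched in plain Python as follows. parse_membership is a hypothetical helper written for this illustration, not a setvis function; it maps each set name to the ids of its records, which is itself a sparse form that never builds a dense boolean table.

```python
def parse_membership(entries, separator="|"):
    """Map each set name to the ids of the records that list it."""
    sets = {}
    for record_id, entry in enumerate(entries):
        # An empty entry means the record belongs to no set.
        names = entry.split(separator) if entry else []
        for name in names:
            sets.setdefault(name, set()).add(record_id)
    return sets


sets = parse_membership(["A", "A,B,C", "B,C", ""], separator=",")
```

The result agrees with the description above: A holds records 0 and 1, while B and C each hold records 1 and 2.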
- classmethod from_postgres(conn, relation: str, key: str, schema: str | None = 'public') Membership¶
Construct a Membership object from a postgres database connection.
Currently, membership is determined as with ‘missingness mode’ (see from_data_frame()). An entry in the table is considered missing if it is NULL.
Note
This constructor requires psycopg2 to be installed.
Note
The database connection must have permission to create temporary tables.
- Parameters:
conn (psycopg2 connection object) – the database connection
relation (str) – the name of the relation (table) in the database from which to load data
key (str) – the name of the column to use as the ‘record id’ (must be a unique key)
schema (Optional[str]) – the name of the schema to which the relation belongs (default of public, meaning the public schema).
- intersections() DataFrame¶
Return all the distinct patterns of set membership in this Membership object.
Each element belongs to a unique intersection of, for each set, either the set itself or its complement. For N sets, 2**N such intersections are possible. This function returns all intersections with at least one element.
The value returned is a DataFrame, mapping an ‘intersection id’ to a boolean for each of the original sets, indicating whether an element must be included (True) or excluded (False) from the set to be included in the intersection.
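The difference between the 2**N possible intersections and those actually returned can be illustrated in plain Python. The records are invented for the example, and the sorted order used here is only a convenience of the sketch, not a statement about the order setvis returns.

```python
from itertools import product

columns = ["A", "B", "C"]
records = [
    (True, False, True),
    (True, False, False),
    (True, False, True),
    (False, False, False),
]

# Every possible membership pattern for N sets: 2**N of them.
possible = list(product([False, True], repeat=len(columns)))

# Only the patterns that at least one record actually has.
occupied = sorted(set(records))
```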
- invert_column_selection(selection: Sequence | None) Sequence | None¶
Invert a selection of column names
The result is a selection containing all column names, except those contained in selection.
- invert_intersection_selection(selection: Sequence[int] | None) Sequence[int] | None¶
Invert a selection of intersection ids
The result is a selection containing all intersection ids, except those contained in selection.
- invert_record_selection(selection: Sequence[int] | None) Sequence[int] | None¶
Invert a selection of record ids
The result is a selection containing all record ids, except those contained in selection.
- matching_intersections(intersection_spec: SetExpr) ndarray¶
Return all intersections (by intersection id) that are included in intersection_spec
- matching_records(intersection_spec: SetExpr) ndarray¶
Indicate which records are contained in matching intersections
An intersection matches intersection_spec if the latter is a subset of it (or equal to it).
Return a boolean series, which is True for each index where the data matches the given intersection.
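A much-simplified sketch of the matching rule, in plain Python. It assumes the specification is just a set of set names that must all be included, which is a simplification of SetExpr made for this illustration, and it returns lists of ids rather than the array or boolean series described above.

```python
# Hypothetical data: intersection id -> names of the sets it includes
intersections = {0: {"A", "C"}, 1: {"A"}, 2: set()}
# Hypothetical data: record id -> intersection id
record_to_intersection = {0: 0, 1: 1, 2: 0, 3: 2}


def matching_intersections(spec):
    # An intersection matches when the spec is a subset of (or equal to) it.
    return [iid for iid, included in intersections.items() if spec <= included]


def matching_records(spec):
    matched = set(matching_intersections(spec))
    return [rid for rid, iid in record_to_intersection.items() if iid in matched]
```

Under these assumptions, counting matches is just len(matching_records(spec)), which is the equivalence that count_matching_records() documents.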
- records() ndarray¶
An array containing all record ids
- select_columns(selection: List | None = None) Membership¶
Return a new Membership object for a subset of the columns
A new Membership object is constructed from the given columns in selection (which must be a subset of the columns of the current object). Intersections that are identical under the selection are consolidated.
It does not make sense to compare intersection ids before and after the selection (which relate to a different number of sets).
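Consolidation under a column selection can be sketched in plain Python. The patterns and ids are invented for the example; the sketch only shows why previously distinct intersections can merge, not how setvis renumbers them.

```python
columns = ["A", "B", "C"]
# Hypothetical intersections before the selection: id -> membership pattern
patterns = {
    0: (True, False, True),
    1: (True, False, False),
    2: (False, False, False),
}

selection = ["A", "B"]
keep = [columns.index(name) for name in selection]

# Project each pattern onto the selected columns and group the old ids
# that become indistinguishable once column C is dropped.
projected = {}
for old_id, pattern in patterns.items():
    reduced = tuple(pattern[i] for i in keep)
    projected.setdefault(reduced, []).append(old_id)
```

Old intersections 0 and 1 differ only in C, so they collapse into one pattern over A and B.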
- select_intersections(selection: Sequence | None = None) Membership¶
Return a Membership object with the given subset of intersections
A Membership object is returned, based on the given intersections in selection (which must be a subset of the intersections of the current object). A selection of None corresponds to every intersection being selected, and the original object is returned.
The id of an intersection in the returned object is the same as in the original object before the selection was taken.
- select_records(selection: Sequence[int] | None = None) Membership¶
Return a Membership object with the given subset of records
A Membership object is returned, but only containing the given records in selection (of course, these must be records present in the current object - it is an error to select a record that isn’t present). A selection of None corresponds to every record being selected, and the original object is returned.
An ‘intersection id’ after the selection is made is consistent with the corresponding intersection before the selection. However, an intersection id before the selection may not be present after the selection (if there are no records in that particular intersection).
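The behaviour of intersection ids under a record selection can be illustrated in plain Python. The mapping below is a made-up stand-in for the record table.

```python
# Hypothetical data: record id -> intersection id
record_to_intersection = {0: 0, 1: 1, 2: 0, 3: 2}

selection = [0, 2, 3]

# Selecting an absent record raises KeyError in this sketch, mirroring
# the rule that it is an error to select a record that isn't present.
selected = {rid: record_to_intersection[rid] for rid in selection}

# Ids keep their meaning, but an id with no remaining records disappears.
remaining_ids = sorted(set(selected.values()))
```

Intersection 1 contained only record 1, so it is no longer present after the selection, while ids 0 and 2 mean the same as before.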
- setvis.membership._invert_selection(universe: Sequence, selection: Sequence | None)¶
Invert the given selection (with respect to the universe of possible values)
A selection is either:
a sequence of distinct values from a ‘universe’ of possible values, or
the value None, meaning a selection containing the entire universe.
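A minimal sketch of inverting a selection, in plain Python. It assumes that the inverse of None (the whole universe) is the empty selection and that the result keeps the order of the universe; both are assumptions of this illustration rather than documented behaviour.

```python
def invert_selection(universe, selection):
    """Return the values of the universe that are not in the selection."""
    if selection is None:
        # Assumed: the inverse of 'everything' is the empty selection.
        return []
    excluded = set(selection)
    return [value for value in universe if value not in excluded]


inverted = invert_selection([0, 1, 2, 3], [1, 3])
```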
- setvis.membership.selection_to_series(universe: Sequence, selection: Sequence | None, sort: bool = True)¶
Convert a sequence of values (selection) into a boolean pd.Series indexed by the universe of possible values (as given by universe).
If sort is True (the default), the result has its index sorted.
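The conversion can be sketched without pandas by using a dict of booleans in place of the Series. selection_to_flags is a hypothetical name for this illustration, and treating None as "everything selected" follows the definition of a selection given above; the real function may differ in detail.

```python
def selection_to_flags(universe, selection, sort=True):
    """Map every value in the universe to whether it is selected."""
    keys = sorted(universe) if sort else list(universe)
    if selection is None:
        # None means the selection contains the entire universe.
        return {key: True for key in keys}
    chosen = set(selection)
    return {key: key in chosen for key in keys}


flags = selection_to_flags([3, 1, 2], [2])
```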