Setvis fundamentals and the Membership class

class setvis.membership.Col(name: str)

An alias for Set, for use where referring to a ‘column’ (hence Col) is more natural than referring to a ‘set’, for example when referring to the set of elements with missing data in a particular column.

class setvis.membership.Membership(intersection_id_to_columns: DataFrame, intersection_id_to_records: DataFrame, set_mode: bool = False, check: bool = True)

A storage-efficient representation of a collection of sets and the elements that belong to them (including their various intersections). The representation is optimised for particular queries related to membership.

An interpretation of sets as ‘columns’ and elements as boolean-valued ‘records’ is often useful, so sets/columns and elements/records are used interchangeably.

Using one of the various named constructors (Membership.from_data_frame(), Membership.from_csv(), Membership.from_membership_data_frame() or Membership.from_membership_csv()) is the preferred way to construct a Membership object.

If directly using the default constructor, the internal representation of the data is passed as two pandas dataframes, with these and the other arguments as given below.

Parameters:
  • intersection_id_to_columns (pd.DataFrame) – A dataframe with as many rows as there are unique intersections (patterns of membership or missingness), and a column for each set (or column in the original dataframe)

  • intersection_id_to_records (pd.DataFrame) – A dataframe with as many rows as there are records. The index of this DataFrame must be named “intersection_id” with a foreign key relationship to intersection_id_to_columns. It doesn’t have to be a unique index (and generally won’t be). This dataframe otherwise has a single column

  • check (bool) – Check that the internal representation satisfies the required invariants.
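As a rough illustration of the two-table layout described above, the following pandas-only sketch builds a pair of dataframes of the stated shape and joins them back into a dense record-by-set table. The data, and in particular the column name "_record_id", are assumptions made for this sketch; the column name that Membership actually expects is not stated here.

```python
import pandas as pd

# Hypothetical data: four records and two sets ("variety", "length").
# One row per distinct membership pattern, one boolean column per set.
intersection_id_to_columns = pd.DataFrame(
    {"variety": [True, False, False], "length": [False, True, False]},
    index=pd.Index([0, 1, 2], name="intersection_id"),
)

# One row per record. The index is not unique: it points at a pattern above.
# NOTE: the column name "_record_id" is an assumption for this sketch.
intersection_id_to_records = pd.DataFrame(
    {"_record_id": [0, 2, 1, 3]},
    index=pd.Index([0, 0, 1, 2], name="intersection_id"),
)

# Foreign-key check: every intersection id used by a record must exist.
fk_ok = intersection_id_to_records.index.isin(
    intersection_id_to_columns.index
).all()

# Joining on the shared index recovers the dense boolean table.
dense = (
    intersection_id_to_records.join(intersection_id_to_columns)
    .set_index("_record_id")
    .sort_index()
)
```

The saving comes from storing each distinct pattern once: records 0 and 2 share intersection id 0, so their pattern occupies a single row of the first table.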

columns() List[Any]

The names of the sets included in the Membership object (or equivalently, the ‘columns’).

count_intersections() DataFrame

Distinct set intersections, and the count of each

Parameters:

count_col_name – The name of the column in the result holding the count data

Returns:

A dataframe containing the intersections and the number of times each appears in the dataset

count_matching_records(intersection_spec: SetExpr) int

Equivalent to len(self.matching_records(intersection_spec)), but could have a performance benefit

drop_columns(selection: List | None) Membership

Return a Membership object excluding the given column selection

See also

select_columns

drop_intersections(selection: Sequence[int] | None) Membership

Return a Membership object excluding the given intersection ids

drop_records(selection: Sequence[int] | None) Membership

Return a Membership object excluding the given record ids

See also

select_records

empty_intersection() Series

A helper function that returns a boolean-valued pandas Series indicating the ‘empty’ intersection (True for this intersection_id, False for all others).

The ‘empty’ intersection is the unique intersection excluding all of the original sets.

classmethod from_csv(filepath_or_buffer, read_csv_args=None, **kwargs) Membership

Construct a Membership object from a csv file

The data is first loaded into a pandas dataframe and then passed to from_data_frame().

Parameters:
  • filepath_or_buffer (file-like) – The file-like to load with pandas.read_csv()

  • read_csv_args (Dict) – A dictionary of keyword arguments forwarded to pandas.read_csv()

  • **kwargs – All other arguments are forwarded to from_data_frame()

See also

from_data_frame

Load data from a data frame
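The forwarding of read_csv_args can be pictured with pandas alone. In this sketch the semicolon-separated text and the variable names are made up for illustration; only pandas.read_csv() and its sep argument are real API.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated input with some empty (missing) fields.
csv_text = "variety;length\n;5.0\ngreen;\n;1.0\norange;2.5\n"

# Keyword arguments destined for pandas.read_csv(), as read_csv_args would be.
read_csv_args = {"sep": ";"}

# Roughly the first step described above: load the file with pandas.
df = pd.read_csv(io.StringIO(csv_text), **read_csv_args)

# Number of missing entries per column in the loaded dataframe.
missing_per_column = df.isna().sum().to_dict()
```

Going by the signature above, the corresponding setvis call would presumably be Membership.from_csv(filepath_or_buffer, read_csv_args={"sep": ";"}).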

classmethod from_data_frame(df: DataFrame, is_missing: Callable[[Any], bool] = pandas.isna, set_mode: bool = False)

Construct a Membership object from a dataframe

This function produces a Membership object from a dataframe, interpreting the records according to one of two modes, ‘missingness mode’ or ‘set mode’ (see below).

In either case, the sets under consideration are named by the columns of the data frame, and the elements belonging to one or more of them indicated by the records in the dataframe, with each element named after the index in the dataframe.

How a record is interpreted as belonging to a particular set differs between the two modes.

In missingness mode (the default mode, enabled explicitly by passing set_mode=False), a record belongs to the set named after a column if and only if evaluating is_missing on the entry in that column is truthy. The sets in this mode capture the missingness of data by column. The particular intersection a record belongs to corresponds to the pattern of missingness (which columns contain missing data) in that record.

Example

As an example of using missingness mode, consider a dataframe df, given by

(index)  variety   length
0        NA        5.0
1        “green”   NA
2        NA        1.0
3        “orange”  2.5

There are three missing values in this dataframe, indicated by NA.

Membership.from_data_frame(df) is an object representing the missing data. This is equivalent to the data in the following dataframe (although it isn’t represented in quite this way):

(index)  variety missing?  length missing?
0        True              False
1        False             True
2        True              False
3        False             False

The mapping from the values in the dataframe to booleans is controlled by the is_missing argument.

We observe that:

  • The records 0 and 2 have the same missingness pattern (of (True, False))

  • Record 3 has no missing elements

and so on. The other methods on this class can be used to answer such queries efficiently.
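The observations above can be checked with plain pandas. This is only a sketch of the idea using the example data, not a description of how Membership stores or computes it; all variable names are illustrative.

```python
from collections import Counter

import pandas as pd

# The example dataframe: None marks a missing value (NA).
df = pd.DataFrame(
    {
        "variety": [None, "green", None, "orange"],
        "length": [5.0, None, 1.0, 2.5],
    }
)

# Boolean table: True where a value is missing (the default predicate).
missing = df.isna()

# Each record's missingness pattern as a tuple (variety missing?, length missing?).
patterns = list(missing.itertuples(index=False, name=None))

# How many records share each pattern.
pattern_counts = Counter(patterns)

# Records with no missing entries at all.
complete_records = missing.index[~missing.any(axis=1)].tolist()
```

Here pattern_counts maps (True, False) to 2, reflecting that records 0 and 2 share a pattern, and complete_records contains only record 3.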

In set mode (enabled by passing set_mode=True), the membership of a record to a set is indicated by the (boolean) value in the corresponding column.

Note

Set mode is in fact equivalent to passing is_missing = lambda r: r.astype(bool) in ‘missingness mode’ on a subset of columns whose names begin with ‘category@’.

Example

As an example of using set mode, consider a dataframe df, given by

(index)  category@A  category@B  category@C
0        True        False       True
1        True        False       False
2        True        False       True
3        False       False       False

We can construct a Membership object from this dataframe as:

Membership.from_data_frame(df, set_mode=True)

In this example, there are four records (labelled 0–3 by the index column), and three sets (A, B and C).

We can observe the following:

  • Record 0 is an element of both A and C, but not B

  • Record 0 is therefore an element of the intersection of A, C and the complement of B

  • Record 2 belongs to the same ‘intersection’ as record 0

  • Set B doesn’t contain any elements (it is empty)

  • Record 3 isn’t a member of any of the sets. It is a member of the ‘empty’ intersection (the intersection of the complements of A, B and C)
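The same observations can be written out with ordinary Python sets, which makes the ‘intersection with complements’ reading explicit. The variable names are illustrative and nothing here uses setvis itself.

```python
# The universe of records and the three sets from the example table.
records = {0, 1, 2, 3}
A = {0, 1, 2}
B = set()  # no record has category@B set to True
C = {0, 2}

# Intersection of A, C and the complement of B: records in A and C but not B.
a_and_c_not_b = A & C & (records - B)

# The 'empty' intersection: records outside every set.
empty_intersection = (records - A) & (records - B) & (records - C)

# Sets that contain no elements.
empty_sets = [name for name, s in {"A": A, "B": B, "C": C}.items() if not s]
```

Records 0 and 2 land in the same intersection, record 3 is the only member of the ‘empty’ intersection, and B is the only empty set.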

Parameters:
  • df (pd.DataFrame) – The input dataframe

  • is_missing (any Callable that returns a boolean) – A predicate used in ‘missingness mode’ to determine whether an element of the dataframe is missing. The default is pd.isna (of which pd.isnull is an alias)

  • set_mode (bool) – Whether the dataframe is interpreted in set mode (True) or missingness mode (False)

See also

from_membership_data_frame

for loading membership information from a dataframe with a different format

classmethod from_membership_csv(filepath_or_buffer, read_csv_args=None, **kwargs) Membership

Construct a Membership object from a ‘membership’ csv file

The data is first loaded into a pandas dataframe and then passed to from_membership_data_frame().

Parameters:
  • filepath_or_buffer (file-like) – The file-like to load with pandas.read_csv()

  • read_csv_args (Dict) – A dictionary of keyword arguments forwarded to pandas.read_csv()

  • **kwargs – All other keyword arguments are forwarded to from_membership_data_frame()

classmethod from_membership_data_frame(df, membership_column='set_membership', membership_separator='|', **kwargs) Membership

Construct a Membership object from a ‘membership’ dataframe

A membership dataframe is any dataframe with a column indicating membership of named sets, in a particular format.

The column is by default named ‘set_membership’ (but this can be specified with membership_column). Other columns are ignored.

An entry in the membership column must be a string, interpreted as the names of the sets that a record belongs to, separated by “|” (the default, which can be changed with the membership_separator argument).

Additional keyword arguments are forwarded to the class constructor.

Example

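import pandas as pd
from setvis.membership import Membership

# Each entry lists the sets a record belongs to, separated by ","
df = pd.DataFrame({
    "set_membership": ["A", "A,B,C", "B,C", ""]
})

df_membership = Membership.from_membership_data_frame(
    df, membership_separator=","
)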

After running the above, df is

(index)  set_membership
0        “A”
1        “A,B,C”
2        “B,C”
3        “”

and df_membership is an object representing three sets, A (containing records 0 and 1), B (containing 1 and 2) and C (also containing 1 and 2). Record 3 is not a member of any of the sets.
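To make the string format concrete, here is a small pure-Python sketch that parses the same entries into a mapping from set name to record ids. It is only an illustration of the format, not the way setvis parses the column, and the names used are hypothetical.

```python
from collections import defaultdict

# The membership column from the example, one entry per record.
set_membership = ["A", "A,B,C", "B,C", ""]
separator = ","

# Map each set name to the ids of the records that belong to it.
members = defaultdict(list)
for record_id, entry in enumerate(set_membership):
    for name in entry.split(separator):
        if name:  # an empty entry means the record belongs to no set
            members[name].append(record_id)

# Records that are not a member of any set.
in_no_set = [i for i, entry in enumerate(set_membership) if not entry]
```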

Note

This constructor is careful never to produce a ‘dense’ dataframe, so can be useful when there are many sparse sets to avoid constructing a large intermediate dataframe.

classmethod from_postgres(conn, relation: str, key: str, schema: str | None = 'public') Membership

Construct a Membership object from a postgres database connection.

Currently, membership is determined as with ‘missingness mode’ (see from_data_frame()). An entry in the table is considered missing if it is NULL.

Note

This constructor requires psycopg2 to be installed.

Note

The database connection must have permission to create temporary tables.

Parameters:
  • conn (psycopg2 connection object) – the database connection

  • relation (str) – the name of the relation (table) in the database from which to load data

  • key (str) – the name of the column to use as the ‘record id’ (must be a unique key)

  • schema (Optional[str]) – the name of the schema to which the relation belongs (default of public, meaning the public schema).

intersections() DataFrame

Return all the distinct patterns of set membership in this Membership object.

Each element belongs to a unique intersection of, for each set, either the set itself or its complement. For N sets, 2**N such intersections are possible. This function returns all intersections with at least one element.

The value returned is a DataFrame, mapping an ‘intersection id’ to a boolean for each of the original sets, indicating whether an element must be included (True) or excluded (False) from the set to be included in the intersection.
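The difference between the 2**N possible intersections and those that are actually occupied can be sketched in pure Python. The sets below are hypothetical example data, and the dictionary layout is chosen for illustration rather than mirroring the DataFrame that intersections() returns.

```python
from itertools import product

# Hypothetical data: three sets over four records.
sets = {"A": {0, 1, 2}, "B": set(), "C": {0, 2}}
records = [0, 1, 2, 3]
names = list(sets)

# Every possible pattern of inclusion (True) / exclusion (False): 2**N of them.
all_patterns = list(product([False, True], repeat=len(names)))

# Group records by the pattern they actually exhibit.
occupied = {}
for record in records:
    pattern = tuple(record in sets[name] for name in names)
    occupied.setdefault(pattern, []).append(record)
```

With three sets there are eight possible patterns, but only three of them contain at least one record in this data.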

invert_column_selection(selection: Sequence | None) Sequence | None

Invert a selection of column names

The result is a selection containing all column names, except those contained in selection.

invert_intersection_selection(selection: Sequence[int] | None) Sequence[int] | None

Invert a selection of intersection ids

The result is a selection containing all intersection ids, except those contained in selection.

invert_record_selection(selection: Sequence[int] | None) Sequence[int] | None

Invert a selection of record ids

The result is a selection containing all record ids, except those contained in selection.

matching_intersections(intersection_spec: SetExpr) ndarray

Return all intersections (by intersection id) that are included in intersection_spec

matching_records(intersection_spec: SetExpr) ndarray

Indicate which records are contained in matching intersections

An intersection matches intersection_spec if the latter is a subset of it (or equal to it).

Return a boolean series, which is True for each index where the data matches the given intersection.

records() ndarray

An array containing all record ids

select_columns(selection: List | None = None) Membership

Return a new Membership object for a subset of the columns

A new Membership object is constructed from the columns given in selection (which must be a subset of the columns of the current object). Intersections that are identical under the selection are consolidated.

It does not make sense to compare intersection ids before and after the selection (which relate to a different number of sets).
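Consolidation can be illustrated with a small pure-Python sketch: once a column is dropped, patterns that differed only in that column collapse into one. The data and names are hypothetical, and this is not the setvis implementation.

```python
# Hypothetical patterns over sets A, B, C, keyed by intersection id.
names = ["A", "B", "C"]
intersections = {
    0: (True, False, True),
    1: (True, False, False),
    2: (False, False, False),
}
records_by_intersection = {0: [0, 2], 1: [1], 2: [3]}

# Keep only columns A and B.
selection = ["A", "B"]
keep = [names.index(name) for name in selection]

# Patterns that become identical under the selection are merged.
consolidated = {}
for intersection_id, pattern in intersections.items():
    reduced = tuple(pattern[i] for i in keep)
    consolidated.setdefault(reduced, []).extend(
        records_by_intersection[intersection_id]
    )
```

Intersections 0 and 1 differed only in C, so after the selection they form a single intersection, which is why ids from before and after cannot be compared.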

select_intersections(selection: Sequence | None = None) Membership

Return a Membership object with the given subset of intersections

A Membership object is returned, based on the intersections given in selection (which must be a subset of the intersections of the current object). A selection of None corresponds to every intersection being selected, and the original object is returned.

The id of an intersection in the returned object is the same as in the original object before the selection was taken.

select_records(selection: Sequence[int] | None = None) Membership

Return a Membership object with the given subset of records

A Membership object is returned, containing only the records given in selection. These must be records present in the current object: it is an error to select a record that isn’t present. A selection of None corresponds to every record being selected, and the original object is returned.

An ‘intersection id’ after the selection is made is consistent with the corresponding intersection before the selection. However, an intersection id before the selection may not be present after the selection (if there are no records in that particular intersection).

setvis.membership._invert_selection(universe: Sequence, selection: Sequence | None)

Invert the given selection (with respect to the universe of possible values)

A selection is either:

  • a sequence of distinct values from a ‘universe’ of possible values, or

  • the value None, meaning a selection containing the entire universe.
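A minimal sketch of these semantics follows. The function name is hypothetical, and the choice to always return an explicit list (rather than None when the result is the whole universe) is an assumption of this sketch, so the real function may behave differently at the edges.

```python
from typing import List, Optional, Sequence


def invert_selection_sketch(
    universe: Sequence, selection: Optional[Sequence]
) -> List:
    """Return the members of universe that are not in selection.

    Hypothetical helper for illustration; not the setvis implementation.
    """
    if selection is None:
        # None selects the entire universe, so its inverse selects nothing.
        return []
    chosen = set(selection)
    # Preserve the order of the universe in the result.
    return [value for value in universe if value not in chosen]


inverted = invert_selection_sketch([0, 1, 2, 3], [1, 3])
```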

setvis.membership.selection_to_series(universe: Sequence, selection: Sequence | None, sort: bool = True)

Convert a sequence of values (selection) into a boolean pd.Series indexed by the universe of possible values (as given by universe).

If sort is True (the default), the result has its index sorted.
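The conversion can be approximated with a few lines of pandas. The function below is a hypothetical re-creation for illustration, written from the description above, and may differ from the real selection_to_series in details such as dtype or error handling.

```python
from typing import Optional, Sequence

import pandas as pd


def selection_to_series_sketch(
    universe: Sequence, selection: Optional[Sequence], sort: bool = True
) -> pd.Series:
    """Boolean Series over universe, True where the value is selected.

    Hypothetical helper for illustration; not the setvis implementation.
    """
    index = pd.Index(universe)
    if selection is None:
        # None means the whole universe is selected.
        mask = [True] * len(index)
    else:
        mask = index.isin(list(selection))
    result = pd.Series(mask, index=index)
    return result.sort_index() if sort else result


selected = selection_to_series_sketch([3, 1, 2], [1, 3])
```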