cat2cat.cat2cat_utils

Functions

`prune_c2c`(→ pandas.DataFrame)	Prune the result of the mapping process
`dummy_c2c`(→ pandas.DataFrame)	Add default cat2cat columns to a DataFrame

Module Contents

cat2cat.cat2cat_utils.prune_c2c(df: pandas.DataFrame, prune_fun: Callable[[numpy.ndarray], numpy.ndarray], wei_var: str = 'wei_freq_c2c', index_var: str = 'index_c2c', inplace: bool = False) → pandas.DataFrame

Prune the result of the mapping process, which can be useful once the mapping is done.

Parameters:

df (DataFrame) – a specific period from the cat2cat function result.
prune_fun (callable) – a function that takes a 1D array of weights (float) and returns a 1D boolean array of the same length. The remaining weights are rescaled automatically so that they still sum to one for each original observation.
wei_var (str) – name of the weights column to prune on. By default “wei_freq_c2c”.
index_var (str) – name of the column that identifies the original observations. By default “index_c2c”.
inplace (bool) – Whether to perform the operation inplace. By default False.

Returns:

The df argument with a possibly reduced number of rows.

Return type:

DataFrame

Note

non-zero prune_fun - lambda x: x > 0
highest1 prune_fun - lambda x: arange(len(x)) == argmax(x)
highest prune_fun - lambda x: x == max(x)
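
The three functions above differ only in which candidates they keep. A minimal sketch with NumPy, assuming prune_fun receives the weights of a single original observation; the weights array is a made-up example, not cat2cat output:

```python
import numpy as np

# Hypothetical weights of one original observation mapped to four candidates.
weights = np.array([0.4, 0.0, 0.4, 0.2])

non_zero = lambda x: x > 0
highest1 = lambda x: np.arange(len(x)) == np.argmax(x)
highest = lambda x: x == np.max(x)

mask_non_zero = non_zero(weights)  # keeps every candidate with a positive weight
mask_highest1 = highest1(weights)  # keeps only the first of the tied maxima
mask_highest = highest(weights)    # keeps all tied maxima

print(mask_non_zero)  # [ True False  True  True]
print(mask_highest1)  # [ True False False False]
print(mask_highest)   # [ True False  True False]
```

With ties, highest1 returns exactly one row per observation, while highest may return several.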

>>> from cat2cat import cat2cat
>>> from cat2cat.cat2cat_utils import prune_c2c
>>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings
>>> from cat2cat.datasets import load_trans, load_occup
>>> trans = load_trans()
>>> occup = load_occup()
>>> o_old = occup.loc[occup.year == 2008, :].copy()
>>> o_new = occup.loc[occup.year == 2010, :].copy()
>>> data_c2c = cat2cat_data(o_old, o_new, "code", "code", "year")
>>> mappings_c2c = cat2cat_mappings(trans, "forward")
>>> c2c = cat2cat(data_c2c, mappings_c2c)
>>> #
>>> # non-zero - lambda x: x > 0
>>> # highest1 - lambda x: arange(len(x)) == argmax(x)
>>> # highest - lambda x: x == max(x)
>>> #
>>> # non-zero
>>> prune_c2c(c2c["old"], lambda x: x > 0)
          id        age    sex  edu        exp  ...  index_c2c  g_new_c2c  rep_c2c wei_naive_c2c  wei_freq_c2c
...
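
The automatic reweighting described for prune_fun can be illustrated without the package. The sketch below is a hypothetical stand-in (prune_sketch is not part of cat2cat and may differ from the library's implementation); it assumes the mask is computed per original observation and that surviving weights are divided by their per-observation sum. The data and the 0.25 threshold are invented for illustration.

```python
import numpy as np
import pandas as pd


def prune_sketch(df, prune_fun, wei_var="wei_freq_c2c", index_var="index_c2c"):
    """Hypothetical stand-in for prune_c2c, not the library's code."""
    weights = df[wei_var].to_numpy()
    keep = np.zeros(len(df), dtype=bool)
    # Assumption: prune_fun sees the weights of one original observation at a time.
    for positions in df.groupby(index_var).indices.values():
        keep[positions] = prune_fun(weights[positions])
    pruned = df.loc[keep].copy()
    # Rescale the surviving weights so each observation sums to one again.
    totals = pruned.groupby(index_var)[wei_var].transform("sum")
    pruned[wei_var] = pruned[wei_var] / totals
    return pruned


# Made-up period: observation 0 has three candidates, observation 1 has two.
period = pd.DataFrame(
    {
        "index_c2c": [0, 0, 0, 1, 1],
        "g_new_c2c": ["a", "b", "c", "d", "e"],
        "wei_freq_c2c": [0.5, 0.3, 0.2, 1.0, 0.0],
    }
)
pruned = prune_sketch(period, lambda x: x > 0.25)
group_sums = pruned.groupby("index_c2c")["wei_freq_c2c"].sum()
print(pruned)
print(group_sums)
```

Rows "c" and "e" are dropped, and the weights of "a" and "b" are scaled up so that observation 0 still sums to one.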

cat2cat.cat2cat_utils.dummy_c2c(df: pandas.DataFrame, cat_var: str, models: Sequence | None = None, inplace: bool = False) → pandas.DataFrame

Add default cat2cat columns to a DataFrame

The function is useful for achieving consistent columns across all panel periods, even those to which the cat2cat procedure was not applied.

Parameters:

df (DataFrame) – a DataFrame for a specific period, typically one to which the cat2cat procedure was not applied.
cat_var (str) – name of the categorical variable.
models (Optional[Sequence]) – an optional list of str with the class names of the ML models applied. By default None, which turns the model columns off.
inplace (bool) – Whether to perform the operation inplace. By default False.

Returns:

The df argument with additional columns related to the cat2cat procedure. The base columns, added if they do not already exist, are: index_c2c, g_new_c2c, rep_c2c, wei_naive_c2c, wei_freq_c2c. Additionally, columns for the ML models, such as wei_MLNAME_c2c, are added.

Return type:

DataFrame
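
To make the added columns concrete, here is a minimal sketch of the behaviour described above. dummy_c2c_sketch is a hypothetical helper, not the library function, and the default values it fills in (a running index, the original category, a replication count of 1, and weights of 1) are assumptions; only the column names come from this page.

```python
import pandas as pd


def dummy_c2c_sketch(df, cat_var, models=None):
    """Hypothetical stand-in for dummy_c2c; the default values are assumptions."""
    out = df.copy()
    defaults = {
        "index_c2c": list(range(len(out))),  # assumed: one id per row
        "g_new_c2c": out[cat_var],           # assumed: category kept unchanged
        "rep_c2c": 1,                        # assumed: no replication
        "wei_naive_c2c": 1.0,                # assumed: full weight
        "wei_freq_c2c": 1.0,                 # assumed: full weight
    }
    for name, value in defaults.items():
        if name not in out.columns:  # base columns are added only if missing
            out[name] = value
    for model in models or []:
        column = f"wei_{model}_c2c"  # pattern wei_MLNAME_c2c
        if column not in out.columns:
            out[column] = 1.0
    return out


# Made-up period to which no mapping was applied.
period = pd.DataFrame({"id": [1, 2, 3], "code": ["1111", "1112", "2111"]})
result = dummy_c2c_sketch(period, "code", models=["RandomForestClassifier"])
print(result.columns.tolist())
```

After this step the period has the same cat2cat columns as a mapped one, so all periods can be concatenated into a single panel.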