cat2cat.cat2cat_utils
Functions
prune_c2c | Prune rows after the mapping process by filtering on their weights
dummy_c2c | Add default cat2cat columns to a DataFrame
Module Contents
- cat2cat.cat2cat_utils.prune_c2c(df: pandas.DataFrame, prune_fun: Callable[[numpy.ndarray], numpy.ndarray], wei_var: str = 'wei_freq_c2c', index_var: str = 'index_c2c', inplace: bool = False) pandas.DataFrame
Prune rows after the mapping process by filtering on their weights
- Parameters:
df (DataFrame) – a specific period from the cat2cat function result.
prune_fun (callable) – a function that takes a 1D array of weights (float) and returns a 1D boolean array of the same length. The remaining weights are rescaled automatically so that they still sum to one for each original observation.
wei_var (str) – name of the weights column. By default “wei_freq_c2c”.
index_var (str) – name of the column identifying each original observation. By default “index_c2c”.
inplace (bool) – Whether to perform the operation inplace. By default False.
- Returns:
df argument with possibly reduced number of rows.
- Return type:
DataFrame
Note
non-zero prune_fun - lambda x: x > 0
highest1 prune_fun - lambda x: arange(len(x)) == argmax(x)
highest prune_fun - lambda x: x == max(x)
>>> from cat2cat import cat2cat
>>> from cat2cat.cat2cat_utils import prune_c2c
>>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
>>> from sklearn.ensemble import RandomForestClassifier
>>> from cat2cat.datasets import load_trans, load_occup
>>> trans = load_trans()
>>> occup = load_occup()
>>> o_old = occup.loc[occup.year == 2008, :].copy()
>>> o_new = occup.loc[occup.year == 2010, :].copy()
>>> data_c2c = cat2cat_data(o_old, o_new, "code", "code", "year")
>>> mappings_c2c = cat2cat_mappings(trans, "forward")
>>> c2c = cat2cat(data_c2c, mappings_c2c)
>>> #
>>> # non-zero - lambda x: x > 0
>>> # highest1 - lambda x: arange(len(x)) == argmax(x)
>>> # highest - lambda x: x == max(x)
>>> #
>>> # non-zero
>>> prune_c2c(c2c["old"], lambda x: x > 0)
id age sex edu exp ... index_c2c g_new_c2c rep_c2c wei_naive_c2c wei_freq_c2c ...
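The effect of the three prune_fun variants from the note can be illustrated on a toy table. This is a minimal sketch of the idea only, not the library's implementation: prune_sketch and the toy DataFrame are hypothetical, and the sketch assumes that pruning is applied separately to the weights of each original observation (grouped by index_c2c) before the kept weights are rescaled to sum to one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one period of a cat2cat result (hypothetical data):
# two original observations (index_c2c 0 and 1), each spread over candidates.
df = pd.DataFrame({
    "index_c2c": [0, 0, 0, 1, 1],
    "g_new_c2c": ["a", "b", "c", "a", "d"],
    "wei_freq_c2c": [0.5, 0.5, 0.0, 0.7, 0.3],
})

# The three prune functions listed in the note.
non_zero = lambda x: x > 0
highest1 = lambda x: np.arange(len(x)) == np.argmax(x)
highest = lambda x: x == np.max(x)


def prune_sketch(df, prune_fun, wei_var="wei_freq_c2c", index_var="index_c2c"):
    """Hypothetical helper: keep the rows selected by prune_fun within each
    original observation, then rescale the kept weights to sum to one."""
    keep = np.zeros(len(df), dtype=bool)
    weights = df[wei_var].to_numpy()
    # groupby(...).indices maps each group key to its positional row indices
    for positions in df.groupby(index_var).indices.values():
        keep[positions] = prune_fun(weights[positions])
    out = df.loc[keep].copy()
    out[wei_var] = out[wei_var] / out.groupby(index_var)[wei_var].transform("sum")
    return out


pruned_non_zero = prune_sketch(df, non_zero)   # drops only the zero-weight row
pruned_highest1 = prune_sketch(df, highest1)   # exactly one row per observation
pruned_highest = prune_sketch(df, highest)     # keeps ties for the maximum
```

In this toy data the first observation has a tie at 0.5, so highest keeps both tied rows while highest1 keeps only the first of them.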
- cat2cat.cat2cat_utils.dummy_c2c(df: pandas.DataFrame, cat_var: str, models: Sequence | None = None, inplace: bool = False) pandas.DataFrame
Add default cat2cat columns to a DataFrame
The function is useful for achieving consistent columns across all panel periods, even for periods to which the cat2cat procedure was not applied.
- Parameters:
df (DataFrame) – a specific period from the cat2cat function result.
cat_var (str) – name of the categorical variable.
models (Optional[Sequence]) – an optional list of str with the class names of the ML models applied. By default None (turned off).
inplace (bool) – Whether to perform the operation inplace. By default False.
- Returns:
The df argument with additional columns related to the cat2cat procedure. The base columns, added only if they do not already exist, are: index_c2c, g_new_c2c, rep_c2c, wei_naive_c2c, wei_freq_c2c. Additionally, one column per ML model is added, named like wei_MLNAME_c2c.
- Return type:
DataFrame
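The idea of filling in missing cat2cat columns can be sketched in plain pandas. This is an illustration under stated assumptions, not the library's code: add_default_columns is a hypothetical helper, and the default values used here (a running row index, the unchanged category, one replication, weights of 1) are assumptions about what a neutral default would look like for a period that was never mapped.

```python
import pandas as pd


def add_default_columns(df, cat_var, models=None):
    """Hypothetical helper: add cat2cat-style columns that are missing.

    All default values below are assumptions made for this sketch.
    """
    out = df.copy()
    defaults = {
        "index_c2c": list(range(len(out))),  # assumed: one id per row
        "g_new_c2c": out[cat_var],           # assumed: category left unchanged
        "rep_c2c": 1,                        # assumed: each row appears once
        "wei_naive_c2c": 1.0,                # assumed: full weight
        "wei_freq_c2c": 1.0,                 # assumed: full weight
    }
    # One weight column per ML model class name, following wei_MLNAME_c2c
    for name in models or []:
        defaults[f"wei_{name}_c2c"] = 1.0
    # Add only the columns that do not already exist
    for col, value in defaults.items():
        if col not in out.columns:
            out[col] = value
    return out


# Toy period (hypothetical data) that already carries one cat2cat column
period = pd.DataFrame({"code": ["1111", "2222"], "wei_freq_c2c": [0.4, 0.6]})
result = add_default_columns(period, "code", models=["RandomForestClassifier"])
```

The existing wei_freq_c2c column is left untouched, which mirrors the "if not already exist" behaviour described for the returned DataFrame.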