cat2cat.cat2cat_utils
=====================

.. py:module:: cat2cat.cat2cat_utils

Functions
---------

.. autoapisummary::

   cat2cat.cat2cat_utils.prune_c2c
   cat2cat.cat2cat_utils.dummy_c2c

Module Contents
---------------

.. py:function:: prune_c2c(df: pandas.DataFrame, prune_fun: Callable[[numpy.ndarray], numpy.ndarray], wei_var: str = 'wei_freq_c2c', index_var: str = 'index_c2c', inplace: bool = False) -> pandas.DataFrame

   Prune rows based on their weights, which can be useful after the mapping process.

   :param df: a specific period from the cat2cat function result.
   :type df: DataFrame
   :param prune_fun: a function that takes a 1D array of weights (float) and returns a 1D boolean array of the same length. The weights are rescaled automatically so that they still sum to one for each original observation.
   :type prune_fun: callable
   :param wei_var: name of the weights column. By default "wei_freq_c2c".
   :type wei_var: str
   :param index_var: name of the index column. By default "index_c2c".
   :type index_var: str
   :param inplace: Whether to perform the operation in place. By default False.
   :type inplace: bool

   :returns: the ``df`` argument, possibly with a reduced number of rows.
   :rtype: DataFrame

   .. note::

      Example ``prune_fun`` functions:

      - non-zero: ``lambda x: x > 0``
      - highest1: ``lambda x: arange(len(x)) == argmax(x)``
      - highest: ``lambda x: x == max(x)``

   >>> from cat2cat import cat2cat
   >>> from cat2cat.cat2cat_utils import prune_c2c
   >>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings
   >>> from cat2cat.datasets import load_trans, load_occup
   >>> trans = load_trans()
   >>> occup = load_occup()
   >>> o_old = occup.loc[occup.year == 2008, :].copy()
   >>> o_new = occup.loc[occup.year == 2010, :].copy()
   >>> data_c2c = cat2cat_data(o_old, o_new, "code", "code", "year")
   >>> mappings_c2c = cat2cat_mappings(trans, "forward")
   >>> c2c = cat2cat(data_c2c, mappings_c2c)
   >>> # non-zero: keep only rows with a positive weight
   >>> prune_c2c(c2c["old"], lambda x: x > 0)
   id age sex edu exp ... index_c2c g_new_c2c rep_c2c wei_naive_c2c wei_freq_c2c
   ...

.. py:function:: dummy_c2c(df: pandas.DataFrame, cat_var: str, models: Optional[Sequence] = None, inplace: bool = False) -> pandas.DataFrame

   Add default cat2cat columns to a DataFrame.

   The function is useful for achieving consistent columns across all panel periods, even for periods to which the cat2cat procedure was not applied.

   :param df: a specific period from the cat2cat function result.
   :type df: DataFrame
   :param cat_var: name of the categorical variable.
   :type cat_var: str
   :param models: an optional list of str with the class names of the ML models applied. Disabled by default (None).
   :type models: Optional[Sequence]
   :param inplace: Whether to perform the operation in place. By default False.
   :type inplace: bool

   :returns: the ``df`` DataFrame with additional columns related to the cat2cat procedure.
             The base columns, added if they do not already exist, are: index_c2c, g_new_c2c, rep_c2c, wei_naive_c2c, wei_freq_c2c.
             Additionally, columns for the ML models are added, such as wei_MLNAME_c2c.
   :rtype: DataFrame
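
The three ``prune_fun`` variants listed for ``prune_c2c`` can be tried on a plain NumPy array, without running the cat2cat procedure. The sketch below is only an illustration: the weight values are made up, and ``renormalize`` is a hypothetical helper that mimics the documented behaviour (weights summing to one again after pruning). It is not the library's implementation.

```python
import numpy as np


def non_zero(x):
    """Keep every row with a positive weight."""
    return x > 0


def highest1(x):
    """Keep exactly one row: the first one with the maximum weight."""
    return np.arange(len(x)) == np.argmax(x)


def highest(x):
    """Keep all rows that share the maximum weight (ties are kept)."""
    return x == np.max(x)


def renormalize(x, keep):
    """Hypothetical helper: drop pruned weights and rescale the rest to sum to one."""
    kept = np.where(keep, x, 0.0)
    return kept / kept.sum()


# Made-up weights of the replicated rows of one original observation.
weights = np.array([0.4, 0.4, 0.0, 0.2])

mask_non_zero = non_zero(weights)
mask_highest1 = highest1(weights)
mask_highest = highest(weights)

# With a tie at the maximum, highest keeps both rows and highest1 only the first.
weights_highest = renormalize(weights, mask_highest)
weights_highest1 = renormalize(weights, mask_highest1)
```

With tied maxima, ``highest`` and ``highest1`` give different results, which is the main reason to choose one over the other.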