cat2cat
=======

.. py:module:: cat2cat


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/cat2cat/cat2cat/index
   /autoapi/cat2cat/cat2cat_ml/index
   /autoapi/cat2cat/cat2cat_ml_utils/index
   /autoapi/cat2cat/cat2cat_utils/index
   /autoapi/cat2cat/data/index
   /autoapi/cat2cat/dataclass/index
   /autoapi/cat2cat/datasets/index
   /autoapi/cat2cat/mappings/index
   /autoapi/cat2cat/summary/index


Attributes
----------

.. autoapisummary::

   cat2cat.__version__


Functions
---------

.. autoapisummary::

   cat2cat.cat2cat
   cat2cat.cat2cat_ml_run
   cat2cat.summary_c2c


Package Contents
----------------

.. py:data:: __version__

.. py:function:: cat2cat(data: cat2cat.dataclass.cat2cat_data, mappings: cat2cat.dataclass.cat2cat_mappings, ml: Optional[cat2cat.dataclass.cat2cat_ml] = None) -> Dict[str, pandas.DataFrame]

   Automatic mapping in a panel dataset - cat2cat procedure

   :param data: dataclass with data-related arguments.
                See `cat2cat.dataclass.cat2cat_data` for more information.
   :type data: cat2cat_data
   :param mappings: dataclass with mappings-related arguments.
                    See `cat2cat.dataclass.cat2cat_mappings` for more information.
   :type mappings: cat2cat_mappings
   :param ml: dataclass with ml-related arguments.
              See `cat2cat.dataclass.cat2cat_ml` for more information.
   :type ml: Optional[cat2cat_ml]

   :returns: dict with 2 DataFrames, old and new. Additional columns are added,
             such as index_c2c, g_new_c2c, wei_freq_c2c, rep_c2c, and
             wei_(ml method name)_c2c. The additional columns are informative
             for only one of the two DataFrames, because changes are always
             made in one direction.
   :rtype: dict

   .. note::

      1. Without the ml section, only simple frequencies are assessed.
         When an ml model fails, the weights from simple frequencies are used instead.
         The `knn` method is recommended for smaller datasets.
      2. Make sure that the categorical variable has the same type everywhere.
         The `mappings.trans` arg columns and the `data.cat_var` column have to be of the same type.
         When the ml part is applied, `ml.cat_var` has to have the same type too.
         Any type change has to be applied to the mapping table and the datasets at the same time.
      3. Missing values in the mapping table or the categorical variable can cause problems.
         It is recommended to use string or float types in the mapping table and for the categorical variable.
         An alternative is to represent missing values as a specific number (9999) or string ("Missing").

   >>> from cat2cat import cat2cat
   >>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
   >>> from sklearn.ensemble import RandomForestClassifier
   >>> from cat2cat.datasets import load_trans, load_occup
   >>> trans = load_trans()
   >>> occup = load_occup()
   >>> o_old = occup.loc[occup.year == 2008, :].copy()
   >>> o_new = occup.loc[occup.year == 2010, :].copy()
   >>> data = cat2cat_data(old = o_old, new = o_new, cat_var_old = "code",
   ...                     cat_var_new = "code", time_var = "year")
   >>> mappings = cat2cat_mappings(trans = trans, direction = "forward")
   >>> cat2cat(data = data, mappings = mappings)
   {...


.. py:function:: cat2cat_ml_run(mappings: cat2cat.dataclass.cat2cat_mappings, ml: cat2cat.dataclass.cat2cat_ml, **kwargs: Any) -> cat2cat_ml_run_results

   Run model diagnostics before using ML-based cat2cat weights.

   This helper evaluates baseline and model-based classification quality
   within each mapping group and aggregates summary statistics across groups.

   :param mappings: Mapping configuration created with ``cat2cat_mappings``.
   :param ml: ML configuration created with ``cat2cat_ml``.
   :param \*\*kwargs: Optional diagnostics settings:

      - ``test_prop`` (float): test split proportion in ``(0, 1)``.
        Default is ``0.2``.
      - ``split_seed`` (int): random seed for train/test split.
        Default is ``42``.
      - ``min_match`` (float): minimum fraction of records in ``ml.data``
        whose category appears in the mapping table. Must be in ``[0, 1)``.
        Default is ``0.8``.

   :returns: object with per-group raw diagnostics and aggregated metrics
             such as mean accuracy, mean Brier score, mean P(true class),
             failure rates, and model-vs-baseline comparisons.
   :rtype: cat2cat_ml_run_results

   :raises TypeError: if ``mappings`` or ``ml`` has an invalid type.
   :raises ValueError: if kwargs names/ranges are invalid or mapping coverage
       is below ``min_match``.

   .. rubric:: Examples

   >>> from sklearn.ensemble import RandomForestClassifier
   >>> from cat2cat import cat2cat_ml_run
   >>> from cat2cat.dataclass import cat2cat_mappings, cat2cat_ml
   >>> from cat2cat.datasets import load_trans, load_occup
   >>> trans = load_trans()
   >>> occup = load_occup()
   >>> data_2010 = occup.loc[occup.year == 2010, :].copy()
   >>> mappings = cat2cat_mappings(trans, "backward")
   >>> ml = cat2cat_ml(
   ...     data=data_2010,
   ...     cat_var="code",
   ...     features=["salary", "age", "edu", "sex"],
   ...     models=[RandomForestClassifier(n_estimators=50, random_state=1234)],
   ... )
   >>> out = cat2cat_ml_run(mappings=mappings, ml=ml, test_prop=0.2)
   >>> hasattr(out, "mean_acc")
   True


.. py:function:: summary_c2c(model: Any, df_old: float, df_new: Optional[float] = None) -> pandas.DataFrame

   Adjust regression summaries fitted on replicated cat2cat data.

   :param model: A fitted statsmodels-like result object with ``params``,
                 ``bse``, and ``tvalues`` attributes.
   :param df_old: Residual degrees of freedom on the original observation scale.
   :param df_new: Residual degrees of freedom on the replicated data scale.
                  Defaults to ``model.df_resid``.

   :returns: coefficient table with corrected standard errors, corrected
             statistics, corrected p-values, and reference distribution.
   :rtype: pandas.DataFrame

   .. rubric:: Examples

   >>> from pandas import DataFrame, concat
   >>> import statsmodels.api as sm
   >>> from cat2cat import summary_c2c
   >>> data = DataFrame({
   ...     "y": [2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0],
   ...     "x1": [1.0, 1.5, 2.0, 2.7, 3.2, 4.1, 4.8, 5.2],
   ...     "x2": [0, 1, 0, 1, 0, 1, 0, 1],
   ... })
   >>> model = sm.OLS.from_formula("y ~ x1 + x2", data=data).fit()
   >>> model_rep = sm.OLS.from_formula(
   ...     "y ~ x1 + x2", data=concat([data, data])
   ... ).fit()
   >>> out = summary_c2c(model_rep, df_old=model.df_resid, df_new=model_rep.df_resid)
   >>> all(col in out.columns for col in ["std.error_c", "statistic_c", "p.value_c"])
   True
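
The note in the ``cat2cat`` function documentation asks that the categorical variable have the same type in the mapping table and in the datasets, and suggests representing missing values with a label such as "Missing". The following is a minimal sketch of one way to prepare data along those lines, using only pandas. The helper ``to_code_str`` and the column names ``old``, ``new``, and ``code`` are hypothetical and chosen for illustration; they are not part of the cat2cat API.

```python
import pandas as pd

# Hypothetical mapping table: one column was read as float (it contains a
# missing value), the other as str. Column names are illustrative only.
trans = pd.DataFrame({"old": [1111, 1112, None], "new": ["1111", "1121", "1122"]})

# Hypothetical dataset whose categorical column "code" was read as float.
df = pd.DataFrame({"code": [1111.0, None, 1112.0], "year": [2008, 2008, 2008]})


def to_code_str(col: pd.Series, missing: str = "Missing") -> pd.Series:
    """Return the column as strings, with missing values replaced by a label."""

    def convert(value):
        if pd.isna(value):
            return missing
        # Drop the trailing ".0" that integer-like floats would otherwise keep.
        if isinstance(value, float) and value.is_integer():
            return str(int(value))
        return str(value)

    return col.map(convert)


# Apply the same conversion to the mapping table and the dataset together,
# so the categorical variable has one type in all places.
trans["old"] = to_code_str(trans["old"])
trans["new"] = to_code_str(trans["new"])
df["code"] = to_code_str(df["code"])
```

Applying a single conversion function to every place the categorical variable appears is a design choice that keeps the mapping table and the datasets in step; any other consistent type, such as float, would serve the same purpose.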