cat2cat

Submodules

Attributes

__version__

Functions

cat2cat(→ Dict[str, pandas.DataFrame])

Automatic mapping in a panel dataset - cat2cat procedure

cat2cat_ml_run(→ cat2cat_ml_run_results)

Run model diagnostics before using ML-based cat2cat weights.

summary_c2c(→ pandas.DataFrame)

Adjust regression summaries fitted on replicated cat2cat data.

Package Contents

cat2cat.__version__
cat2cat.cat2cat(data: cat2cat.dataclass.cat2cat_data, mappings: cat2cat.dataclass.cat2cat_mappings, ml: cat2cat.dataclass.cat2cat_ml | None = None) → Dict[str, pandas.DataFrame]

Automatic mapping in a panel dataset - cat2cat procedure

Parameters:
  • data (cat2cat_data) – dataclass with data-related arguments. See cat2cat.dataclass.cat2cat_data for details.

  • mappings (cat2cat_mappings) – dataclass with mapping-related arguments. See cat2cat.dataclass.cat2cat_mappings for details.

  • ml (Optional[cat2cat_ml]) – dataclass with ML-related arguments. See cat2cat.dataclass.cat2cat_ml for details.

Returns:

A dict with two DataFrames, old and new. Additional columns are added, such as index_c2c, g_new_c2c, wei_freq_c2c, rep_c2c, and wei_(ml method name)_c2c. These columns are informative for only one of the two DataFrames, because the mapping is always applied in a single direction.

Return type:

dict
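
The idea behind the added columns can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the toy mapping, the counts, and the replicate helper are hypothetical, and the column semantics (index_c2c as the original row index, rep_c2c as the number of replications) are assumptions made for illustration. Each row whose category maps to several candidates is replicated once per candidate, and each replica is weighted by the candidate's relative frequency in the other period.

```python
from collections import Counter

# Hypothetical one-to-many mapping: old code -> candidate new codes.
mapping = {"A": ["A1", "A2"], "B": ["B1"]}

# Hypothetical category counts observed in the period that already
# uses the new encoding.
new_counts = Counter(["A1", "A1", "A1", "A2", "B1"])

# Hypothetical rows from the period that is being mapped.
old_rows = [{"id": 1, "code": "A"}, {"id": 2, "code": "B"}]


def replicate(rows, mapping, counts):
    """Replicate each row once per candidate category and attach
    frequency-based weights that sum to 1 within each original row."""
    out = []
    for idx, row in enumerate(rows):
        candidates = mapping[row["code"]]
        freqs = [counts.get(c, 0) for c in candidates]
        total = sum(freqs)
        for cand, freq in zip(candidates, freqs):
            # Assumed fallback: equal weights when no candidate was observed.
            weight = freq / total if total > 0 else 1 / len(candidates)
            out.append({
                **row,
                "index_c2c": idx,
                "g_new_c2c": cand,
                "wei_freq_c2c": weight,
                "rep_c2c": len(candidates),
            })
    return out


replicated = replicate(old_rows, mapping, new_counts)
for r in replicated:
    print(r["id"], r["g_new_c2c"], r["wei_freq_c2c"], r["rep_c2c"])
```

In this toy case the row with code "A" is split into two replicas with weights 0.75 and 0.25, while the row with code "B" stays a single row with weight 1.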

Note

1. Without the ml argument, only simple frequency weights are computed. When an ML model fails, the weights from simple frequencies are used instead. The knn method is recommended for smaller datasets.

2. Make sure the categorical variable has the same type everywhere. The mappings.trans columns and the data categorical variable columns (cat_var_old, cat_var_new) must be of the same type. When the ml part is applied, ml.cat_var must have that type too. Any type change has to be applied to the mapping table and the datasets at the same time.

3. Missing values in the mapping table or in the categorical variable can cause problems. Using string or float types for the mapping table and the categorical variable is recommended. Alternatively, represent missing values as a specific number (9999) or string (“Missing”).
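
Notes 2 and 3 can be handled together before calling cat2cat. The pandas sketch below uses hypothetical toy frames (the column names old, new, and code are assumptions chosen for illustration) and a hypothetical to_code_str helper. It assumes the codes are numeric, so integer conversion does not drop meaningful leading zeros.

```python
import pandas as pd

# Hypothetical data: codes stored as floats, one of them missing.
old = pd.DataFrame({"code": [1111.0, 1112.0, None], "year": 2008})

# Hypothetical mapping table that stores codes as strings.
trans = pd.DataFrame({
    "old": ["1111", "1112", "Missing"],
    "new": ["111101", "111201", "Missing"],
})


def to_code_str(series):
    """Convert a numeric code column to strings and label missing
    values explicitly instead of leaving them as NaN."""
    return series.map(lambda v: "Missing" if pd.isna(v) else str(int(v)))


# Apply the same conversion everywhere the categorical variable appears.
old["code"] = to_code_str(old["code"])

# Sanity checks: every value is a string and every code is covered
# by the mapping table.
all_strings = bool(old["code"].map(lambda v: isinstance(v, str)).all())
covered = set(old["code"]) <= set(trans["old"])
print(old["code"].tolist(), all_strings, covered)
```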

Examples

>>> from cat2cat import cat2cat
>>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings
>>> from cat2cat.datasets import load_trans, load_occup
>>> trans = load_trans()
>>> occup = load_occup()
>>> o_old = occup.loc[occup.year == 2008, :].copy()
>>> o_new = occup.loc[occup.year == 2010, :].copy()
>>> data = cat2cat_data(old = o_old, new = o_new, cat_var_old = "code",
...                     cat_var_new = "code", time_var = "year")
>>> mappings = cat2cat_mappings(trans = trans, direction = "forward")
>>> cat2cat(data = data, mappings = mappings)
{...
cat2cat.cat2cat_ml_run(mappings: cat2cat.dataclass.cat2cat_mappings, ml: cat2cat.dataclass.cat2cat_ml, **kwargs: Any) → cat2cat_ml_run_results

Run model diagnostics before using ML-based cat2cat weights.

This helper evaluates baseline and model-based classification quality within each mapping group and aggregates summary statistics across groups.

Parameters:
  • mappings – Mapping configuration created with cat2cat_mappings.

  • ml – ML configuration created with cat2cat_ml.

  • **kwargs – Optional diagnostics settings:

    • test_prop (float): test split proportion in (0, 1). Default is 0.2.

    • split_seed (int): random seed for train/test split. Default is 42.

    • min_match (float): minimum fraction of records in ml.data whose category appears in the mapping table. Must be in [0, 1). Default is 0.8.

Returns:

object with per-group raw diagnostics and aggregated metrics such as mean accuracy, mean Brier score, mean P(true class), failure rates, and model-vs-baseline comparisons.

Return type:

cat2cat_ml_run_results

Raises:
  • TypeError – if mappings or ml has invalid type.

  • ValueError – if kwargs names/ranges are invalid or mapping coverage is below min_match.
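
For orientation, the metrics named under Returns can be sketched in plain Python as they are commonly defined. This is an illustration only: the toy labels and probabilities are made up, the classification_metrics helper is hypothetical, and the library's exact definitions (for example, how the Brier score is normalized) may differ.

```python
# Hypothetical true classes and predicted class probabilities
# for three held-out observations.
y_true = ["a", "b", "b"]
probs = [
    {"a": 0.8, "b": 0.2},
    {"a": 0.6, "b": 0.4},
    {"a": 0.1, "b": 0.9},
]


def classification_metrics(y_true, probs):
    """Return accuracy, multiclass Brier score, and mean P(true class)."""
    n = len(y_true)
    correct = 0
    brier_total = 0.0
    p_true_total = 0.0
    for truth, p in zip(y_true, probs):
        # Predicted class is the one with the highest probability.
        predicted = max(p, key=p.get)
        correct += predicted == truth
        # Squared distance between the probability vector and the
        # one-hot encoding of the true class.
        brier_total += sum(
            (prob - (1.0 if cls == truth else 0.0)) ** 2
            for cls, prob in p.items()
        )
        p_true_total += p.get(truth, 0.0)
    return {
        "acc": correct / n,
        "brier": brier_total / n,
        "p_true": p_true_total / n,
    }


metrics = classification_metrics(y_true, probs)
print(metrics)
```

Higher accuracy and mean P(true class) are better, while a lower Brier score is better. Comparing these values between a model and a frequency-only baseline is one way to read the model-vs-baseline comparisons.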

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from cat2cat import cat2cat_ml_run
>>> from cat2cat.dataclass import cat2cat_mappings, cat2cat_ml
>>> from cat2cat.datasets import load_trans, load_occup
>>> trans = load_trans()
>>> occup = load_occup()
>>> data_2010 = occup.loc[occup.year == 2010, :].copy()
>>> mappings = cat2cat_mappings(trans, "backward")
>>> ml = cat2cat_ml(
...     data=data_2010,
...     cat_var="code",
...     features=["salary", "age", "edu", "sex"],
...     models=[RandomForestClassifier(n_estimators=50, random_state=1234)],
... )
>>> out = cat2cat_ml_run(mappings=mappings, ml=ml, test_prop=0.2)
>>> hasattr(out, "mean_acc")
True
cat2cat.summary_c2c(model: Any, df_old: float, df_new: float | None = None) → pandas.DataFrame

Adjust regression summaries fitted on replicated cat2cat data.

Parameters:
  • model – A fitted statsmodels-like result object with params, bse, and tvalues attributes.

  • df_old – Residual degrees of freedom on the original observation scale.

  • df_new – Residual degrees of freedom on the replicated data scale. Defaults to model.df_resid.

Returns:

coefficient table with corrected standard errors, corrected statistics, corrected p-values, and reference distribution.

Return type:

pandas.DataFrame
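
Why a correction is needed can be seen in a small pure-Python sketch. It assumes, without confirmation from this page, that the correction rescales standard errors by sqrt(df_new / df_old). The slope_and_se helper is hypothetical and fits a simple one-predictor regression. When every row is duplicated, the coefficient is unchanged but the naive standard error shrinks, and the assumed rescaling recovers the original value exactly.

```python
import math


def slope_and_se(x, y):
    """Fit y = a + b*x by least squares and return the slope,
    its standard error, and the residual degrees of freedom."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    df_resid = n - 2
    se = math.sqrt(rss / df_resid / sxx)
    return slope, se, df_resid


x = [1.0, 1.5, 2.0, 2.7, 3.2, 4.1, 4.8, 5.2]
y = [2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0]

# Fit on the original observations and on a copy with every row duplicated.
slope_old, se_old, df_old = slope_and_se(x, y)
slope_rep, se_rep, df_new = slope_and_se(x + x, y + y)

# Assumed correction: inflate the replicated-data standard error.
se_corrected = se_rep * math.sqrt(df_new / df_old)
stat_corrected = slope_rep / se_corrected

print(se_old, se_rep, se_corrected, stat_corrected)
```

The corrected statistic would then be compared against a t distribution with df_old degrees of freedom. That step is omitted here because the standard library has no t distribution.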

Examples

>>> from pandas import DataFrame, concat
>>> import statsmodels.api as sm
>>> from cat2cat import summary_c2c
>>> data = DataFrame({
...     "y": [2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0],
...     "x1": [1.0, 1.5, 2.0, 2.7, 3.2, 4.1, 4.8, 5.2],
...     "x2": [0, 1, 0, 1, 0, 1, 0, 1],
... })
>>> model = sm.OLS.from_formula("y ~ x1 + x2", data=data).fit()
>>> model_rep = sm.OLS.from_formula(
...     "y ~ x1 + x2", data=concat([data, data])
... ).fit()
>>> out = summary_c2c(model_rep, df_old=model.df_resid, df_new=model_rep.df_resid)
>>> all(col in out.columns for col in ["std.error_c", "statistic_c", "p.value_c"])
True