cat2cat
Submodules
Attributes
Functions
- cat2cat: Automatic mapping in a panel dataset - cat2cat procedure
- cat2cat_ml_run: Run model diagnostics before using ML-based cat2cat weights.
- summary_c2c: Adjust regression summaries fitted on replicated cat2cat data.
Package Contents
- cat2cat.__version__
- cat2cat.cat2cat(data: cat2cat.dataclass.cat2cat_data, mappings: cat2cat.dataclass.cat2cat_mappings, ml: cat2cat.dataclass.cat2cat_ml | None = None) → Dict[str, pandas.DataFrame]
Automatic mapping in a panel dataset - cat2cat procedure
- Parameters:
data (cat2cat_data) – dataclass with data-related arguments. See cat2cat.dataclass.cat2cat_data for details.
mappings (cat2cat_mappings) – dataclass with mapping-related arguments. See cat2cat.dataclass.cat2cat_mappings for details.
ml (Optional[cat2cat_ml]) – dataclass with ML-related arguments. See cat2cat.dataclass.cat2cat_ml for details.
- Returns:
Dictionary with two DataFrames, old and new. Additional columns are added, such as index_c2c, g_new_c2c, wei_freq_c2c, rep_c2c, and wei_(ml method name)_c2c. These columns are informative for only one of the two DataFrames, because the mapping is always applied in a single direction.
- Return type:
dict
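The added columns can be illustrated with a small, standard-library-only sketch of frequency-based replication. This is a simplified illustration, not the package's implementation: the `mapping` dictionary, the `replicate` helper, and the toy codes are hypothetical, and the roles given to `g_new_c2c`, `wei_freq_c2c`, and `rep_c2c` are a reading of the column names above.

```python
from collections import Counter

# Hypothetical one-to-many mapping: one old code -> candidate new codes.
mapping = {"A": ["X", "Y"], "B": ["Z"]}

# Category codes observed in the base period (toy data, not from the package).
base_period_codes = ["X", "X", "X", "Y", "Z", "Z"]
freqs = Counter(base_period_codes)


def replicate(record, mapping, freqs):
    """Return one row per candidate category with a frequency-based weight."""
    candidates = mapping[record["code"]]
    total = sum(freqs[c] for c in candidates)
    rows = []
    for cand in candidates:
        # Fall back to equal weights when no candidate was observed.
        weight = freqs[cand] / total if total else 1 / len(candidates)
        rows.append(
            {
                **record,
                "g_new_c2c": cand,  # assumed: candidate category
                "wei_freq_c2c": weight,  # assumed: frequency-based weight
                "rep_c2c": len(candidates),  # assumed: number of replications
            }
        )
    return rows


rows = replicate({"index_c2c": 0, "code": "A"}, mapping, freqs)
for row in rows:
    print(row)
```

In this sketch the weights of the replicated rows for one original observation sum to 1, so weighted statistics still count each original observation once.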
Note
1. Without the ml argument, only simple frequency-based weights are computed. If an ML model fails, the frequency-based weights are used instead. The knn method is recommended for smaller datasets.
2. Make sure the categorical variable has the same type everywhere. The columns of the mappings.trans argument and the data.cat_var column must be of the same type. When the ml part is applied, ml.cat_var must have that type too. Any type change has to be made to the mapping table and the datasets at the same time.
3. Missing values in the mapping table or the categorical variable can cause problems. Using string or float types for the mapping table and the categorical variable is recommended. Alternatively, represent missing values as a specific number (9999) or string ("Missing").
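Notes 2 and 3 can be sketched as a small normalization step applied to the mapping table and the data before calling cat2cat. The helper `normalize_code` and the toy rows below are hypothetical and use only the standard library; with pandas, the same idea would be applied to the relevant columns.

```python
import math

MISSING = "Missing"  # placeholder label, as suggested in note 3


def normalize_code(value):
    """Return a category code as a string, mapping missing values to MISSING."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return MISSING
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # 1111.0 -> "1111", so int and float codes agree
    return str(value).strip()


# Toy mapping rows (old code, new code) and toy data codes with mixed types.
trans_rows = [(1111, "111101"), (1111.0, "111102"), (None, "999999")]
old_codes = ["1111", 1111, float("nan")]

# Apply the same normalization to the mapping table and to the data.
trans_clean = [(normalize_code(o), normalize_code(n)) for o, n in trans_rows]
old_clean = [normalize_code(c) for c in old_codes]

print(trans_clean)
print(old_clean)
```

Running one function over both inputs keeps the change synchronized, which is the point of note 2.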
>>> from cat2cat import cat2cat
>>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
>>> from sklearn.ensemble import RandomForestClassifier
>>> from cat2cat.datasets import load_trans, load_occup
>>> trans = load_trans()
>>> occup = load_occup()
>>> o_old = occup.loc[occup.year == 2008, :].copy()
>>> o_new = occup.loc[occup.year == 2010, :].copy()
>>> data = cat2cat_data(old = o_old, new = o_new, cat_var_old = "code",
...                     cat_var_new = "code", time_var = "year")
>>> mappings = cat2cat_mappings(trans = trans, direction = "forward")
>>> cat2cat(data = data, mappings = mappings)
{...
- cat2cat.cat2cat_ml_run(mappings: cat2cat.dataclass.cat2cat_mappings, ml: cat2cat.dataclass.cat2cat_ml, **kwargs: Any) → cat2cat_ml_run_results
Run model diagnostics before using ML-based cat2cat weights.
This helper evaluates baseline and model-based classification quality within each mapping group and aggregates summary statistics across groups.
- Parameters:
mappings – Mapping configuration created with cat2cat_mappings.
ml – ML configuration created with cat2cat_ml.
**kwargs – Optional diagnostics settings:
- test_prop (float): test split proportion in (0, 1). Default is 0.2.
- split_seed (int): random seed for train/test split. Default is 42.
- min_match (float): minimum fraction of records in ml.data whose category appears in the mapping table. Must be in [0, 1). Default is 0.8.
- Returns:
object with per-group raw diagnostics and aggregated metrics such as mean accuracy, mean Brier score, mean P(true class), failure rates, and model-vs-baseline comparisons.
- Return type:
cat2cat_ml_run_results
- Raises:
TypeError – if mappings or ml has an invalid type.
ValueError – if kwargs names/ranges are invalid or mapping coverage is below min_match.
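The min_match check can be pictured with the following sketch. It is an assumption about the general idea, not the package's code: `mapping_coverage` and `check_coverage` are hypothetical helpers, and the codes are toy values.

```python
def mapping_coverage(categories, mapped_codes):
    """Fraction of records whose category appears among the mapped codes."""
    if not categories:
        raise ValueError("no records to check")
    known = set(mapped_codes)
    return sum(cat in known for cat in categories) / len(categories)


def check_coverage(categories, mapped_codes, min_match=0.8):
    """Raise ValueError when coverage falls below min_match, else return it."""
    if not 0 <= min_match < 1:
        raise ValueError("min_match must be in [0, 1)")
    coverage = mapping_coverage(categories, mapped_codes)
    if coverage < min_match:
        raise ValueError(
            f"mapping coverage {coverage:.2f} is below min_match={min_match}"
        )
    return coverage


# Toy data: four of the five records use a code present in the mapping table.
record_codes = ["1111", "1112", "1111", "9999", "1113"]
mapped = ["1111", "1112", "1113"]
coverage = check_coverage(record_codes, mapped, min_match=0.8)
print(coverage)
```

A low coverage usually points to a type mismatch or unhandled missing values in the categorical variable rather than to a modelling problem.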
Examples
>>> from sklearn.ensemble import RandomForestClassifier
>>> from cat2cat import cat2cat_ml_run
>>> from cat2cat.dataclass import cat2cat_mappings, cat2cat_ml
>>> from cat2cat.datasets import load_trans, load_occup
>>> trans = load_trans()
>>> occup = load_occup()
>>> data_2010 = occup.loc[occup.year == 2010, :].copy()
>>> mappings = cat2cat_mappings(trans, "backward")
>>> ml = cat2cat_ml(
...     data=data_2010,
...     cat_var="code",
...     features=["salary", "age", "edu", "sex"],
...     models=[RandomForestClassifier(n_estimators=50, random_state=1234)],
... )
>>> out = cat2cat_ml_run(mappings=mappings, ml=ml, test_prop=0.2)
>>> hasattr(out, "mean_acc")
True
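To make the reported metrics concrete, the sketch below computes accuracy, a multiclass Brier score, and mean P(true class) for one mapping group, for a frequency baseline and for a model. The `evaluate` helper, the toy labels, and the probabilities are hypothetical, and the exact metric definitions used by cat2cat_ml_run may differ (for example, in how the Brier score is normalized).

```python
def evaluate(true_labels, probs):
    """Return accuracy, mean multiclass Brier score, and mean P(true class).

    Assumes every probability dict covers all classes of the group.
    """
    n = len(true_labels)
    correct = 0
    brier = 0.0
    p_true = 0.0
    for label, dist in zip(true_labels, probs):
        predicted = max(dist, key=dist.get)  # most probable class
        correct += predicted == label
        # Squared distance between predicted probabilities and the one-hot truth.
        brier += sum((p - (cls == label)) ** 2 for cls, p in dist.items())
        p_true += dist.get(label, 0.0)
    return {"acc": correct / n, "brier": brier / n, "p_true": p_true / n}


# Toy test split for a single mapping group with two candidate classes.
y_test = ["A", "B", "A", "A"]

# Baseline: the same frequency-based distribution for every record.
baseline_probs = [{"A": 0.75, "B": 0.25}] * len(y_test)

# Model: per-record predicted probabilities (made-up values).
model_probs = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.2, "B": 0.8},
    {"A": 0.6, "B": 0.4},
    {"A": 0.4, "B": 0.6},
]

baseline = evaluate(y_test, baseline_probs)
model = evaluate(y_test, model_probs)
print(baseline)
print(model)
```

In this toy case both approaches reach the same accuracy, while the model has a lower Brier score and a higher mean P(true class). That is the kind of model-vs-baseline comparison the diagnostics are meant to surface before ML-based weights are trusted.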
- cat2cat.summary_c2c(model: Any, df_old: float, df_new: float | None = None) → pandas.DataFrame
Adjust regression summaries fitted on replicated cat2cat data.
- Parameters:
model – A fitted statsmodels-like result object with params, bse, and tvalues attributes.
df_old – Residual degrees of freedom on the original observation scale.
df_new – Residual degrees of freedom on the replicated data scale. Defaults to model.df_resid.
- Returns:
coefficient table with corrected standard errors, corrected statistics, corrected p-values, and reference distribution.
- Return type:
pandas.DataFrame
Examples
>>> from pandas import DataFrame, concat
>>> import statsmodels.api as sm
>>> from cat2cat import summary_c2c
>>> data = DataFrame({
...     "y": [2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0],
...     "x1": [1.0, 1.5, 2.0, 2.7, 3.2, 4.1, 4.8, 5.2],
...     "x2": [0, 1, 0, 1, 0, 1, 0, 1],
... })
>>> model = sm.OLS.from_formula("y ~ x1 + x2", data=data).fit()
>>> model_rep = sm.OLS.from_formula(
...     "y ~ x1 + x2", data=concat([data, data])
... ).fit()
>>> out = summary_c2c(model_rep, df_old=model.df_resid, df_new=model_rep.df_resid)
>>> all(col in out.columns for col in ["std.error_c", "statistic_c", "p.value_c"])
True
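The direction of the adjustment can be sketched with plain arithmetic. For OLS fitted on exactly duplicated rows, the naive standard errors shrink by a factor of sqrt(df_old / df_new), so multiplying them by sqrt(df_new / df_old) recovers the original-scale values. Whether summary_c2c uses exactly this formula is an assumption here, as is the standard normal reference used for the p-values; `correct_summary` and its inputs are hypothetical.

```python
import math


def correct_summary(estimates, std_errors, df_old, df_new):
    """Rescale standard errors from the replicated to the original scale.

    Assumption: the correction factor is sqrt(df_new / df_old), and p-values
    use a standard normal reference distribution.
    """
    factor = math.sqrt(df_new / df_old)
    rows = []
    for est, se in zip(estimates, std_errors):
        se_c = se * factor
        stat_c = est / se_c
        # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
        p_c = math.erfc(abs(stat_c) / math.sqrt(2))
        rows.append({"estimate": est, "std_error_c": se_c,
                     "statistic_c": stat_c, "p_value_c": p_c})
    return rows


# Degrees of freedom as in the example above: 8 rows and 3 parameters give
# df_old = 5; the doubled data has 16 rows, so df_new = 13.
# The estimate and naive standard error are made-up values.
corrected = correct_summary([2.0], [0.5], df_old=5, df_new=13)
naive_p = math.erfc(abs(2.0 / 0.5) / math.sqrt(2))
print(corrected[0], naive_p)
```

The corrected standard error is larger than the naive one and the corrected p-value is larger as well, reflecting that replicated rows do not add independent information.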