cat2cat

Submodules

Attributes

__version__

Functions

`cat2cat`(→ Dict[str, pandas.DataFrame])	Automatic mapping in a panel dataset - cat2cat procedure
`cat2cat_ml_run`(→ cat2cat_ml_run_results)	Automatic mapping in a panel dataset - cat2cat procedure

Package Contents

cat2cat.__version__

cat2cat.cat2cat(data: cat2cat.dataclass.cat2cat_data, mappings: cat2cat.dataclass.cat2cat_mappings, ml: cat2cat.dataclass.cat2cat_ml | None = None) → Dict[str, pandas.DataFrame]

Automatic mapping in a panel dataset - cat2cat procedure

Parameters:

data (cat2cat_data) – dataclass with data-related arguments. See cat2cat.dataclass.cat2cat_data for details.
mappings (cat2cat_mappings) – dataclass with mapping-related arguments. See cat2cat.dataclass.cat2cat_mappings for details.
ml (Optional[cat2cat_ml]) – dataclass with ml-related arguments. See cat2cat.dataclass.cat2cat_ml for details.

Returns:

A dict with two DataFrames, old and new. Additional columns are added, such as index_c2c, g_new_c2c, wei_freq_c2c, rep_c2c and wei_(ml method name)_c2c. These columns are informative in only one of the two DataFrames, because the mapping is always applied in a single direction.

Return type:

dict
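
As a rough illustration of how the added columns might be used downstream, the sketch below sums a weight column per mapped category. The rows are hand-made toy values, not output of cat2cat. Only the column names come from the description above. The meaning assumed for them here (one original observation replicated once per candidate category, with weights summing to 1 per observation) is an assumption for the sake of the example.

```python
from collections import defaultdict

# Hypothetical rows: observation 0 is replicated for two candidate
# categories, observation 1 maps one-to-one. Values are illustrative only.
rows = [
    {"index_c2c": 0, "g_new_c2c": "A1", "wei_freq_c2c": 0.75, "rep_c2c": 2},
    {"index_c2c": 0, "g_new_c2c": "A2", "wei_freq_c2c": 0.25, "rep_c2c": 2},
    {"index_c2c": 1, "g_new_c2c": "A1", "wei_freq_c2c": 1.0, "rep_c2c": 1},
]


def weighted_counts(rows, weight_col="wei_freq_c2c"):
    """Sum the chosen weight column for each mapped category."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["g_new_c2c"]] += row[weight_col]
    return dict(totals)


counts = weighted_counts(rows)
print(counts)
```

Under these assumptions the weighted counts add up to the number of original observations, so replicated rows do not inflate totals as long as the weights are applied.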

Note

1. Without the ml argument, only simple frequencies are used to build the weights. If an ml model fails, the weights from simple frequencies are used instead. The knn method is recommended for smaller datasets.

2. Make sure the categorical variable has the same type everywhere. The columns of mappings.trans and the data.cat_var column must have the same type. When the ml part is used, ml.cat_var must have that type too. Any type change has to be applied to the mapping table and the datasets at the same time.

3. Missing values in the mapping table or the categorical variable can cause problems. Using string or float types for the mapping table and the categorical variable is recommended. Alternatively, represent missing values with a specific number (9999) or string (“Missing”).
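
Notes 2 and 3 can be sketched in plain Python as follows. The helper as_category_strings and the toy code lists are hypothetical and not part of cat2cat. The sketch only shows the idea of casting every category column to one type and giving missing values an explicit label before the mapping is applied.

```python
def as_category_strings(values, missing_label="Missing"):
    """Cast category codes to str, replacing None/NaN with a label."""
    out = []
    for value in values:
        is_nan = isinstance(value, float) and value != value  # NaN != NaN
        out.append(missing_label if value is None or is_nan else str(value))
    return out


# Toy inputs with mixed types (illustrative values only).
data_codes = [1111, 1112, None]
trans_old = ["1111", "1112", "1112"]
trans_new = [111101, 111201, float("nan")]

# Apply the same conversion to the dataset column and both mapping columns.
columns = [as_category_strings(c) for c in (data_codes, trans_old, trans_new)]
same_type = all(isinstance(v, str) for col in columns for v in col)
print(columns, same_type)
```

With pandas columns the same effect is usually achieved by casting each column to a common dtype. Whichever approach is used, the conversion has to be applied to the datasets and the mapping table together.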

>>> from cat2cat import cat2cat
>>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
>>> from sklearn.ensemble import RandomForestClassifier
>>> from cat2cat.datasets import load_trans, load_occup
>>> trans = load_trans()
>>> occup = load_occup()
>>> o_old = occup.loc[occup.year == 2008, :].copy()
>>> o_new = occup.loc[occup.year == 2010, :].copy()
>>> data = cat2cat_data(old = o_old, new = o_new, cat_var_old = "code",
...                     cat_var_new = "code", time_var = "year")
>>> mappings = cat2cat_mappings(trans = trans, direction = "forward")
>>> cat2cat(data = data, mappings = mappings)
{...

cat2cat.cat2cat_ml_run(mappings: cat2cat.dataclass.cat2cat_mappings, ml: cat2cat.dataclass.cat2cat_ml, **kwargs: Any) → cat2cat_ml_run_results

Automatic mapping in a panel dataset - cat2cat procedure

Parameters:

mappings (cat2cat_mappings) – dataclass with mapping-related arguments. See cat2cat.dataclass.cat2cat_mappings for details.
ml (cat2cat_ml) – dataclass with ml-related arguments. See cat2cat.dataclass.cat2cat_ml for details.
**kwargs – additional keyword arguments accepted by cat2cat_ml_run:
min_match (float) – minimum share of categories from the base period that have to be matched in the mapping table. Between 0 and 1. Default 0.8.
test_prop (float) – share of the data used for testing. Between 0 and 1. Default 0.2.
split_seed (int) – random seed for the train_test_split function. Default 42.
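
To make the min_match threshold concrete, here is a minimal sketch of a matched-share check. The function matched_share and the toy codes are hypothetical. The sketch assumes the share is computed over unique categories, which this page does not state, so the library's actual computation may differ.

```python
def matched_share(base_categories, mapping_codes):
    """Share of unique base-period categories found in the mapping table."""
    unique = set(base_categories)
    if not unique:
        return 0.0
    return len(unique & set(mapping_codes)) / len(unique)


# Toy codes: four of the five base-period categories appear in the mapping.
base = ["1111", "1112", "1113", "1114", "9999"]
mapping = ["1111", "1112", "1113", "1114"]

share = matched_share(base, mapping)
meets_default = share >= 0.8  # compare against the documented default
print(share, meets_default)
```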

Returns:

cat2cat_ml_run_results

Note

See cat2cat.cat2cat.cat2cat for more information.

>>> from cat2cat import cat2cat
>>> from cat2cat.cat2cat_ml import cat2cat_ml_run
>>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> from sklearn.tree import DecisionTreeClassifier
>>> from cat2cat.datasets import load_trans, load_occup
>>> trans = load_trans()
>>> occup = load_occup()
>>> o_old = occup.loc[occup.year == 2008, :].copy()
>>> o_new = occup.loc[occup.year == 2010, :].copy()
>>> mappings = cat2cat_mappings(trans = trans, direction = "backward")
>>> ml = cat2cat_ml(
...    occup.loc[occup.year >= 2010, :].copy(),
...    "code",
...    ["salary", "age", "edu", "sex"],
...    [DecisionTreeClassifier(random_state=1234), LinearDiscriminantAnalysis()]
... )
>>> cat2cat_ml_run(mappings = mappings, ml = ml)
...