cat2cat
=======

.. py:module:: cat2cat


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/cat2cat/cat2cat/index
   /autoapi/cat2cat/cat2cat_ml/index
   /autoapi/cat2cat/cat2cat_utils/index
   /autoapi/cat2cat/data/index
   /autoapi/cat2cat/dataclass/index
   /autoapi/cat2cat/datasets/index
   /autoapi/cat2cat/mappings/index


Attributes
----------

.. autoapisummary::

   cat2cat.__version__


Functions
---------

.. autoapisummary::

   cat2cat.cat2cat
   cat2cat.cat2cat_ml_run


Package Contents
----------------

.. py:data:: __version__

.. py:function:: cat2cat(data: cat2cat.dataclass.cat2cat_data, mappings: cat2cat.dataclass.cat2cat_mappings, ml: Optional[cat2cat.dataclass.cat2cat_ml] = None) -> Dict[str, pandas.DataFrame]

   Automatic mapping in a panel dataset - cat2cat procedure

   :param data: dataclass with data-related arguments.
                See `cat2cat.dataclass.cat2cat_data` for more information.
   :type data: cat2cat_data
   :param mappings: dataclass with mappings-related arguments.
                    See `cat2cat.dataclass.cat2cat_mappings` for more information.
   :type mappings: cat2cat_mappings
   :param ml: dataclass with ml-related arguments.
              See `cat2cat.dataclass.cat2cat_ml` for more information.
   :type ml: Optional[cat2cat_ml]

   :returns: dict with two DataFrames, old and new.
             Additional columns are added, such as index_c2c, g_new_c2c, wei_freq_c2c, rep_c2c and wei_(ml method name)_c2c.
             These columns are informative for only one of the DataFrames, as the changes are always made in one direction.
   :rtype: dict

   .. note::

      1. Without the ml argument only simple frequencies are assessed.
         When an ml model fails, the weights from simple frequencies are used instead.
         The `knn` method is recommended for smaller datasets.
      2. Make sure that the categorical variable has the same type in all places.
         The columns of the `mappings.trans` argument and the `data.cat_var` column have to be of the same type.
         When the ml part is applied, `ml.cat_var` has to have the same type too.
         Changes have to be made at the same time for the mapping table and the datasets.
      3. Missing values in the mapping table or in the categorical variable can cause problems.
         It is recommended to use string or float types in the mapping table and for the categorical variable.
         An alternative is to represent missing values as a specific number (9999) or string ("Missing").

   >>> from cat2cat import cat2cat
   >>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
   >>> from sklearn.ensemble import RandomForestClassifier
   >>> from cat2cat.datasets import load_trans, load_occup
   >>> trans = load_trans()
   >>> occup = load_occup()
   >>> o_old = occup.loc[occup.year == 2008, :].copy()
   >>> o_new = occup.loc[occup.year == 2010, :].copy()
   >>> data = cat2cat_data(old = o_old, new = o_new, cat_var_old = "code",
   ...                      cat_var_new = "code", time_var = "year")
   >>> mappings = cat2cat_mappings(trans = trans, direction = "forward")
   >>> cat2cat(data = data, mappings = mappings)
   {...


.. py:function:: cat2cat_ml_run(mappings: cat2cat.dataclass.cat2cat_mappings, ml: cat2cat.dataclass.cat2cat_ml, **kwargs: Any) -> cat2cat_ml_run_results

   Automatic mapping in a panel dataset - cat2cat procedure

   :param mappings: dataclass with mappings-related arguments.
                    See `cat2cat.dataclass.cat2cat_mappings` for more information.
   :type mappings: cat2cat_mappings
   :param ml: dataclass with ml-related arguments.
              See `cat2cat.dataclass.cat2cat_ml` for more information.
   :type ml: cat2cat_ml
   :param \*\*kwargs: additional arguments passed to the `cat2cat_ml_run` function:

       - min_match (float): minimum share of categories from the base period that have to be matched in the mapping table. Between 0 and 1. Default 0.8.
       - test_prop (float): share of the data used for testing. Between 0 and 1. Default 0.2.
       - split_seed (int): random seed for the train_test_split function. Default 42.

   :returns: cat2cat_ml_run_results

   .. note::

      See `cat2cat.cat2cat.cat2cat` for more information.

   >>> from cat2cat import cat2cat
   >>> from cat2cat.cat2cat_ml import cat2cat_ml_run
   >>> from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml
   >>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
   >>> from sklearn.tree import DecisionTreeClassifier
   >>> from cat2cat.datasets import load_trans, load_occup
   >>> trans = load_trans()
   >>> occup = load_occup()
   >>> o_old = occup.loc[occup.year == 2008, :].copy()
   >>> o_new = occup.loc[occup.year == 2010, :].copy()
   >>> mappings = cat2cat_mappings(trans = trans, direction = "backward")
   >>> ml = cat2cat_ml(
   ...     occup.loc[occup.year >= 2010, :].copy(),
   ...     "code",
   ...     ["salary", "age", "edu", "sex"],
   ...     [DecisionTreeClassifier(random_state=1234), LinearDiscriminantAnalysis()]
   ... )
   >>> cat2cat_ml_run(mappings = mappings, ml = ml)
   ...
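

The second and third notes on the ``cat2cat`` function ask for one consistent type across the mapping table and the categorical variable, and suggest a sentinel such as "Missing" for missing values. The sketch below shows one possible way to prepare the inputs with plain pandas before building the dataclasses. The toy frames, the column names ``old``, ``new`` and ``code``, and the helper ``normalize_codes`` are assumptions made for illustration; they are not part of the cat2cat API, and the package may not need this exact preprocessing.

```python
import pandas as pd

# Hypothetical toy data (NOT the packaged datasets): the mapping table
# stores codes as strings, the survey stores them as numbers, and both
# contain a missing value.
trans = pd.DataFrame(
    {"old": ["1111", "1112", None], "new": ["111101", "111102", "111103"]}
)
occup = pd.DataFrame({"code": [1111, 1112, None], "year": [2008, 2008, 2008]})

MISSING = "Missing"  # sentinel string, as suggested in the note


def normalize_codes(col: pd.Series) -> pd.Series:
    """Return the codes as strings, with missing values replaced by MISSING.

    Illustrative helper, not a cat2cat function.
    """

    def one(value):
        if pd.isna(value):
            return MISSING
        # A numeric column holding a missing value is stored as float,
        # so 1111 arrives as 1111.0; drop the spurious ".0".
        if isinstance(value, float) and value.is_integer():
            return str(int(value))
        return str(value)

    return col.map(one)


# Apply the same transformation to the mapping table and the dataset at
# the same time, so that the types stay aligned.
trans["old"] = normalize_codes(trans["old"])
trans["new"] = normalize_codes(trans["new"])
occup["code"] = normalize_codes(occup["code"])

print(occup["code"].tolist())
print(trans["old"].tolist())
```

Running the same helper over every code column is a deliberate choice here: it keeps the mapping table and the datasets in step, which is what the second note asks for.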