# cat2cat

## About

Unifying an inconsistent coded categorical variable in a panel/longtitudal dataset

There is offered the cat2cat procedure to map a categorical variable according to a mapping (transition) table between two different time points. The mapping (transition) table should to have a candidate for each category from the targeted for an update period. The main rule is to replicate the observation if it could be assigned to a few categories, then using simple frequencies or statistical methods to approximate probabilities of being assigned to each of them.

**This algorithm was invented and implemented in the paper by [(Nasinski, Majchrowska and Broniatowska (2020))](https://doi.org/10.24425/cejeme.2020.134747).**

**For more details please read the paper by [(Nasinski, Gajowniczek (2023))](https://doi.org/10.1016/j.softx.2023.101525).**


## Graph - cat2cat procedure

The graphs present how the `cat2cat` function (and the underlying procedure) works, in this case under a panel dataset without the unique identifiers and only two periods.

![Backward Mapping](https://raw.githubusercontent.com/Polkas/cat2cat/master/man/figures/back_nom.png)

![Forward Mapping](https://raw.githubusercontent.com/Polkas/cat2cat/master/man/figures/for_nom.png)


## Example usage

To use `cat2cat` in a project:

### Load example data

```python
# cat2cat datasets
from cat2cat.datasets import load_trans, load_occup, load_verticals
from numpy.random import seed

seed(1234)

trans = load_trans()
occup = load_occup()
verticals = load_verticals()
```

### Low-level functions

```python
from cat2cat.mappings import get_mappings, get_freqs, cat_apply_freq

# convert the mapping table to two association lists
mappings = get_mappings(trans)
# get a variable levels freqencies
codes_new = occup.code[occup.year == 2010].values
freqs = get_freqs(codes_new)
# apply the frequencies to the (one) association list
mapp_new_p = cat_apply_freq(mappings["to_new"], freqs)

# mappings for a specific category
print(mappings["to_new"]['3481'])
# probability mappings for a specific category
print(mapp_new_p['3481'])
```

### cat2cat procedure - one iteration

```python
from cat2cat import cat2cat
from cat2cat.dataclass import cat2cat_data, cat2cat_mappings, cat2cat_ml

from pandas import concat

# split the panel by the time variale
# here only two periods
o_old = occup.loc[occup.year == 2008, :].copy()
o_new = occup.loc[occup.year == 2010, :].copy()

# dataclasses, two core arguments for the cat2cat function
data = cat2cat_data(
    old = o_old, 
    new = o_new, 
    cat_var_old = "code", 
    cat_var_new = "code", 
    time_var = "year"
)
mappings = cat2cat_mappings(trans = trans, direction = "backward")

# apply the cat2cat procedure
c2c = cat2cat(data = data, mappings = mappings)
# pandas.concat used to bind per period datasets
data_final = concat([c2c["old"], c2c["new"]])

sub_cols = ["id", "edu", "code", "year", "index_c2c", "g_new_c2c", "rep_c2c", "wei_naive_c2c", "wei_freq_c2c"]
data_final.groupby(["year"]).sample(5).loc[:, sub_cols]
```

### With ML

```python
from sklearn.neighbors import KNeighborsClassifier
from cat2cat import cat2cat_ml_run

# ml dataclass, one of the arguments for the cat2cat function
ml = cat2cat_ml(
    data = o_new, 
    cat_var = "code", 
    features = ["salary", "age", "edu"], 
    models = [KNeighborsClassifier(random_state = 1234)]
)

cat2cat_ml_run(mappings, ml)

# apply the cat2cat procedure
c2c = cat2cat(data = data, mappings = mappings, ml = ml)
# pandas.concat used to bind per period datasets
data_final = concat([c2c["old"], c2c["new"]])

sub_cols = ["id", "year", "wei_naive_c2c", "wei_freq_c2c", "wei_KNeighborsClassifier_c2c"]
data_final.groupby(["year"]).sample(3).loc[:, sub_cols]
```

With 4 periods, one mapping table and backward direction:

```python
from cat2cat.cat2cat_utils import dummy_c2c

# split the panel by the time variale
# here four periods
o_2006 = occup.loc[occup.year == 2006, :].copy()
o_2008 = occup.loc[occup.year == 2008, :].copy()
o_2010 = occup.loc[occup.year == 2010, :].copy()
o_2012 = occup.loc[occup.year == 2012, :].copy()

# dataclasses, two core arguments for the cat2cat function
data = cat2cat_data(
    old = o_2008, 
    new = o_2010, 
    cat_var_old = "code", 
    cat_var_new = "code", 
    time_var = "year"
)
mappings = cat2cat_mappings(trans = trans, direction = "backward")

# apply the cat2cat procedure
occup_back_2008_2010 = cat2cat(data = data, mappings = mappings)

# updated for the next iteration data cat2cat argument
data = cat2cat_data(
    old = o_2006, 
    new = occup_back_2008_2010["old"], 
    cat_var_old = "code", 
    cat_var_new = "g_new_c2c", 
    time_var = "year"
)

# apply the cat2cat procedure
occup_back_2006_2008 = cat2cat(data = data, mappings = mappings)

# gather the datasets for each period
o_2006_n = occup_back_2006_2008["old"]
o_2008_n = occup_back_2006_2008["new"] # or occup_back_2008_2010["old"]
o_2010_n = occup_back_2008_2010["new"]
o_2012_n = dummy_c2c(o_2012, "code")

# pandas.concat used to bind per period datasets
data_final = concat([o_2006_n, o_2008_n, o_2010_n, o_2012_n])

sub_cols = ["id", "edu", "code", "year", "index_c2c",
 "g_new_c2c", "rep_c2c", "wei_naive_c2c", "wei_freq_c2c"]
data_final.groupby(["year"]).sample(2).loc[:, sub_cols]
```

### Prune - prune_c2c


Pruning which could be useful after the mapping process, the custom prune_fun is provided by the end user.
The prune_fun is a function to process a 1D-array of weights (float) and return a 1D-array of boolean of the same length. The weighs will be reweighted automatically to still to sum to one per each original observation.

- non-zero - lambda x: x > 0
- highest1 - lambda x: arange(len(x)) == argmax(x)
- highest - lambda x: x == max(x)

```python
from cat2cat.cat2cat_utils import prune_c2c
from numpy import arange, argmax

# prune_c2c
# highest1 leave only one observation with the highest probability for each orginal one
(o_2006_n.shape[0], 
 prune_c2c(o_2006_n, lambda x: arange(len(x)) == argmax(x)).shape[0])
```

### Direct match


It is important to set the `id_var` argument as then we merging categories 1 to 1
for this identifier which exists in both periods.

```python
# split the panel by the time variable
vert_old = verticals.loc[verticals["v_date"] == "2020-04-01", :]
vert_new = verticals.loc[verticals["v_date"] == "2020-05-01", :]

## extract mapping (transition) table from data using identifier
trans_v = vert_old.merge(vert_new, on = "ean", how = "inner")\
.loc[:, ["vertical_x", "vertical_y"]]\
.drop_duplicates()
```

```python
# dataclasses, two core arguments for the cat2cat function
data = cat2cat_data(
  old = vert_old, 
  new = vert_new, 
  id_var = "ean", 
  cat_var_old = "vertical", 
  cat_var_new = "vertical", 
  time_var = "v_date"
)
mappings = cat2cat_mappings(trans = trans_v, direction = "backward")

# apply the cat2cat procedure
verts = cat2cat(
  data = data,
  mappings = mappings
)

# pandas.concat used to bind per period datasets
data_final = concat([verts["old"], verts["new"]])
```

### Direct match with ML

```python
# ml dataclass, one of the arguments for the cat2cat function
ml = cat2cat_ml(
    data = vert_old, 
    cat_var = "vertical", 
    features = ["sales"], 
    models = [KNeighborsClassifier()]
)

# apply the cat2cat procedure
verts_ml = cat2cat(
  data = data,
  mappings = mappings,
  ml = ml
)

# pandas.concat used to bind per period datasets
data_final = concat([verts_ml["old"], verts_ml["new"]])
```