Tutorials

Boston Housing Data

Load the data

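import pandas as pd
import statsmodels.api as sm

from cem.match import match
from cem.coarsen import coarsen
from cem.imbalance import L1

# load_boston is not imported in this snippet; it is assumed here to
# return the Boston housing data as a pandas DataFrame
boston = load_boston()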

O = "MEDV"  # outcome variable
T = "CHAS"  # treatment variable

y = boston[O]
X = boston.drop(columns=O)

      CRIM  ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18   2.31     0  0.538  6.575  65.2    4.09    1  296     15.3   396.9   4.98    24
1  0.02731   0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   396.9   9.14  21.6
2  0.02729   0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
3  0.03237   0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
4  0.06905   0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   396.9   5.33  36.2

Automatic Coarsening

First, coarsen the data automatically to get a baseline imbalance. Be sure to drop the column containing your outcome variable before coarsening or matching. coarsen optionally takes a list of the columns you'd like to auto-coarsen, ignoring the rest.

# coarsen predictor variables
X_coarse = coarsen(X, T, "l1")

# match observations
weights = match(X_coarse, T)

# calculate weighted imbalance
L1(X_coarse, T, weights)
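
To make the match and imbalance steps more concrete, here is a minimal sketch in plain pandas of the standard coarsened exact matching weights and a weighted multivariate L1 statistic. It is not the cem package's implementation: the functions cem_weights and l1_imbalance and the toy data are hypothetical, and the sketch assumes a binary treatment coded 0/1 with no missing values.

```python
import pandas as pd


def _strata(X_coarse, treatment):
    # A stratum is one unique combination of coarsened covariate values.
    covariates = [c for c in X_coarse.columns if c != treatment]
    return X_coarse.groupby(covariates, observed=True, sort=False).ngroup()


def cem_weights(X_coarse, treatment):
    """Illustrative CEM weights (hypothetical helper, not the cem API)."""
    strata = _strata(X_coarse, treatment)
    is_t = X_coarse[treatment] == 1
    n_t = is_t.astype(int).groupby(strata).sum()     # treated per stratum
    n_c = (~is_t).astype(int).groupby(strata).sum()  # controls per stratum
    matched = (n_t > 0) & (n_c > 0)                  # strata containing both groups
    # Control weight in stratum s: (m_C / m_T) * (m_T^s / m_C^s)
    ratio = (n_t[matched] / n_c[matched]) * (n_c[matched].sum() / n_t[matched].sum())
    in_matched = strata.map(matched).astype(float)   # 1.0 if the row's stratum is matched
    control_w = strata.map(ratio).fillna(0.0)        # 0.0 for unmatched strata
    # Treated rows get 1 (matched) or 0 (unmatched); controls get control_w.
    return control_w.where(~is_t, in_matched)


def l1_imbalance(X_coarse, treatment, weights=None):
    """Half the summed absolute difference of weighted stratum shares."""
    strata = _strata(X_coarse, treatment)
    is_t = X_coarse[treatment] == 1
    if weights is None:
        weights = pd.Series(1.0, index=X_coarse.index)
    f_t = weights[is_t].groupby(strata[is_t]).sum() / weights[is_t].sum()
    f_c = weights[~is_t].groupby(strata[~is_t]).sum() / weights[~is_t].sum()
    return float(0.5 * f_t.sub(f_c, fill_value=0.0).abs().sum())


# Hypothetical, already-coarsened toy data.
toy = pd.DataFrame({
    "T": [1, 0, 0, 1, 0, 0, 1],
    "age": ["young", "young", "young", "old", "old", "mid", "senior"],
})
weights = cem_weights(toy, "T")
print(weights.tolist())
print(l1_imbalance(toy, "T"), l1_imbalance(toy, "T", weights))
```

In this toy example the "mid" and "senior" strata contain only one group, so their observations get weight 0. Under this definition the weighted imbalance is measured on the same strata that were used for matching, so it drops to zero once the matching weights are applied.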

Informed Coarsening

It's recommended to coarsen using pandas.cut and pandas.qcut, but you are free to coarsen your predictor variables however you wish.
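
As a quick illustration of how the two functions differ, the sketch below bins a small, hypothetical skewed series (not taken from the Boston data). pandas.cut produces equal-width bins, while pandas.qcut produces bins with roughly equal counts. It also shows that pandas.qcut raises a ValueError when repeated values make quantile edges coincide, unless duplicates='drop' is passed.

```python
import pandas as pd

# Hypothetical right-skewed values.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# pd.cut: 4 equal-width bins over the observed range.
equal_width = pd.cut(values, bins=4)
# pd.qcut: 4 quantile-based bins with roughly equal counts.
equal_freq = pd.qcut(values, q=4)

width_codes = equal_width.cat.codes.tolist()
freq_codes = equal_freq.cat.codes.tolist()
print(width_codes)  # [0, 0, 0, 0, 0, 0, 0, 3]
print(freq_codes)   # [0, 0, 1, 1, 2, 2, 3, 3]

# A variable dominated by one value yields repeated quantile edges.
mostly_zero = pd.Series([0, 0, 0, 0, 0, 0, 5, 10])
try:
    pd.qcut(mostly_zero, q=4)
    qcut_failed = False
except ValueError:
    qcut_failed = True

# duplicates='drop' merges the repeated edges, leaving fewer bins.
merged_codes = pd.qcut(mostly_zero, q=4, duplicates="drop").cat.codes.tolist()
print(qcut_failed, merged_codes)  # True [0, 0, 0, 0, 0, 0, 1, 1]
```

With equal-width bins the single outlier leaves the two middle bins empty, which is why quantile bins can be a better fit for skewed predictors.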

# coarsen predictor variables
schema = {
   'CRIM': (pd.cut, {'bins': 4}),
   'ZN': (pd.qcut, {'q': 4, 'duplicates': 'drop'}),  # ZN is mostly zeros, so some quartile edges coincide
   'INDUS': (pd.qcut, {'q': 4}),
   'NOX': (pd.cut, {'bins': 5}),
   'RM': (pd.cut, {'bins': 5}),
   'AGE': (pd.cut, {'bins': 5}),
   'DIS': (pd.cut, {'bins': 5}),
   'RAD': (pd.cut, {'bins': 6}),
   'TAX': (pd.cut, {'bins': 5}),
   'PTRATIO': (pd.cut, {'bins': 6}),
   'B': (pd.cut, {'bins': 5}),
   'LSTAT': (pd.cut, {'bins': 5})
}

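# apply each column's coarsening function; columns not in the schema
# (e.g. the treatment column CHAS) pass through unchanged
X_coarse = X.apply(lambda x: schema[x.name][0](x, **schema[x.name][1]) if x.name in schema else x)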

# match observations
weights = match(X_coarse, T)

# calculate weighted imbalance
L1(X_coarse, T, weights)

# perform weighted regression
model = sm.WLS(y, sm.add_constant(X), weights=weights)
results = model.fit()
print(results.summary())
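
To see what the weights do in the regression step, here is a simplified sketch using NumPy only, with hypothetical toy values for the treatment, outcome, and weights. It assumes the design contains just an intercept and the treatment indicator, unlike the model above, which includes every predictor. In that reduced case the weighted least squares coefficient on the treatment equals the weighted difference in mean outcomes between treated and control observations.

```python
import numpy as np

# Hypothetical toy data: treatment indicator, outcome, matching weights.
t = np.array([1, 0, 0, 1, 0, 0, 1], dtype=float)
y = np.array([30.0, 22.0, 26.0, 35.0, 31.0, 20.0, 40.0])
w = np.array([1.0, 0.75, 0.75, 1.0, 1.5, 0.0, 0.0])

# Weighted least squares minimises sum_i w_i * (y_i - a - b * t_i)^2,
# which is ordinary least squares after scaling each row by sqrt(w_i).
design = np.column_stack([np.ones_like(t), t])
sqrt_w = np.sqrt(w)
coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
intercept, effect = coef

# Weighted difference in means between treated and control observations.
treated, control = t == 1, t == 0
diff_in_means = (np.average(y[treated], weights=w[treated])
                 - np.average(y[control], weights=w[control]))

print(round(float(effect), 6), round(float(diff_in_means), 6))  # 5.0 5.0
```

Observations with weight 0 (the unmatched ones) contribute nothing to either estimate.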