Python: Propensity Score Matching

Propensity score matching (PSM) is a technique for balancing the treatment and control groups on their other characteristics (such as age, weight, height, and ethnicity), so that observational data can approximate a randomized controlled trial (RCT). For example, when studying a drug's effect on a disease, the treatment and control groups in a clinical trial should differ only in whether they received the drug or a placebo; their other clinical characteristics (age, weight, etc.) should be broadly similar. Only then are the two groups comparable, and only then can the drug's effectiveness be assessed.

In the example shown below, the treatment is actually ineffective, but the age imbalance between the two groups leads to the wrong conclusion.

[Figure: an age imbalance between the two groups producing a spurious treatment effect]

The risk ratio (RR), also called relative risk, is simply the ratio of two rates: the true incidence, prevalence, or mortality rate of one group divided by that of the other. When interpreting RR, compare it against 1: RR > 1 indicates a risk factor for disease or death; RR < 1 indicates a protective factor.
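The RR arithmetic is straightforward. A minimal sketch with hypothetical counts (the numbers below are made up purely for illustration):

```python
# Hypothetical 2x2 outcome counts:
# 30 of 100 treated patients and 10 of 100 controls experience the event.
events_treated, n_treated = 30, 100
events_control, n_control = 10, 100

risk_treated = events_treated / n_treated    # 0.30
risk_control = events_control / n_control    # 0.10
rr = risk_treated / risk_control
print(round(rr, 2))  # 3.0 -> RR > 1, so the exposure is a risk factor here
```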

1. Import modules

%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

%watermark -a 'xujun' -d -t -v -p numpy,scipy,pandas,sklearn,matplotlib,seaborn

2. Load the data

num_cols = ['age','sofa']
cat_cols = ['first_careunit','echo','gender','mort_28_day','icd_chf']
dtype = {col: 'category' for col in cat_cols}
df = pd.read_csv('./merged_data.csv',dtype=dtype)
df.head()

3. Encode the categorical columns as integer codes (0, 1, 2, …)

col_mappings = {}
for col in ('first_careunit','echo','gender','mort_28_day','icd_chf'):
    col_mapping = dict(enumerate(df[col].cat.categories))
    col_mappings[col] = col_mapping
print(col_mappings)

for col in ('first_careunit','echo','gender','mort_28_day','icd_chf'):
    df[col] = df[col].cat.codes
df.head()
features = df.columns.tolist()   # all column names as a list
features.remove('echo')          # remove the treatment column
features.remove('mort_28_day')   # remove the outcome column

agg_operations = {'echo': 'count'}  # group sizes
agg_operations.update({
    feature: ['mean', 'std'] for feature in features
})

table_one = df.groupby('echo').agg(agg_operations)
table_one.head()
def compute_table_one_smd(table_one: pd.DataFrame, round_digits: int = 4) -> pd.DataFrame:
    """Standardized mean difference (SMD) of each feature between the
    control group (row 0) and the treatment group (row 1)."""
    feature_smds = []
    for feature in features:
        feature_table_one = table_one[feature].values
        neg_mean = feature_table_one[0, 0]  # control mean
        neg_std = feature_table_one[0, 1]   # control std
        pos_mean = feature_table_one[1, 0]  # treatment mean
        pos_std = feature_table_one[1, 1]   # treatment std

        # SMD: mean difference scaled by the pooled standard deviation
        smd = (pos_mean - neg_mean) / np.sqrt((pos_std ** 2 + neg_std ** 2) / 2)
        smd = round(abs(smd), round_digits)
        feature_smds.append(smd)

    return pd.DataFrame({'features': features, 'smd': feature_smds})


table_one_smd = compute_table_one_smd(table_one)
table_one_smd
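The logistic regression in the next step fits on `data` and `treatment`, and later code also uses `df_cleaned` and `death`, but the post never shows how they are built. A plausible reconstruction, sketched on a tiny synthetic stand-in for merged_data.csv (the synthetic rows and the one-hot encoding step are assumptions, not the author's exact code):

```python
import pandas as pd

# Tiny synthetic stand-in for merged_data.csv (column names from the post)
df = pd.DataFrame({
    'age': [65, 72, 58, 80],
    'sofa': [4, 7, 2, 9],
    'first_careunit': ['MICU', 'SICU', 'MICU', 'CCU'],
    'gender': ['M', 'F', 'F', 'M'],
    'icd_chf': ['0', '1', '0', '1'],
    'echo': [1, 0, 1, 0],          # treatment indicator
    'mort_28_day': [0, 1, 0, 1],   # outcome
})
num_cols = ['age', 'sofa']
cat_covariates = ['first_careunit', 'gender', 'icd_chf']  # covariates only

# One-hot encode the categorical covariates and keep the numeric ones
df_one_hot = pd.get_dummies(df[cat_covariates].astype('category'), drop_first=True)
df_cleaned = pd.concat([df[num_cols], df_one_hot], axis=1)

treatment = df['echo'].values      # target for the propensity model
death = df['mort_28_day'].values   # outcome, added back after matching
data = df_cleaned.values           # design matrix (features only)
print(df_cleaned.columns.tolist())
```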
# Fit a logistic regression of treatment assignment on the covariates
logistic = LogisticRegression(solver='liblinear')
logistic.fit(data, treatment)

# The predicted probability of receiving treatment is the propensity score
pscore = logistic.predict_proba(data)[:, 1]
pscore

# AUC of the propensity model
roc_auc_score(treatment, pscore)
mask = treatment == 1
pos_pscore = pscore[mask]
neg_pscore = pscore[~mask]
print('treatment count:', pos_pscore.shape)
print('control count:', neg_pscore.shape)

Result: treatment count: (3262,) control count: (3099,)

# Plot the propensity score distributions of the two groups
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12

# Note: sns.distplot is deprecated in recent seaborn releases;
# sns.kdeplot or sns.histplot is the modern equivalent
sns.distplot(neg_pscore, label='control')
sns.distplot(pos_pscore, label='treatment')
plt.xlim(0, 1)
plt.title('Propensity Score Distribution of Control vs Treatment')
plt.ylabel('Density')
plt.xlabel('Scores')
plt.legend()
plt.tight_layout()
plt.show()
def get_similar(pos_pscore: np.ndarray, neg_pscore: np.ndarray, topn: int = 5, n_jobs: int = 1):
    """For each treated unit, find the nearest control units by propensity score."""
    from sklearn.neighbors import NearestNeighbors

    knn = NearestNeighbors(n_neighbors=topn + 1, metric='euclidean', n_jobs=n_jobs)
    knn.fit(neg_pscore.reshape(-1, 1))

    distances, indices = knn.kneighbors(pos_pscore.reshape(-1, 1))
    # topn + 1 neighbors are requested and the closest one dropped, following
    # the original notebook (which guards against a point matching itself)
    sim_distances = distances[:, 1:]
    sim_indices = indices[:, 1:]
    return sim_distances, sim_indices


sim_distances, sim_indices = get_similar(pos_pscore, neg_pscore, topn=1)
sim_indices
# How often each control is reused as a match: bincount shows how many
# controls were matched to 1, 2, 3, ... treated units
_, counts = np.unique(sim_indices[:, 0], return_counts=True)
np.bincount(counts)
# Add the echo and mort_28_day columns back to df_cleaned
df_cleaned['echo'] = treatment
df_cleaned['mort_28_day'] = death

df_pos = df_cleaned[mask]
df_neg = df_cleaned[~mask].iloc[sim_indices[:, 0]]
df_matched = pd.concat([df_pos, df_neg], axis=0)
df_matched.head()
# Re-check covariate balance after matching, grouping by treatment as before
table_one_matched = df_matched.groupby('echo').agg(agg_operations)
table_one_smd_matched = compute_table_one_smd(table_one_matched)
table_one_smd_matched
num_matched_pairs = df_neg.shape[0]
print('number of matched pairs: ', num_matched_pairs)

# Paired t-test on the outcome across the matched pairs
stats.ttest_rel(df_pos['mort_28_day'].values, df_neg['mort_28_day'].values)
Result:
number of matched pairs:  3262
Ttest_relResult(statistic=-7.974066292969522, pvalue=2.1050737386146504e-15)
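Tying back to the RR defined at the top: after 1:1 matching, the risk ratio can be read off directly as the ratio of 28-day mortality rates in the matched treated and control rows. A sketch with hypothetical matched outcomes (the arrays below stand in for `df_pos['mort_28_day']` and `df_neg['mort_28_day']`, not the real data):

```python
import numpy as np

# Hypothetical outcomes for five matched pairs (1 = died within 28 days)
mort_treated = np.array([0, 1, 0, 0, 1])  # stands in for df_pos['mort_28_day']
mort_control = np.array([1, 1, 0, 1, 1])  # stands in for df_neg['mort_28_day']

# RR = event rate in treated / event rate in matched controls
rr = mort_treated.mean() / mort_control.mean()
print(rr)  # 0.4 / 0.8 = 0.5 -> RR < 1, i.e. protective in this toy data
```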
df_pos = df[mask]
df_neg = df[~mask].iloc[sim_indices[:, 0]]
data_final = pd.concat([df_pos, df_neg], axis=0)
data_final.to_csv('merged_data_final.csv', index=False)

Then re-import merged_data_final.csv and plot it to inspect the result.


Note: this is a reposted article and does not represent the views of 贪吃的夜猫子. If you repost it, please cite the original source: http://ethen8181.github.io/machine-learning/ab_tests/causal_inference/matching.html#The-Definition-of-Causal-Effect
