I have a pandas DataFrame that looks like this:
docdb tech_classes
1187498 ['Y02P 20/10']
1236571 ['Y02B 30/13' 'Y02B 30/12' 'Y02P 20/10']
1239098 ['Y10S 426/805' 'Y02A 40/81']
...
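What I want to do is create N dummy variables, where N is the total number of distinct names that appear in tech_classes (note that Y02P 20/10 is a single name, as if it were written Y02P_20/10, and likewise Y02B 30/13, and so on). Each variable should be a dummy equal to 1 whenever the docdb has that class in tech_classes.
In other words, the result for the example above should look like this: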
docdb Y02P_20/10 Y02B_30/13 Y02B_30/12 Y02A_40/81 Y10S_426/805 ...
1187498 1 0 0 0 0
1236571 1 1 1 0 0
1239098 0 0 0 1 1
...
Thanks a lot!
Also, I know pandas has get_dummies, but it doesn't quite work here because tech_classes is not stored as actual lists. Specifically:
df_patents.head().to_dict('list')
gives:
{'docdb_family_id': [1187498, 1226468, 1236571, 1239098, 1239277],
'tech_fields_cited': ["['Y02P_20_10']",
"['Y10T_156_1023']",
"['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
"['Y10S_426_805','Y02A_40_81']",
"['Y02E_60_10','Y02T_90_12','Y02T_10_7072','Y02T_90_14','Y02T_10_70']"],
'patindocdb_years': ['[1998 1999 1996]',
'[1996 1992 1994 1993 1997]',
'[1991 1993 1990 1996]',
'[1995 1992 1993]',
'[1996 1993 1992]'],
'appln_auth': ['DE', 'DE', 'WO', 'WO', 'WO'],
'appln_nr': ['19581932', '4042441', '9002512', '9103158', '9105114'],
'earliest_publn_year': [1998, 1992, 1991, 1992, 1993],
'nb_citing_docdb_fam_y': [5, 17, 35, 32, 35],
'person_ctrycode': ["['RU']", "['DE']", "['US']", "['US']", "['IL']"],
'fronteer': [0, 0, 0, 0, 0],
'distance': [9999, 2, 9999, 9999, 9999],
'oecd_fields': ['[nan]', '[nan]', '[nan]', '[nan]', '[nan]'],
'nr_green': [1, 3, 5, 4, 10],
'pctage_green': [0.2, 0.17647059, 0.14285715, 0.125, 0.2857143],
'id_mas': [1, 2, 3, 4, 5],
'avg_dist_citing': ['[0.6666666666666666]',
'[2.5]',
'[inf]',
'[inf]',
'[inf]'],
'dist_citing_patents2': ['[1, 1, 0]',
'[3, 3, 1, 3, 2, 3]',
'[5, 99999, 5, 2, 5, 99999, 4, 6, 99999, 6, 7, 7, 2, 0, 1, 0, 0, 0, 1, 0, 3, 1, 1]',
'[99999, 99999, 99999, 99999, 99999, 99999, 2, 2, 2, 99999, 99999, 2, 2, 99999, 4, 99999, 3, 2, 0, 1, 1, 1, 3, 99999, 99999]',
'[99999, 1, 1, 1, 1, 3, 1, 1, 1, 99999, 6, 1, 2, 99999, 5, 4, 3, 0, 2, 1, 1, 1, 1, 2, 1, 1, 0, 0, 2, 0, 3, 2]'],
'id_us': [3, 4, 5, 6, 7],
'y_tr1': [0.60000002, 0.05882353, 0.25714287, 0.125, 0.51428574],
'y_tr2': [0.60000002, 0.11764706, 0.31428573, 0.3125, 0.65714288],
'y_tr3': [0.60000002, 0.35294119, 0.34285715, 0.375, 0.74285716],
'y_tr4': [0.60000002, 0.35294119, 0.37142858, 0.40625, 0.77142859],
'y_tr5': [0.60000002, 0.35294119, 0.45714286, 0.40625, 0.80000001]}
2 Answers
f87krz0w #1
Assuming tech_classes contains actual lists, you can join the strings of each list with a separator and then use str.get_dummies on the result.
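The answer's original code did not survive on this page, so the following is only a minimal sketch of the join-then-str.get_dummies idea, assuming the column really holds Python lists. The sample frame is rebuilt from the question, and the names dummies and out are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question; here tech_classes holds real lists.
df = pd.DataFrame({
    "docdb": [1187498, 1236571, 1239098],
    "tech_classes": [
        ["Y02P 20/10"],
        ["Y02B 30/13", "Y02B 30/12", "Y02P 20/10"],
        ["Y10S 426/805", "Y02A 40/81"],
    ],
})

# Join each list into one "|"-separated string, then split it into 0/1 columns.
dummies = df["tech_classes"].str.join("|").str.get_dummies(sep="|")

# Optional: use underscores in the column names, as in the desired output.
dummies.columns = dummies.columns.str.replace(" ", "_", regex=False)

out = df[["docdb"]].join(dummies)
print(out)
```

The separator only has to be a character that never occurs inside a class name; "|" is an assumption that holds for the codes shown in the question.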
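Update
The lists are actually stored as their string representations. The approach above works once they are first converted to lists with ast.literal_eval, but it is more efficient to operate on the strings directly. For a quick test, try it on a few rows first.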
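Both routes can be sketched on a few rows taken from the to_dict output in the question. This is an illustration under stated assumptions, not the answer's original code: option B assumes the class codes themselves contain no brackets, quotes, spaces, or commas, which holds for the underscore-style codes shown there. The names parsed, cleaned, dummies_a and dummies_b are hypothetical.

```python
import ast

import pandas as pd

# A few rows copied from the question's to_dict output: the lists are strings.
df_patents = pd.DataFrame({
    "docdb_family_id": [1187498, 1226468, 1236571],
    "tech_fields_cited": [
        "['Y02P_20_10']",
        "['Y10T_156_1023']",
        "['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
    ],
})

# Option A: parse every string into a real list, then join and get dummies.
parsed = df_patents["tech_fields_cited"].apply(ast.literal_eval)
dummies_a = parsed.str.join("|").str.get_dummies(sep="|")

# Option B: skip parsing; strip brackets, quotes and spaces, then split on commas.
cleaned = df_patents["tech_fields_cited"].str.replace(r"[\[\]' ]", "", regex=True)
dummies_b = cleaned.str.get_dummies(sep=",")

out = df_patents[["docdb_family_id"]].join(dummies_b)
print(out)
```

Option B avoids one Python-level call per row, which is the likely reason it was described as more efficient. Because it strips spaces, it would not suit codes written with spaces such as Y02P 20/10.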
For a large frame, the same conversion can be run in chunks.
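The chunked code is missing from this page, so the following is only one possible sketch of chunked processing. The helper dummies_for and the value of chunk_size are hypothetical. Each chunk produces its own set of columns, so the pieces are concatenated and the gaps filled with 0.

```python
import pandas as pd

df_patents = pd.DataFrame({
    "docdb_family_id": [1187498, 1226468, 1236571, 1239098],
    "tech_fields_cited": [
        "['Y02P_20_10']",
        "['Y10T_156_1023']",
        "['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
        "['Y10S_426_805','Y02A_40_81']",
    ],
})


def dummies_for(strings):
    """Hypothetical helper: dummy columns for one slice of stringified lists."""
    cleaned = strings.str.replace(r"[\[\]' ]", "", regex=True)
    return cleaned.str.get_dummies(sep=",")


chunk_size = 2  # illustrative only; pick something much larger for real data

parts = [
    dummies_for(df_patents["tech_fields_cited"].iloc[start:start + chunk_size])
    for start in range(0, len(df_patents), chunk_size)
]

# A class missing from a chunk shows up as NaN after concat; treat it as 0.
dummies = pd.concat(parts).fillna(0).astype(int)

out = df_patents[["docdb_family_id"]].join(dummies)
print(out)
```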
osh3o9ms #2
You seem to be looking for explode and get_dummies.
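A minimal sketch of how the two could be combined, assuming tech_classes already holds real lists (stringified lists would first need ast.literal_eval). The grouping step and the names exploded, dummies and out are illustrative additions, not taken from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "docdb": [1187498, 1236571, 1239098],
    "tech_classes": [
        ["Y02P 20/10"],
        ["Y02B 30/13", "Y02B 30/12", "Y02P 20/10"],
        ["Y10S 426/805", "Y02A 40/81"],
    ],
})

# explode gives one row per (docdb, class) pair and repeats the original index.
exploded = df.explode("tech_classes")

# One-hot encode the classes, then collapse back to one row per original index.
dummies = (
    pd.get_dummies(exploded["tech_classes"], dtype=int)
    .groupby(level=0)
    .max()
)

out = df[["docdb"]].join(dummies)
print(out)
```

Taking the maximum per group keeps the values at 0 or 1 even if a class appears twice in the same list.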