pandas: creating dummy variables from a list-valued column in Python

gdrx4gfi  posted on 2023-01-19 in Python

I have a pandas DataFrame that looks like this:

docdb    tech_classes
1187498     ['Y02P 20/10']
1236571     ['Y02B 30/13' 'Y02B 30/12' 'Y02P 20/10']
1239098     ['Y10S 426/805' 'Y02A 40/81']
...

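What I want to do is create N dummy variables, where N is the total number of distinct names that appear in tech_classes (note that Y02P 20/10 is a single name, as if it were written Y02P_20/10, and likewise for Y02B 30/13, etc.). Each variable should be a dummy equal to 1 whenever the docdb has that class in tech_classes.
In other words, the result for the example above should look like this: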

docdb Y02P_20/10 Y02B_30/13 Y02B_30/12 Y02A_40/81 Y10S_426/805 ...
1187498  1             0          0          0          0
1236571  1             1          1          0          0
1239098  0             0          0          1          1
...

Thanks a lot!
Also, I know pandas has get_dummies, but it does not quite work here because tech_classes is not actually stored as a list. Specifically:

df_patents.head().to_dict('list')

gives:

{'docdb_family_id': [1187498, 1226468, 1236571, 1239098, 1239277],
 'tech_fields_cited': ["['Y02P_20_10']",
  "['Y10T_156_1023']",
  "['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
  "['Y10S_426_805','Y02A_40_81']",
  "['Y02E_60_10','Y02T_90_12','Y02T_10_7072','Y02T_90_14','Y02T_10_70']"],
 'patindocdb_years': ['[1998 1999 1996]',
  '[1996 1992 1994 1993 1997]',
  '[1991 1993 1990 1996]',
  '[1995 1992 1993]',
  '[1996 1993 1992]'],
 'appln_auth': ['DE', 'DE', 'WO', 'WO', 'WO'],
 'appln_nr': ['19581932', '4042441', '9002512', '9103158', '9105114'],
 'earliest_publn_year': [1998, 1992, 1991, 1992, 1993],
 'nb_citing_docdb_fam_y': [5, 17, 35, 32, 35],
 'person_ctrycode': ["['RU']", "['DE']", "['US']", "['US']", "['IL']"],
 'fronteer': [0, 0, 0, 0, 0],
 'distance': [9999, 2, 9999, 9999, 9999],
 'oecd_fields': ['[nan]', '[nan]', '[nan]', '[nan]', '[nan]'],
 'nr_green': [1, 3, 5, 4, 10],
 'pctage_green': [0.2, 0.17647059, 0.14285715, 0.125, 0.2857143],
 'id_mas': [1, 2, 3, 4, 5],
 'avg_dist_citing': ['[0.6666666666666666]',
  '[2.5]',
  '[inf]',
  '[inf]',
  '[inf]'],
 'dist_citing_patents2': ['[1, 1, 0]',
  '[3, 3, 1, 3, 2, 3]',
  '[5, 99999, 5, 2, 5, 99999, 4, 6, 99999, 6, 7, 7, 2, 0, 1, 0, 0, 0, 1, 0, 3, 1, 1]',
  '[99999, 99999, 99999, 99999, 99999, 99999, 2, 2, 2, 99999, 99999, 2, 2, 99999, 4, 99999, 3, 2, 0, 1, 1, 1, 3, 99999, 99999]',
  '[99999, 1, 1, 1, 1, 3, 1, 1, 1, 99999, 6, 1, 2, 99999, 5, 4, 3, 0, 2, 1, 1, 1, 1, 2, 1, 1, 0, 0, 2, 0, 3, 2]'],
 'id_us': [3, 4, 5, 6, 7],
 'y_tr1': [0.60000002, 0.05882353, 0.25714287, 0.125, 0.51428574],
 'y_tr2': [0.60000002, 0.11764706, 0.31428573, 0.3125, 0.65714288],
 'y_tr3': [0.60000002, 0.35294119, 0.34285715, 0.375, 0.74285716],
 'y_tr4': [0.60000002, 0.35294119, 0.37142858, 0.40625, 0.77142859],
 'y_tr5': [0.60000002, 0.35294119, 0.45714286, 0.40625, 0.80000001]}
f87krz0w  1#

Assuming tech_classes holds actual lists, you can join the strings and use str.get_dummies:

df = df.join(df.pop('tech_classes').agg('|'.join).str.get_dummies())

Output:

docdb  Y02A 40/81  Y02B 30/12  Y02B 30/13  Y02P 20/10  Y10S 426/805
0  1187498           0           0           0           1             0
1  1236571           0           1           1           1             0
2  1239098           1           0           0           0             1
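
As a self-contained illustration of the join-then-split idea, here is a minimal sketch that rebuilds the three example rows from the question with real Python lists. It uses Series.str.join instead of agg to do the joining; that substitution is a choice made for this sketch and is not taken from the answer.

```python
import pandas as pd

# Example rows from the question, with tech_classes as real lists
df = pd.DataFrame({
    "docdb": [1187498, 1236571, 1239098],
    "tech_classes": [
        ["Y02P 20/10"],
        ["Y02B 30/13", "Y02B 30/12", "Y02P 20/10"],
        ["Y10S 426/805", "Y02A 40/81"],
    ],
})

# str.join glues the items of each list together with "|";
# str.get_dummies then splits on the same separator and builds 0/1 columns
dummies = df["tech_classes"].str.join("|").str.get_dummies(sep="|")

# Put the identifier back next to the indicator columns
result = df[["docdb"]].join(dummies)
print(result)
```

This only works as long as the separator never occurs inside a class name, which holds for "|" in the data shown here.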
Update

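The lists are actually stored as their string representations. You could convert them to lists with ast.literal_eval first and then use the approach above, but a more efficient way is: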

df = df.join(df.pop('tech_classes').str[2:-2].str.get_dummies("','"))

If you want a quick test:

df['tech_fields_cited'].head().str[2:-2].str.get_dummies("','")
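
To make the two routes concrete, the sketch below runs both on a few of the string values shown in the question: parsing with ast.literal_eval and then joining, versus slicing off the leading `['` and trailing `']` and splitting on `','`. It assumes every cell has exactly that format, with no spaces after the commas; the slicing route would produce wrong column names otherwise.

```python
import ast
import pandas as pd

# String representations of lists, copied from the question's sample data
s = pd.Series([
    "['Y02P_20_10']",
    "['Y10T_156_1023']",
    "['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
], name="tech_fields_cited")

# Route 1: parse each string into a real list, then join and split on "|"
via_parse = s.apply(ast.literal_eval).str.join("|").str.get_dummies(sep="|")

# Route 2: drop the outer [' and '] and split directly on the ',' separator
via_slice = s.str[2:-2].str.get_dummies("','")

print(via_parse.columns.tolist())
print(via_slice.columns.tolist())
```

Both routes should give the same indicator columns here; the slicing route simply skips the per-row Python parsing step.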
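In chunks:
import numpy as np
import pandas as pd

# number of rows to process simultaneously
N = 100_000

lst = []
for k, g in df['tech_fields_cited'].groupby(np.arange(len(df))//N):
    lst.append(g.str[2:-2].str.get_dummies("','"))

# a class absent from a chunk shows up as NaN after concat, so fill with 0
out = pd.concat(lst).fillna(0).astype(int)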
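
One way to sanity-check the chunked version is to compare it against the unchunked result on a small sample. The sketch below wraps the loop in a helper, chunked_dummies, which is a hypothetical name introduced here for illustration, and uses a chunk size of 2 so that several chunks are produced. Because each chunk only sees its own classes, the concatenated frame is filled with 0 where a column was missing.

```python
import numpy as np
import pandas as pd

def chunked_dummies(s, chunk_size):
    """Hypothetical helper: build dummies chunk by chunk from list-like strings."""
    parts = [
        g.str[2:-2].str.get_dummies("','")
        for _, g in s.groupby(np.arange(len(s)) // chunk_size)
    ]
    # Columns unseen in a chunk become NaN on concat; turn them into 0
    return pd.concat(parts).fillna(0).astype(int)

# Sample values copied from the question
s = pd.Series([
    "['Y02P_20_10']",
    "['Y10T_156_1023']",
    "['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
    "['Y10S_426_805','Y02A_40_81']",
    "['Y02E_60_10','Y02T_90_12','Y02T_10_7072','Y02T_90_14','Y02T_10_70']",
])

out = chunked_dummies(s, chunk_size=2)
full = s.str[2:-2].str.get_dummies("','")

# Column order differs between the two, so align before comparing
print((out[full.columns].to_numpy() == full.to_numpy()).all())
```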
osh3o9ms  2#

It looks like you are looking for explode and get_dummies:

pd.get_dummies(df.explode('tech_classes')).groupby('docdb').sum()
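
A minimal sketch of the explode route on the question's three example rows follows. It assumes tech_classes already holds real lists (string representations would need parsing first), and it moves docdb into the index before exploding so the dummy columns carry the bare class names; that variation is a choice made for this sketch rather than part of the answer.

```python
import pandas as pd

# Example rows from the question, with tech_classes as real lists
df = pd.DataFrame({
    "docdb": [1187498, 1236571, 1239098],
    "tech_classes": [
        ["Y02P 20/10"],
        ["Y02B 30/13", "Y02B 30/12", "Y02P 20/10"],
        ["Y10S 426/805", "Y02A 40/81"],
    ],
})

# One row per (docdb, class) pair, indexed by docdb
exploded = df.set_index("docdb")["tech_classes"].explode()

# One indicator column per class, then collapse back to one row per docdb
result = pd.get_dummies(exploded).groupby(level="docdb").sum()
print(result)
```

Summing counts occurrences, so a class listed twice for the same docdb would give 2 rather than 1; using max() instead of sum() is one way to keep strict 0/1 values.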
