Replacing characters and extracting substrings from a pandas DataFrame

9vw9lbht  posted on 2023-05-21  in  Other

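I have the following pandas DataFrame. I want to replace some characters and extract substrings (the original DataFrame contains more rows).
I am using the regular expression below, but it fails to remove the '?' from some rows, such as rows 6, 7 and 8.
df[['label', 'id']] = df['name'].str.extract(r'{???|?[[{]?(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+(\d+)')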

You-Hoover-Fong syndrome, 616954 (3)
Yuan-Harel-Lupski syndrome (4)
Zaki syndrome, 619648 (3)
Zimmermann-Laband syndrome 2, 616455 (3)
Zimmermann-Laband syndrome 3, 618658 (3)
[?Birbeck granule deficiency], 613393 (3)
[?Homosexuality, male] (2)
[?Phosphohydroxylysinuria], 615011 (3)
[Acetylation, slow], 243400 (3)

The expected output is:

You-Hoover-Fong syndrome          616954  
Yuan-Harel-Lupski syndrome
Zaki syndrome                     619648
Zimmermann-Laband syndrome 2      616455 
Zimmermann-Laband syndrome 3      618658 
Birbeck granule deficiency        613393 
Homosexuality, male 
Phosphohydroxylysinuria           615011 
Acetylation, slow                 243400

How can I modify the current regular expression so that the '?' is also removed from the rows above?

hi3rlvi2 #1

Try:

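# Pull the six-digit ID into its own column; rows without one get an empty string
df['number'] = df['text'].str.extract(r'(\d{6})').fillna('')
# Keep the label: skip leading non-letters, allow a short trailing digit suffix (e.g. "syndrome 2"), drop the rest
df['text'] = df['text'].str.extract(r'^[^a-zA-Z]*(.*?(?:\s*(?<!\()\d{,2}))[^a-zA-Z]*$')
df['text'] = df['text'].str.strip()
print(df)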

Prints:

                           text  number
0      You-Hoover-Fong syndrome  616954
1    Yuan-Harel-Lupski syndrome        
2                 Zaki syndrome  619648
3  Zimmermann-Laband syndrome 2  616455
4  Zimmermann-Laband syndrome 3  618658
5    Birbeck granule deficiency  613393
6           Homosexuality, male        
7       Phosphohydroxylysinuria  615011
8             Acetylation, slow  243400
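
For comparison, the '?' can also be handled in one pattern, in the spirit of the question's single `str.extract` call. The following is only a sketch: the pattern, the constant `PATTERN` and the helper `split_name` are assumptions made for illustration, and the pattern has been checked only against the nine sample rows shown in this thread. It uses the standard `re` module so that it runs without a DataFrame.

```python
import re

# Assumed pattern, written for the nine sample rows only (not from the original answer).
PATTERN = re.compile(
    r"""
    ^[\[{]?\??                # optional opening bracket or brace, then optional '?'
    (?P<label>.*?)            # label text, matched lazily
    [\]}]?                    # optional closing bracket or brace
    (?:,\s+(?P<id>\d{3,}))?   # optional ", <id>" where the id has three or more digits
    \s+\(\d+\)$               # trailing "(n)" marker, matched but not captured
    """,
    re.VERBOSE,
)


def split_name(name):
    """Return (label, id) for one row; id is '' when the row has no ID."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected row format: {name!r}")
    # An optional group that did not take part in the match yields None.
    return match.group("label"), match.group("id") or ""


samples = [
    "Zimmermann-Laband syndrome 2, 616455 (3)",
    "[?Homosexuality, male] (2)",
    "[Acetylation, slow], 243400 (3)",
]
results = [split_name(s) for s in samples]
```

The same pattern could presumably be handed to pandas as `df['name'].str.extract(PATTERN.pattern, flags=re.VERBOSE)`; because the groups are named, the resulting columns would be called `label` and `id`, with NaN where a row has no ID.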

Initial DataFrame:

                                        text
0       You-Hoover-Fong syndrome, 616954 (3)
1             Yuan-Harel-Lupski syndrome (4)
2                  Zaki syndrome, 619648 (3)
3   Zimmermann-Laband syndrome 2, 616455 (3)
4   Zimmermann-Laband syndrome 3, 618658 (3)
5  [?Birbeck granule deficiency], 613393 (3)
6                 [?Homosexuality, male] (2)
7     [?Phosphohydroxylysinuria], 615011 (3)
8            [Acetylation, slow], 243400 (3)
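
To reproduce the answer end to end, the sketch below rebuilds the initial DataFrame from the nine sample rows and applies the same two patterns. It assumes the column is called `text`, as in the answer (the question's column is `name`), and it passes `expand=False` so that `str.extract` returns a Series instead of a one-column DataFrame. The list name `rows` is only illustrative.

```python
import pandas as pd

# The nine sample rows from the question, stored in a column named "text".
rows = [
    "You-Hoover-Fong syndrome, 616954 (3)",
    "Yuan-Harel-Lupski syndrome (4)",
    "Zaki syndrome, 619648 (3)",
    "Zimmermann-Laband syndrome 2, 616455 (3)",
    "Zimmermann-Laband syndrome 3, 618658 (3)",
    "[?Birbeck granule deficiency], 613393 (3)",
    "[?Homosexuality, male] (2)",
    "[?Phosphohydroxylysinuria], 615011 (3)",
    "[Acetylation, slow], 243400 (3)",
]
df = pd.DataFrame({"text": rows})

# Step 1: the six-digit ID, or an empty string when the row has none.
df["number"] = df["text"].str.extract(r"(\d{6})", expand=False).fillna("")

# Step 2: the label. Leading non-letters such as "[?" are skipped, the lazy
# group runs up to the last letter plus an optional suffix of at most two
# digits that does not directly follow "(", and everything after that must
# be free of letters.
label_pattern = r"^[^a-zA-Z]*(.*?(?:\s*(?<!\()\d{,2}))[^a-zA-Z]*$"
df["text"] = df["text"].str.extract(label_pattern, expand=False).str.strip()

print(df)
```

Note that step 1 relies on every ID in the sample having exactly six digits; data with shorter or longer IDs would need a different quantifier.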
