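I have a DataFrame with about 500k rows that contains, among others, a column named country. My goal is to clean up the country column, which contains many different spellings and typos of the same names.
For example: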
import pandas as pd
# Starting dataset:
d = {'country': ['Unites Sates', 'United state','Cnda','canada','United State', 'United sates of America','Mexio','mexico','Mejico','America','U.S.A.','UsA of A','cAnada','u. s. a. ','United States of America']}
df = pd.DataFrame(data=d)
df
country
0 Unites Sates #wants to replace
1 United state #wants to replace
2 Cnda #wants to replace
3 canada #wants to replace
4 United State #wants to replace
5 United sates of America #wants to replace
6 Mexio #wants to replace
7 mexico #wants to replace
8 Mejico #wants to replace
9 America #wants to replace
10 U.S.A. #wants to replace
11 UsA of A #wants to replace
12 cAnada #wants to replace
13 u. s. a. #wants to replace
14 United States of America
# Expected Outcome:
d = {'country': ['United States of America','United States of America','Canada','Canada','United States of America','United States of America','Mexico','Mexico','Mexico', 'United States of America','United States of America','United States of America','Canada','United States of America','United States of America']}
df = pd.DataFrame(data=d)
df
country
0 United States of America #replaced
1 United States of America #replaced
2 Canada #replaced
3 Canada #replaced
4 United States of America #replaced
5 United States of America #replaced
6 Mexico #replaced
7 Mexico #replaced
8 Mexico #replaced
9 United States of America #replaced
10 United States of America #replaced
11 United States of America #replaced
12 Canada #replaced
13 United States of America #replaced
14 United States of America
One thing I tried was to create a DataFrame named correct_countries_df that contains the correct country names, and to use it like this:
df['country_BestMatch'] = df['country'].map(lambda x: process.extractOne(x, correct_countries_df['country'])[0])
But I can't seem to get it to work.
Any ideas?
Thanks in advance!
1 answer
9lowa7mx #1
If your correct_countries_df looks like what you describe (a country column holding the correct names), then your code is correct.
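Since the question mentions roughly 500k rows, it may help to score each distinct spelling only once and then reuse the result for every repeated row. The following is a minimal sketch of that idea, not the asker's code: it uses the standard library's difflib.SequenceMatcher as a stand-in for process.extractOne, and the names CORRECT_COUNTRIES, normalize, best_match and build_lookup are hypothetical, introduced only for illustration. The three-name list is an assumption about what correct_countries_df['country'] contains.

```python
from difflib import SequenceMatcher

# Assumption: these three names stand in for correct_countries_df['country'].
CORRECT_COUNTRIES = ["United States of America", "Canada", "Mexico"]


def normalize(text):
    """Lowercase and collapse whitespace so 'cAnada ' compares like 'canada'."""
    return " ".join(text.lower().split())


def best_match(value, choices=CORRECT_COUNTRIES):
    """Return the choice most similar to value.

    Uses difflib's ratio, which is a swapped-in scorer and not the one
    behind process.extractOne, so scores may differ from the original code.
    """
    cleaned = normalize(value)
    return max(
        choices,
        key=lambda c: SequenceMatcher(None, cleaned, normalize(c)).ratio(),
    )


def build_lookup(values, choices=CORRECT_COUNTRIES):
    """Score each distinct spelling once; repeated rows become dict lookups."""
    return {v: best_match(v, choices) for v in set(values)}


# A few of the spellings from the question, with one repeated on purpose.
raw = ["Unites Sates", "canada", "Mexio", "cAnada", "canada"]
lookup = build_lookup(raw)
cleaned = [lookup[v] for v in raw]
print(cleaned)
```

With pandas, the same approach would presumably be to build the dictionary from df['country'].unique() and apply it with df['country'].map(lookup). Abbreviations such as 'U.S.A.' share very few characters with the full name and may score poorly under character-based similarity, so a small hand-written alias table could be a safer choice for those.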