python/pandas: How can I use fuzzywuzzy to replace misspellings in a column with the correct country names?

kadbb459  posted on 2021-08-20  in Java

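I have a DataFrame of about 500k rows that contains, among others, a column named country. My goal is to replace the various misspellings in the country column with the correct country names.
For example: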

    import pandas as pd

    # Starting dataset:
    d = {'country': ['Unites Sates', 'United state','Cnda','canada','United State', 'United sates of America','Mexio','mexico','Mejico','America','U.S.A.','UsA of A','cAnada','u. s. a. ','United States of America']}
    df = pd.DataFrame(data=d)
    df
    #     country
    # 0   Unites Sates               # wants to replace
    # 1   United state               # wants to replace
    # 2   Cnda                       # wants to replace
    # 3   canada                     # wants to replace
    # 4   United State               # wants to replace
    # 5   United sates of America    # wants to replace
    # 6   Mexio                      # wants to replace
    # 7   mexico                     # wants to replace
    # 8   Mejico                     # wants to replace
    # 9   America                    # wants to replace
    # 10  U.S.A.                     # wants to replace
    # 11  UsA of A                   # wants to replace
    # 12  cAnada                     # wants to replace
    # 13  u. s. a.                   # wants to replace
    # 14  United States of America

    # Expected outcome:
    d = {'country': ['United States of America','United States of America','Canada','Canada','United States of America','United States of America','Mexico','Mexico','Mexico', 'United States of America','United States of America','United States of America','Canada','United States of America','United States of America']}
    df = pd.DataFrame(data=d)
    df
    #     country
    # 0   United States of America   # replaced
    # 1   United States of America   # replaced
    # 2   Canada                     # replaced
    # 3   Canada                     # replaced
    # 4   United States of America   # replaced
    # 5   United States of America   # replaced
    # 6   Mexico                     # replaced
    # 7   Mexico                     # replaced
    # 8   Mexico                     # replaced
    # 9   United States of America   # replaced
    # 10  United States of America   # replaced
    # 11  United States of America   # replaced
    # 12  Canada                     # replaced
    # 13  United States of America   # replaced
    # 14  United States of America

One thing I tried was to create a DataFrame named correct_countries_df that contains the correct country names, and to use it like this:

    from fuzzywuzzy import process

    df['country_BestMatch'] = df['country'].map(lambda x: process.extractOne(x, correct_countries_df['country'])[0])

But I can't seem to get it to work.
Any ideas?
Thanks in advance!

9lowa7mx  #1

If your correct_countries_df looks like this:

    >>> correct_countries_df
                        country
    0  United States of America
    1                    Canada
    2                    Mexico

then your code is correct:

    >>> df['country'].map(lambda x: process.extractOne(x, correct_countries_df['country'])[0])
    0     United States of America
    1     United States of America
    2                       Canada
    3                       Canada
    4     United States of America
    5     United States of America
    6                       Mexico
    7                       Mexico
    8                       Mexico
    9     United States of America
    10    United States of America
    11    United States of America
    12                      Canada
    13    United States of America
    14    United States of America
    Name: country, dtype: object
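
With about 500k rows, many of which probably repeat the same spelling, it may be cheaper to match each distinct value once and then map the results back onto the column. Below is a minimal sketch of that idea, not a definitive implementation. `build_mapping` and its `cutoff` parameter are hypothetical names introduced for illustration. To keep it runnable without extra installs, it uses the standard library's `difflib.get_close_matches` in place of fuzzywuzzy, so its scores and results will differ from those of `process.extractOne`. In particular, abbreviations such as 'U.S.A.' should fall below the cutoff here and stay unchanged.

```python
import difflib


def build_mapping(values, choices, cutoff=0.6):
    """Map each distinct raw value to its closest canonical name.

    Hypothetical helper: values with no candidate scoring at least
    `cutoff` are left unchanged rather than forced onto a wrong name.
    """
    # Compare case-insensitively, but return the canonical spelling.
    canonical = {c.lower(): c for c in choices}
    mapping = {}
    for value in set(values):  # each distinct spelling is matched only once
        hits = difflib.get_close_matches(
            value.strip().lower(), list(canonical), n=1, cutoff=cutoff
        )
        mapping[value] = canonical[hits[0]] if hits else value
    return mapping


choices = ["United States of America", "Canada", "Mexico"]
raw = ["Cnda", "canada", "cAnada", "Mexio", "mexico", "Mejico", "canada"]

mapping = build_mapping(raw, choices)
cleaned = [mapping[v] for v in raw]
print(cleaned)
```

With the DataFrame from the question, the same mapping could be built with `build_mapping(df['country'].unique(), correct_countries_df['country'])` and applied with `df['country'].map(mapping)`. To keep fuzzywuzzy's scoring, the `difflib` call inside the loop could be swapped for `process.extractOne`, which still runs once per distinct value instead of once per row.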