pandas 缓慢的代码;在列中搜索关键字,然后相应地更改整个单元格的值

s71maibg  于 2023-06-28  发布在  其他
关注(0)|答案(2)|浏览(92)

我一直试图通过处理一个小的招聘数据集来学习Python数据分析。
数据集表由三列组成;职务、国家和职务描述。

    • 职务**列中的某些单元格包含如下值:
  • "急需数据分析师-一年合同"
  • "需要初级数据分析师,心胸宽广者优先"
  • “高级数据分析职位,两个空缺,会游泳者优先”

我想把它改成一些简单的东西,可以像这样聚合:* "数据分析师"*
在大量搜索和浏览numpy和pandas文档后,我找不到任何函数的唯一目的是这样做,我发现的所有函数都是按照用新的表达式替换found表达式的方式来做的,所以我想无论如何我都必须使用if语句,所以我做了下一节所示的代码。但是我觉得它太慢太长了,一定有更好的方法来做这件事对吗?

***TLDR;代码又慢又长,因为我使用了太多的if,不知道还能做什么。

这就是代码:
导入(我使用谷歌jupyter笔记本)

import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

读取csv并删除一些列(总行数约为40k)

jobscsv = pd.read_csv("/content/drive/MyDrive/allJobs.csv")

jobscsv.head

jobscsv_modified = jobscsv.drop(['Date-Posted', 'Salary', 'Company', 'Job-Type'], axis=1)

jobscsv_modified.head

这是有问题的部分

success_counter = 0

for i in range(0, len(jobscsv_modified)):

    if "Data Analyst" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Analyst'
      success_counter = success_counter + 1

    if "Data Analysis" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Analyst'
      success_counter = success_counter + 1

    if "Data Analytics" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Analyst'
      success_counter = success_counter + 1
##
    if "Data Scientist" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Data Science" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Data Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Data Architect" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Machine Learning" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if " AI " in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Artificial Intelligence" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Database" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Database Engineer'
      success_counter = success_counter + 1

    if "Data Entry" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Entry'
      success_counter = success_counter + 1
##
    if "Customer Service" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Customer Service'
      success_counter = success_counter + 1

    if "Sales" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Sales'
      success_counter = success_counter + 1
##
    if "Software" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Programmer'
      success_counter = success_counter + 1

    if "Web Development" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Programmer'
      success_counter = success_counter + 1

    if "Stack" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Programmer'
      success_counter = success_counter + 1
##
    if "Volunteer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Volunteer Work'
      success_counter = success_counter + 1

    if "Laboratory" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Research'
      success_counter = success_counter + 1

    if "Research" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Research'
      success_counter = success_counter + 1

    if "PhD" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'PhD'
      success_counter = success_counter + 1

    if "Teacher" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Lecturer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Principal" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Dean" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "School" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Student" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Instructor" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Postdoctoral" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Research'
      success_counter = success_counter + 1

##

    if "Mechanical Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Mechanical Engineer'
      success_counter = success_counter + 1

    if "Industrial Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Industrial Engineer'
      success_counter = success_counter + 1

    if "Mechatronics Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Mechatronics Engineer'
      success_counter = success_counter + 1

    if "Electrical Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Electrical Engineer'
      success_counter = success_counter + 1

    if "Civil Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Civil Engineer'
      success_counter = success_counter + 1

    if "Chemical Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Chemical Engineer'
      success_counter = success_counter + 1

    if "Process Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Process Engineer'
      success_counter = success_counter + 1

    if "Lab Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Research'
      success_counter = success_counter + 1

我已经到了一个地步,我相当有信心没有人这样做,哈哈。任何关于改进这一点的投入都是高度赞赏的。非常感谢您提前提供的所有宝贵意见。

e5njpo68

e5njpo681#

我会使用字典来减少所有这些if s,然后在第一个匹配时减少break

title_map = {
    'Data Analyst': 'Analyst', 'Data Analysis': 'Analyst', 'Data Analytics': 'Analyst',
    'Data Scientist': 'Scientist', 'Data Science': 'Scientist', 'Data Engineer': 'Scientist',
    'Data Architect': 'Scientist', 'Machine Learning': 'Scientist', 'AI': 'Scientist',
    'Artificial Intelligence': 'Scientist', 'Database': 'Engineer', 'Data Entry': 'Entry',
    'Customer Service': 'Service', 'Sales': 'Sales', 'Software': 'Programmer',
    'Web Development': 'Programmer', 'Stack': 'Programmer', 'Volunteer': 'Work',
    'Laboratory': 'Research', 'Research': 'Research', 'PhD': 'PhD', 'Teacher': 'Education',
    'Lecturer': 'Education', 'Principal': 'Education', 'Dean': 'Education',
    'School': 'Education', 'Student': 'Education', 'Instructor': 'Education',
    'Postdoctoral': 'Research', 'Mechanical Engineer': 'Engineer',
    'Industrial Engineer': 'Engineer', 'Mechatronics Engineer': 'Engineer',
    'Electrical Engineer': 'Engineer', 'Civil Engineer': 'Engineer',
    'Chemical Engineer': 'Engineer', 'Process Engineer': 'Engineer',
    'Lab Engineer': 'Research'}

success_counter = 0

for i in range(0, len(jobscsv_modified)):
    
    for key in title_map:
        if key in jobscsv_modified['Job-Title'][i]:
            jobscsv_modified['Job-Title'][i] = title_map[key]
            success_counter += 1
            break
tv6aics1

tv6aics12#

假设您实际上不需要跟踪成功,您可以使用replace方法。对于replace方法,您可以传递一个替换字典,其中键是要查找的内容,值是要替换的内容。因为你想替换整个字符串,如果它包含关键字,我们可以使用regex
我们在这里使用的正则表达式是(^.*<word>.*$)

  • (...):这使所有内容成为一个组。
  • ^:匹配单词的开头。
  • .*.表示“匹配任何字符”,*表示“匹配零个或多个字符”。这意味着它将匹配零个或多个任何字符。
  • <word>:这是你想让正则表达式搜索的单词,即。“PhD”或“Machine Learning”。
  • $:匹配单词的结尾。

因为在每个替换的周围都写正则表达式会很麻烦,所以我们将编写一个函数来将替换转换为正则表达式形式。

regexify = lambda d: {f"(^.*{key}.*$)":value for key, value in d.items()}

现在你要这么做
1.创建一个替换的字典,以及用什么来替换它们,例如。{"Data Analyst":"Data Analyst", "Data Analysis":"Data Analyst", ..., "Software":"Programmer",...}
1.通过将字典传递给函数来“正则化”它。
1.使用regex=True调用replace方法。您也可以选择使用inplace=True
示例:

import pandas as pd
import random

# I put some random characters before and after the word to simulate your data
old = ["salkd old 1 asfgds", "asdfa old 2 ffdz", "afsfd old 3 asfds"] 

df = pd.DataFrame({"A":random.choices(old, k=12)})
print(df)

replacements = {"old 1":"new 1", "old 2":"new 2", "old 3":"new 3"}
regexify = lambda d: {f"(^.*{key}.*$)":value for key, value in d.items()}
replacements_regexed = regexify(replacements)

df["A"] = df["A"].replace(replacements_regexed, regex=True)
# df["A"].replace(replacements_regexed, regex=True, inplace=True)  # same result
print(df)

输出:

# Before replacement
                     A
0     asdfa old 2 ffdz
1   salkd old 1 asfgds
2     asdfa old 2 ffdz
3    afsfd old 3 asfds
4     asdfa old 2 ffdz
5    afsfd old 3 asfds
6     asdfa old 2 ffdz
7     asdfa old 2 ffdz
8     asdfa old 2 ffdz
9   salkd old 1 asfgds
10    asdfa old 2 ffdz
11   afsfd old 3 asfds

# After replacement
        A
0   new 2
1   new 1
2   new 2
3   new 3
4   new 2
5   new 3
6   new 2
7   new 2
8   new 2
9   new 1
10  new 2
11  new 3

相关问题