我一直试图通过处理一个小的招聘数据集来学习Python数据分析。
数据集表由三列组成;职务、国家和职务描述。
- 职务**列中的某些单元格包含如下值:
- "急需数据分析师-一年合同"
- "需要初级数据分析师,心胸宽广者优先"
- “高级数据分析职位,两个空缺,会游泳者优先”
我想把它改成一些简单的东西,可以像这样聚合:* "数据分析师"*
在大量搜索和浏览numpy和pandas文档后,我找不到任何函数的唯一目的是这样做,我发现的所有函数都是按照用新的表达式替换found表达式的方式来做的,所以我想无论如何我都必须使用if语句,所以我做了下一节所示的代码。但是我觉得它太慢太长了,一定有更好的方法来做这件事对吗?
***TLDR;代码又慢又长,因为我使用了太多的if,不知道还能做什么。
这就是代码:
导入(我使用谷歌jupyter笔记本)
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
读取csv并删除一些列(总行数约为40k)
jobscsv = pd.read_csv("/content/drive/MyDrive/allJobs.csv")
jobscsv.head
jobscsv_modified = jobscsv.drop(['Date-Posted', 'Salary', 'Company', 'Job-Type'], axis=1)
jobscsv_modified.head
这是有问题的部分
success_counter = 0
for i in range(0, len(jobscsv_modified)):
if "Data Analyst" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Analyst'
success_counter = success_counter + 1
if "Data Analysis" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Analyst'
success_counter = success_counter + 1
if "Data Analytics" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Analyst'
success_counter = success_counter + 1
##
if "Data Scientist" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Data Science" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Data Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Data Architect" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Machine Learning" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if " AI " in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Artificial Intelligence" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Database" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Database Engineer'
success_counter = success_counter + 1
if "Data Entry" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Entry'
success_counter = success_counter + 1
##
if "Customer Service" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Customer Service'
success_counter = success_counter + 1
if "Sales" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Sales'
success_counter = success_counter + 1
##
if "Software" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Programmer'
success_counter = success_counter + 1
if "Web Development" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Programmer'
success_counter = success_counter + 1
if "Stack" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Programmer'
success_counter = success_counter + 1
##
if "Volunteer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Volunteer Work'
success_counter = success_counter + 1
if "Laboratory" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Research'
success_counter = success_counter + 1
if "Research" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Research'
success_counter = success_counter + 1
if "PhD" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'PhD'
success_counter = success_counter + 1
if "Teacher" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Lecturer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Principal" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Dean" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "School" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Student" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Instructor" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Postdoctoral" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Research'
success_counter = success_counter + 1
##
if "Mechanical Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Mechanical Engineer'
success_counter = success_counter + 1
if "Industrial Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Industrial Engineer'
success_counter = success_counter + 1
if "Mechatronics Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Mechatronics Engineer'
success_counter = success_counter + 1
if "Electrical Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Electrical Engineer'
success_counter = success_counter + 1
if "Civil Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Civil Engineer'
success_counter = success_counter + 1
if "Chemical Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Chemical Engineer'
success_counter = success_counter + 1
if "Process Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Process Engineer'
success_counter = success_counter + 1
if "Lab Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Research'
success_counter = success_counter + 1
我已经到了一个地步,我相当有信心没有人这样做,哈哈。任何关于改进这一点的投入都是高度赞赏的。非常感谢您提前提供的所有宝贵意见。
2条答案
按热度按时间e5njpo681#
我会使用字典来减少所有这些
if
s,然后在第一个匹配时减少break
。tv6aics12#
假设您实际上不需要跟踪成功,您可以使用
replace
方法。对于replace
方法,您可以传递一个替换字典,其中键是要查找的内容,值是要替换的内容。因为你想替换整个字符串,如果它包含关键字,我们可以使用regex。我们在这里使用的正则表达式是
(^.*<word>.*$)
。(...)
:这使所有内容成为一个组。^
:匹配单词的开头。.*
:.
表示“匹配任何字符”,*
表示“匹配零个或多个字符”。这意味着它将匹配零个或多个任何字符。<word>
:这是你想让正则表达式搜索的单词,即。“PhD”或“Machine Learning”。$
:匹配单词的结尾。因为在每个替换的周围都写正则表达式会很麻烦,所以我们将编写一个函数来将替换转换为正则表达式形式。
现在你要这么做
1.创建一个替换的字典,以及用什么来替换它们,例如。
{"Data Analyst":"Data Analyst", "Data Analysis":"Data Analyst", ..., "Software":"Programmer",...}
。1.通过将字典传递给函数来“正则化”它。
1.使用
regex=True
调用replace
方法。您也可以选择使用inplace=True
。示例:
输出: