regex: How to extract experience duration from a resume using Python

kupeojn6  posted on 2023-05-30  in Python

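I wrote logic to extract experience dates from a resume. I can already extract experience written in these formats:
01/2017 - 04/2022
01/07/2017 - 31/07/2017
March 2017 - July 2022
Here is the code: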

import re
from collections import defaultdict

from dateparser.search import search_dates  # assumed source of search_dates


# Function name and signature are assumed; the original shows only the body.
def extract_skill_datespans(tokens, skills):
    # Assumed: maps each datespan to the set of skills found under it.
    skillset = defaultdict(set)
    cur_datespan = None
    next_first_date = None
    delimeter_count = 0

    for ptoken, token in zip(tokens, tokens[1:]):
        token = str(token).lower().strip()
        ptoken = str(ptoken).lower().strip()
        tokenpair = token + " " + ptoken
        # find datespans
        if re.search(r"\d+", token) is not None:
            dates = search_dates(tokenpair, settings={
                                 'REQUIRE_PARTS': ['month', 'year']}) or []
        else:
            dates = []
        for date in dates:
            if next_first_date is None:
                # first date of a new span
                next_first_date = date[1]
                delimeter_count = 0
            elif delimeter_count < 6:
                # second date is close enough to the first: close the span
                cur_datespan = (next_first_date, date[1])
                next_first_date = None
            else:
                # too far from the previous date: start a new span
                next_first_date = date[1]
                delimeter_count = 0
        if delimeter_count > 50:
            next_first_date = None
            cur_datespan = None
        delimeter_count += len(token.split(" "))
        # find skill and add to dict with associated datespan
        if token in skills:
            skillset[cur_datespan].add(token)
        elif (ptoken + " " + token) in skills:
            skillset[cur_datespan].add(ptoken + " " + token)

    skilldict = {}
    for datespan, span_skills in skillset.items():
        for skill in span_skills:
            if skill not in skilldict:
                skilldict[skill] = []
            if datespan is not None and datespan[1].month - datespan[0].month > 0:
                skilldict[skill].append(datespan)

    return skilldict

But I can't extract experience written in formats such as these:
March-July 2020
March 2020 - Current/Present
01/07/2017-31/07/2017 (date format "day_first")
2020-2021
From/Since 2020
From March 2020 to July 2022

ecfsfe2w1#

You can use re.findall as follows:

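import re

resumes = ['01/2017 - 04/2022',
           '01/07/2017 - 31/07/2017',
           'March 2017 - July 2022',
           'March-July 2020',
           'March 2020 - Current/Present',
           '01/07/2017-31/07/2017',
           '2020-2021',
           'From/Since 2020',
           'From March 2020 to July 2022']

# A month name or up to two "dd/" or "mm/" groups, an optional "-" or space,
# then a four-digit year. Oct(ober) was missing from the original alternation.
pattern = r'(((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|June?|July?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)|(\d{1,2}\/){0,2})[- ]?\d{4})'

for resume in resumes:
    res = re.findall(pattern, resume)
    if len(res) > 1:
        # each findall result is a tuple of groups; index 0 is the full date
        print('from', res[0][0], 'to', res[1][0])
    else:
        # fewer than two dates found: print the line unchanged
        print('->', resume)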
Output:

from 01/2017 to 04/2022
from 01/07/2017 to 31/07/2017
from March 2017 to July 2022
-> March-July 2020
-> March 2020 - Current/Present
from 01/07/2017 to 31/07/2017
from 2020 to -2021
-> From/Since 2020
from March 2020 to July 2022
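
The output above leaves three cases unresolved ("March-July 2020", "Current/Present" and "From/Since 2020") and keeps a stray "-" in "-2021". One possible way to cover them is sketched below. This is not the answerer's method: the names MONTH, DATE, OPEN_END, RANGE, SINCE and extract_span are hypothetical, and it assumes English month names and that an open-ended span should be reported with the literal end "present".

```python
import re

MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
# A date is "Month YYYY", "dd/mm/YYYY", "mm/YYYY" or a bare "YYYY".
DATE = rf"(?:{MONTH}\s+\d{{4}}|(?:\d{{1,2}}/){{0,2}}\d{{4}})"
OPEN_END = r"(?:current|present|now)"

# "<date or bare month> - <date or open end>", with "-" or "to" as separator.
RANGE = re.compile(
    rf"(?P<start>{DATE}|{MONTH})\s*(?:-|to)\s*(?P<end>{DATE}|{OPEN_END})",
    re.IGNORECASE,
)
# "From 2020" / "Since March 2020" with no end date.
SINCE = re.compile(rf"(?:from|since)\s+(?P<start>{DATE})", re.IGNORECASE)


def extract_span(text):
    """Return (start, end) as strings, or None if no span is found."""
    m = RANGE.search(text)
    if m:
        start, end = m.group("start"), m.group("end")
        year = re.search(r"\d{4}", end)
        # "March-July 2020": a bare start month borrows the year of the end date.
        if year and not re.search(r"\d{4}", start):
            start = f"{start} {year.group()}"
        return start, end
    m = SINCE.search(text)
    if m:
        # Assumption: a span without an end date is still ongoing.
        return m.group("start"), "present"
    return None


for line in ["March-July 2020", "March 2020 - Current/Present",
             "2020-2021", "From/Since 2020"]:
    print(line, "=>", extract_span(line))
```

The range pattern is tried first so that "From March 2020 to July 2022" is read as a closed span; the "from/since" pattern is only a fallback for lines with a single date.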

icomxhvb2#

import datefinder
import re

# `text` is the resume text; it is not defined in the original answer.
pattern = r'(\d{1,2}\s?\d{4})|(\d{4}\s?\d{1,2})|(\d{4})'

# Find all matches in the text
matches = re.findall(pattern, text)

# Extract the matched durations
duration = []
for match in matches:
    for group in match:
        if group:
            duration.append(group)
print(duration)

# Initialize an empty list to store the extracted month and year values
extracted_dates = []

# Iterate over the elements in the duration list
for item in duration:
    matches = datefinder.find_dates(item)

    # Check if any matches were found
    for match in matches:
        # Extract the month and year from the match
        month = match.month
        year = match.year

        # Append the extracted month and year to the extracted_dates list
        extracted_dates.append((month, year))

# Get the unique years from the extracted_dates list
unique_years = set(year for month, year in extracted_dates)

# Print the extracted month and year values for each unique year
for year in unique_years:
    has_month = False
    for month, y in extracted_dates:
        if y == year:
            has_month = True
            print(f"Year: {year}, Month: {month if month else '-'}")
    if not has_month:
        print(f"Year: {year}, Month: -")
Output:
Year: 2017, Month: 2
Year: 2019, Month: 4
Year: 2019, Month: 4
Year: 2021, Month: 3
Year: 2021, Month: 2
Year: 2022, Month: 7
Year: 2022, Month: 6
Year: 2013, Month: 12
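
Once start and end dates have been extracted, the duration itself still has to be computed, and the year must be part of that calculation (a plain month difference such as 4 - 1 would treat 01/2017 to 04/2022 as three months). The sketch below uses only the standard library. The names FORMATS, parse_date and months_between are hypothetical, the format list assumes day-first numeric dates and English month names, and a missing end date is assumed to mean "today".

```python
from datetime import date, datetime

# Tried in order; "%d/%m/%Y" comes first so 01/07/2017 is read as 1 July 2017.
FORMATS = ("%d/%m/%Y", "%m/%Y", "%B %Y", "%b %Y", "%Y")


def parse_date(text):
    """Parse one extracted date string, returning a date or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def months_between(start, end=None):
    """Whole calendar months from start to end (end defaults to today)."""
    if end is None:
        end = date.today()
    return (end.year - start.year) * 12 + (end.month - start.month)


start = parse_date("01/2017")
end = parse_date("04/2022")
print(months_between(start, end))  # 5 years * 12 + 3 months = 63
```

Passing None as the end date covers spans such as "March 2020 - Present", at the cost of making the result depend on the day the code is run.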
