Regex code to find email addresses within an HTML script (web scraping)

Posted by yduiuuwa on 2023-04-22 in Other

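I am trying to extract the phone number, address, and email address from a couple of company websites through web scraping.
My code is as follows: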

import re

import requests
from bs4 import BeautifulSoup

l = 'https://www.zimmermanfinancialgroup.com/about'
address_t = []
phone_num_t = []

# make a request to the link
response = requests.get(l)

soup = BeautifulSoup(response.content, "html.parser")

# raw strings keep \b, \d and \s as regex escapes
phone_regex = r"(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}"
# extract the phone number information
match = soup.find_all(string=re.compile(phone_regex))

if match:
    print("Found the matching string:", match)
else:
    print("Matching string not found")

# extract email address information
mail = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b"

match_a = soup.find_all(string=re.compile(mail))

print(match_a)

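The code above works and correctly extracts the phone numbers, but it fails to detect the email addresses. I have the same problem with another website (https://www.benefitexperts.com/about-us/).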

regex

Source: https://stackoverflow.com/questions/76058487/regex-code-to-find-email-address-within-html-script-webscraping

1 Answer

shyt4zoc1#

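The email address you are looking for is not in the page text. It sits in the href attribute of a tag (typically an <a> link), if present, as a string such as 'mailto:somemail@address.com'. Searching with string= only looks at text nodes, so it never sees that value. Pass href as a keyword argument to find_all (findAll is its older alias) instead, so that it returns every node whose href attribute matches the regex.
See the official BeautifulSoup documentation for more on keyword arguments: https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#the-keyword-arguments
In short: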

match_a = soup.findAll(href=re.compile(mail))
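
To see the difference between the two searches without hitting a live site, here is a minimal sketch. The HTML fragment and the address info@example.com are made up for illustration and are not taken from either website in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a real company page.
html = """
<div>
  <p>Call us at (555) 123-4567</p>
  <a href="mailto:info@example.com">Email us</a>
  <a href="/contact">Contact</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
mail = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b")

# string= searches text nodes only; no text node contains the address.
text_hits = soup.find_all(string=mail)

# href= searches the attribute value, which is where the mailto: link lives.
href_hits = soup.find_all(href=mail)

print(text_hits)                        # []
print([a["href"] for a in href_hits])   # ['mailto:info@example.com']
```

The /contact link is skipped because its href does not match the email pattern.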

Then do some cleanup to extract just the email addresses:

# removeprefix (Python 3.9+) drops only the leading 'mailto:'
match_a = [a['href'].removeprefix('mailto:') for a in match_a]
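
One caveat for this cleanup step: str.strip treats its argument as a set of characters rather than a prefix, so strip('mailto:') can also eat letters from the address itself. The sketch below shows the effect and one possible alternative. The helper name email_from_href and the sample addresses are hypothetical, and dropping a ?subject=... query string is an extra assumption about what the links may contain.

```python
def email_from_href(href: str) -> str:
    """Return the address part of a mailto: link (hypothetical helper)."""
    prefix = "mailto:"
    if href.lower().startswith(prefix):
        href = href[len(prefix):]
    # Drop an optional query string such as ?subject=Hello
    return href.split("?", 1)[0]


href = "mailto:tom@mail.com"

# strip() removes any of the characters m, a, i, l, t, o, : from both ends.
print(href.strip("mailto:"))     # @mail.c

# Removing the prefix as a whole keeps the address intact.
print(email_from_href(href))     # tom@mail.com
print(email_from_href("mailto:info@example.com?subject=Hi"))  # info@example.com
```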

Answered 2023-04-22
