I have some code that perfectly saves the emails from a single website:
import re
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.powercurbers.com/dealers/?region=13&area=318')
padla = driver.page_source
suka = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
gnida = []
for huj in re.finditer(suka, padla):
    gnida.append(huj.group())
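As a quick sanity check of this approach, here is a minimal, self-contained sketch that runs the same kind of `re.finditer` loop over a sample HTML string. The pattern is a deliberately simplified email regex (the full RFC-style pattern above is stricter but is used the same way), and the sample markup and addresses are made up for illustration:

```python
import re

# Simplified email pattern for illustration only; the full RFC-style
# pattern above is stricter but plugs into finditer identically.
pattern = r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}"

# Stand-in for driver.page_source.
sample_html = (
    '<p>Sales: sales@example.com</p>'
    '<p>Support: support@dealer-site.org</p>'
)

emails = [m.group() for m in re.finditer(pattern, sample_html)]
print(emails)  # ['sales@example.com', 'support@dealer-site.org']
```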
I now want to combine this code with one that fetches all the websites from a Google search. I am facing two problems. First: I can see the browser gets up to 100 results, and the page really does contain 100 results, but the code below only returns 10 websites:
import validators
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.google.com/search?q=%D0%B3%D1%80%D1%83%D0%B7%D0%BE%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B7%D0%BA%D0%B8+%D0%B0%D1%80%D1%85%D0%B0%D0%BD%D0%B3%D0%B5%D0%BB%D1%8C%D1%81%D0%BA+%D0%B0%D1%81%D1%82%D1%80%D0%B0%D1%85%D0%B0%D0%BD%D1%8C&ei=AQhEZNPgBtSF9u8Pop-4qA4&ved=0ahUKEwiT5YuZ8b3-AhXUgv0HHaIPDuUQ4dUDCBA&uact=5&oq=%D0%B3%D1%80%D1%83%D0%B7%D0%BE%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B7%D0%BA%D0%B8+%D0%B0%D1%80%D1%85%D0%B0%D0%BD%D0%B3%D0%B5%D0%BB%D1%8C%D1%81%D0%BA+%D0%B0%D1%81%D1%82%D1%80%D0%B0%D1%85%D0%B0%D0%BD%D1%8C&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIHCCEQoAEQCjIHCCEQoAEQCjoECAAQRzoFCAAQgAQ6BggAEBYQHjoFCCEQoAE6BAghEBVKBAhBGABQmAZYthhgjRxoAHACeACAAaEHiAHOHJIBDTAuMS4wLjEuMi4xLjKYAQCgAQHIAQjAAQE&sclient=gws-wiz-serp")
results_list = driver.find_elements(By.TAG_NAME, 'cite')
for i in range(len(results_list)):
    results_list[i] = results_list[i].text.replace(">", "/").replace("›", "/").replace(" ", "")
    if not validators.url(results_list[i]):
        results_list[i] = ''
results_list = list(filter(None, results_list))
The length of the list is 10. Why? Is there a way to get all of the sites?
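One relevant detail here: Google serves 10 organic results per page by default, so the first page's HTML only contains 10 `cite` elements. A commonly used, but unofficial and not guaranteed, workaround is to add a `num` parameter to the search URL. A hedged sketch of building such a URL with the standard library (the query string is the one from the question; Google may ignore or drop `num` at any time):

```python
from urllib.parse import urlencode

# 'num' is an undocumented Google parameter that often raises the
# per-page result count; it is not guaranteed to keep working.
params = {"q": "грузоперевозки архангельск астрахань", "num": 100}
search_url = "https://www.google.com/search?" + urlencode(params)
print(search_url)
```

The resulting URL can then be passed to `driver.get(search_url)` in place of the long hand-copied URL above.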
Second: how can I write a loop that runs the email scraping for each website? When I write:
gnida = []
import re
for h in results_list:
    padla = driver.page_source
    for huj in re.finditer(suka, padla):
        gnida.append(huj.group())
the gnida list is empty. Any help is greatly appreciated.
1 Answer
Google is harder to scrape than Yahoo, and doing so is technically against its policy. If Yahoo works just as well for you, here is one option for getting the top result links:
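The answer's original code did not survive in this copy. As a hedged sketch of the idea, once a search results page has been loaded (e.g. via `driver.page_source`), the result links can be pulled out with the standard library's HTML parser. The markup below is a made-up stand-in for a search results page; a real Yahoo SERP has a different structure and would need its own selectors:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect absolute href values from all anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

# Stand-in for driver.page_source of a search results page.
sample_page = (
    '<div class="result"><a href="https://example.com/">Example</a></div>'
    '<div class="result"><a href="https://dealer-site.org/">Dealer</a></div>'
)

collector = LinkCollector()
collector.feed(sample_page)
print(collector.links)  # ['https://example.com/', 'https://dealer-site.org/']
```

Each collected link can then be fed back into the email-scraping loop, remembering to call `driver.get(link)` before reading `driver.page_source` for that site.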