Scraping emails from websites with Scrapy

Posted by h9vpoimq on 2022-12-13 in Other

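I have tried several iterations based on other posts, and nothing seems to help or work for my needs.
I have a list of URLs that I want to loop through, extracting every associated URL that contains an email address. I then want to store the URLs and email addresses in a CSV file.
For example, if I go to 10torr.com, the program should find each page under the main URL (e.g. 10torr.com/about) and extract any emails.
A helpful answer would use the user-defined function get_info() listed below. Hard-coding the websites into the Spider itself is not a feasible option, because this will be used by many other people with website lists of different lengths.
Below is a list of 5 example websites, which are in DataFrame format when I run my code. They are saved under the variable small_site.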

Website
    http://10torr.com/
    https://www.10000drops.com/
    https://www.11wells.com/
    https://117westspirits.com/
    https://www.onpointdistillery.com/

Below is the code I am running. The spider appears to run, but nothing is written to my CSV file.

import os
import pandas as pd
import re
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

small_site = site.head()

#%% Start Spider
class MailSpider(scrapy.Spider):

    name = 'email'

    def parse(self, response):

        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 

    def parse_link(self, response):

        for word in self.reject:
            if word in str(response.url):
                return

        html_text = str(response.text)
        mail_list = re.findall('\w+@\w+\.{1}\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)

        df.to_csv(self.path, mode='a', header=False)
        df.to_csv(self.path, mode='a', header=False)

#%% Preps a CSV File
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False
def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return 

    with open(path, 'wb') as file: 
        file.close()

#%% Defines function that will extract emails and enter it into CSV
def get_info(url_list, path, reject=[]):

    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting Google urls...')
    google_urls = url_list

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.start() 

    for i in small_site.Website.iteritems():
        print('Searching for emails...')
        process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
        ##process.start()

        print('Cleaning emails...')
        df = pd.read_csv(path, index_col=0)
        df.columns = ['email', 'link']
        df = df.drop_duplicates(subset='email')
        df = df.reset_index(drop=True)
        df.to_csv(path, mode='w', header=True)

    return df

url_list = small_site
path = 'email.csv'

df = get_info(url_list, path)

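I am not sure where I am going wrong, since I am not receiving any error messages. If you need additional information, just ask. I have been trying to get this working for almost a month, and at this point I feel like I am just banging my head against the wall.
Most of this code was found, after a couple of weeks, in the article "Web scraping to extract contact information— Part 1: Mailing Lists". However, I have not been successful in expanding it to my needs. It worked without issue for one-off runs while incorporating the article's Google search function to get the base URLs.
Thanks in advance for any help you can provide.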

Answer #1 (vxbzzdmp)

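It took a while, but I finally found the answer. Here is how the final solution came together. It works with a changing list of websites, as in the original question.
The final change was very small. I needed to add the user-defined function below.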

def get_urls(io, sheet_name):
    # Read the spreadsheet and return its 'Website' column as a plain list of URLs
    data = pd.read_excel(io, sheet_name=sheet_name)
    urls = data['Website'].to_list()
    return urls
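
One thing to watch for: a spreadsheet column can contain empty cells (which pandas returns as NaN) or addresses typed without a scheme, and Scrapy raises an error for a request URL that has no scheme. Below is a minimal sketch of a cleanup step that could run on the list returned by get_urls. The names normalize_urls and default_scheme are hypothetical and not part of the original code.

```python
def normalize_urls(raw_urls, default_scheme="https"):
    """Return a de-duplicated list of URLs that all carry a scheme.

    Hypothetical helper: skips non-string cells (e.g. NaN from empty
    spreadsheet rows) and blank strings, and prefixes default_scheme
    when a value has no '://' in it.
    """
    urls = []
    seen = set()
    for raw in raw_urls:
        if not isinstance(raw, str):
            continue  # empty spreadsheet cells arrive as float NaN
        url = raw.strip()
        if not url:
            continue
        if "://" not in url:
            url = f"{default_scheme}://{url}"
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Inside get_info this could be applied as google_urls = normalize_urls(get_urls(io, sheet_name)).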

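From here, the get_info() user-defined function needs only a simple modification. We set google_urls in this function to the result of our get_urls function and pass the list in. The complete code for this function is shown below.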

def get_info(io, sheet_name, path, reject=[]):
    
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)
    
    print('Collecting Google urls...')
    google_urls = get_urls(io, sheet_name)
    
    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
    process.start()
    
    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    
    return df

No other changes are needed to run this. Hope this helps.
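
For reference, the cleanup step at the end of get_info keeps one row per email address (drop_duplicates keeps the first occurrence by default). The sketch below shows the same bookkeeping with only the standard library, assuming the rows are already available as (email, link) pairs. write_unique_rows is a hypothetical helper, not part of the answer above.

```python
import csv


def write_unique_rows(path, rows):
    """Write (email, link) rows to path, keeping the first link seen per email."""
    seen = set()
    kept = []
    for email, link in rows:
        if email in seen:
            continue  # same address already recorded with an earlier link
        seen.add(email)
        kept.append((email, link))

    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["email", "link"])
        writer.writerows(kept)
    return kept
```

Because the file is written once with its header, there is no need to append from every spider callback and re-read the file afterwards.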

Answer #2 (wtlkbnrh)

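I modified the script a little and ran the version below from the shell, and it works. It may give you a starting point.
I recommend using the shell, because it always shows errors and other messages during the crawl.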

import re
import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class MailSpider(scrapy.Spider):

    name = 'email'
    start_urls = [
        'http://10torr.com/',
        'https://www.10000drops.com/',
        'https://www.11wells.com/',
        'https://117westspirits.com/',
        'https://www.onpointdistillery.com/',
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 

    def parse_link(self, response):

        html_text = str(response.text)
        # Raw string avoids invalid-escape warnings; the pattern itself is unchanged
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}

        for key in dic.keys():
            yield {
                'email' : dic['email'],
                'link': dic['link'],
            }

Running the crawl from the Anaconda shell with scrapy crawl email -o test.jl produces the following output:

{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/"}
{"email": ["8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress", "bundle@3.2", "fetch@3.0", "bolt@2.3", "5oclock@11wells.com", "5oclock@11wells.com", "5oclock@11wells.com"], "link": "https://www.11wells.com"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=search&keywords="}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=search&keywords="}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/shop?olsPage=cart"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/home"}
{"email": ["8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress", "bundle@3.2", "fetch@3.0", "bolt@2.3", "5oclock@11wells.com", "5oclock@11wells.com", "5oclock@11wells.com"], "link": "https://www.11wells.com"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/home"}
{"email": ["info@ndiscovered.com"], "link": "https://117westspirits.com/117%C2%B0-west-spirits-1"}
...
...
...
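
The output above contains strings such as bundle@3.2 that match the pattern but are not email addresses, and each page emits its whole list in one item (the loop over dic.keys() also emits that item once per key, i.e. twice). Below is a rough sketch of a stricter extraction step, assuming a simplified pattern that requires an alphabetic top-level domain. EMAIL_RE, extract_emails, page_items and blocked_domains are hypothetical names, and the pattern is a heuristic rather than a full address validator (asset names like logo@2x.png would still match).

```python
import re

# Heuristic pattern: local part, '@', dotted host, alphabetic TLD of 2+ letters
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def extract_emails(html_text, blocked_domains=()):
    """Return unique, lower-cased addresses in order of first appearance.

    blocked_domains is an iterable of domain suffixes to drop,
    e.g. ["wixpress.com"] for machine-generated keys.
    """
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(html_text):
        email = match.lower()
        domain = email.split("@", 1)[1]
        if email in seen or domain.endswith(tuple(blocked_domains)):
            continue
        seen.add(email)
        emails.append(email)
    return emails


def page_items(html_text, url, blocked_domains=()):
    """Yield one flat {'email', 'link'} dict per unique address on a page."""
    for email in extract_emails(html_text, blocked_domains):
        yield {"email": email, "link": url}
```

In parse_link, yield from page_items(response.text, response.url) would then produce flat rows, and an output file with a .csv extension (for example scrapy crawl email -o emails.csv) should give a two-column CSV.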

For more information, see the Scrapy documentation.
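
The spider above follows every link it extracts, including links to other sites, while the question only asks for pages under each listed site. Below is a small sketch of a same-site check built on urllib.parse, assuming that a leading www. should be ignored when comparing hosts. same_site and filter_links are hypothetical helpers.

```python
from urllib.parse import urlparse


def _host(url):
    """Lower-cased network location of url, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def same_site(link, start_url):
    """True when link points at the same host as start_url."""
    return _host(link) == _host(start_url)


def filter_links(links, start_url):
    """Keep only the links that stay on the site of start_url."""
    return [link for link in links if same_site(link, start_url)]
```

In parse, the list of extracted links could be passed through filter_links(links, response.url) before the requests are yielded. Scrapy's link extractor also accepts an allow_domains argument, which may be a simpler way to get a similar effect.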
