How to handle large-scale web scraping with scrapy?

wxclj1h5 asked on 2022-11-09

Situation:

I recently started doing web scraping with Selenium and Scrapy. I'm working on a project where I have a CSV file containing 42,000 zip codes; my job is to take each zip code, enter it on this site, and scrape all of the results.

Problem:

The problem is that, to do this, I have to keep clicking the "Load More" button until all the results are displayed, and only once that is done can I collect the data.
This might not be a big issue on its own, but it takes about 2 minutes per zip code, and I have 42,000 of them to get through.

Code:

    import scrapy
    from numpy.lib.npyio import load
    from selenium import webdriver
    from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException, ElementNotSelectableException, NoSuchElementException, StaleElementReferenceException
    from selenium.webdriver.common.keys import Keys
    from items import CareCreditItem
    from datetime import datetime
    import os

    from scrapy.crawler import CrawlerProcess
    global pin_code
    pin_code = input("enter pin code")

    class CareCredit1Spider(scrapy.Spider):

        name = 'care_credit_1'
        start_urls = ['https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty//?Sort=D&Radius=75&Page=1']

        def start_requests(self):

            directory = os.getcwd()
            options = webdriver.ChromeOptions()
            options.headless = True

            options.add_experimental_option("excludeSwitches", ["enable-logging"])
            path = (directory+r"\\Chromedriver.exe")
            driver = webdriver.Chrome(path,options=options)

            #URL of the website
            url = "https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty/" +pin_code + "/?Sort=D&Radius=75&Page=1"
            driver.maximize_window()
            #opening link in the browser
            driver.get(url)
            driver.implicitly_wait(200)

            try:
                cookies = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
                cookies.click()
            except:
                pass

            i = 0
            loadMoreButtonExists = True
            while loadMoreButtonExists:
                try:
                    load_more =  driver.find_element_by_xpath('//*[@id="next-page"]')
                    load_more.click()    
                    driver.implicitly_wait(30)
                except ElementNotInteractableException:
                    loadMoreButtonExists = False
                except ElementClickInterceptedException:
                    pass
                except StaleElementReferenceException:
                    pass
                except NoSuchElementException:
                    loadMoreButtonExists = False

            try:
                previous_page = driver.find_element_by_xpath('//*[@id="previous-page"]')
                previous_page.click()
            except:
                pass

            name = driver.find_elements_by_class_name('dl-result-item')
            r = 1
            temp_list=[]
            j = 0
            for element in name:
                link = element.find_element_by_tag_name('a')
                c = link.get_property('href')
                yield scrapy.Request(c)

        def parse(self, response):
            item = CareCreditItem()
            item['Practise_name'] = response.css('h1 ::text').get()
            item['address'] = response.css('.google-maps-external ::text').get()
            item['phone_no'] = response.css('.dl-detail-phone ::text').get()
            yield item
    now = datetime.now()
    dt_string = now.strftime("%d/%m/%Y")
    dt = now.strftime("%H-%M-%S")
    file_name = dt_string+"_"+dt+"zip-code"+pin_code+".csv"
    process = CrawlerProcess(settings={
        'FEED_URI' : file_name,
        'FEED_FORMAT':'csv'
    })
    process.crawl(CareCredit1Spider)
    process.start()
    print("CSV File is Ready")

items.py

    import scrapy

    class CareCreditItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        Practise_name = scrapy.Field()
        address = scrapy.Field()
        phone_no = scrapy.Field()

Question:

Essentially, my question is simple: is there a way to optimize this code so it runs faster, or is there some other way to handle the scraping so it doesn't take forever?


vh0rcniy1#

Since the site loads its data dynamically from an API, you can retrieve the data directly from that API, which will speed things up considerably. I would still add a wait between requests, though, to avoid hitting rate limits.

    import requests
    import time
    import pandas as pd

    zipcode = '00704'
    radius = 75
    url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page=1'

    # Fetch the first page to get the initial results and the total page count
    req = requests.get(url)
    r = req.json()
    data = r['results']

    # Fetch the remaining pages, pausing between requests to avoid rate limits
    for i in range(2, r['maxPage'] + 1):
        url = f'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}&PracticePhone=&Profession=&location={zipcode}&Page={i}'
        req = requests.get(url)
        r = req.json()
        data.extend(r['results'])
        time.sleep(1)

    # Write everything to a CSV (no "/" in the timestamp, so the file name stays valid)
    df = pd.DataFrame(data)
    df.to_csv(f'{pd.Timestamp.now().strftime("%d-%m-%Y_%H-%M-%S")}zip-code{zipcode}.csv')
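
For the full job you would wrap this in a loop over all 42,000 zip codes from your CSV. A minimal sketch, assuming an input file called zipcodes.csv with a column named zip_code (both names are placeholders for whatever your file actually uses):

    import time
    import pandas as pd
    import requests

    # Hypothetical input file and column name -- adjust to match your CSV
    zip_codes = pd.read_csv('zipcodes.csv')['zip_code'].astype(str)
    radius = 75

    def fetch_zip(zipcode):
        """Collect every paginated result for one zip code from the locator API."""
        base = ('https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService'
                f'&Zip={zipcode}&City=&State=&Lat=&Long=&Sort=D&Radius={radius}'
                f'&PracticePhone=&Profession=&location={zipcode}&Page={{}}')
        first = requests.get(base.format(1)).json()
        results = first.get('results', [])
        for page in range(2, first.get('maxPage', 1) + 1):
            results.extend(requests.get(base.format(page)).json().get('results', []))
            time.sleep(1)  # stay polite and avoid rate limits
        return results

    all_results = []
    for code in zip_codes:
        rows = fetch_zip(code)
        for row in rows:
            row['zip_code'] = code  # remember which query produced the row
        all_results.extend(rows)

    pd.DataFrame(all_results).to_csv('care_credit_all_zipcodes.csv', index=False)

Even with a short pause per page this will still take a while for 42,000 codes, which is where the parallelization ideas in the next answer come in.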

ct3nt3jp2#

There are multiple ways to achieve this.

1. Build a distributed system in which you run the spider across several machines, so the crawls run in parallel.

In my opinion this is the better option, because it also gives you a scalable, reusable solution. There are many ways to do this; it usually comes down to dividing the seed list (the zip codes) into several smaller seed lists, so that separate processes each work with their own seed list and the downloads run in parallel. On 2 machines it will be roughly 2x faster, on 10 machines roughly 10x faster, and so on.
To do this I would suggest looking at AWS, namely AWS Lambda, AWS EC2 Instances, or even AWS Spot Instances; I have worked with all of these before and they are not hard to work with.

2. Alternatively, if you want to run the process on a single machine, look at multithreading with Python, which can help you run the work in parallel on that machine (a rough sketch follows this list).
3. Another option, especially if this is a one-off process, is to run it with plain requests only. That alone may speed it up, but with a large seed list it is usually still faster to develop a process that runs in parallel.
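
As a rough illustration of option 2, here is a minimal sketch of fanning the zip-code list out over a thread pool on a single machine. The fetch_zip helper, the input file and column name, and the worker count are all assumptions; fetch_zip could be, for example, the requests-based function sketched under the first answer, placed in a hypothetical module of your own:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    import pandas as pd

    # Assumed helper: fetch_zip(zipcode) returns a list of result dicts for one
    # zip code. The module name is hypothetical.
    from my_api_client import fetch_zip

    zip_codes = pd.read_csv('zipcodes.csv')['zip_code'].astype(str).tolist()  # placeholder file/column

    all_results = []
    # A handful of workers is usually enough; too many will just trip rate limits.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(fetch_zip, code): code for code in zip_codes}
        for future in as_completed(futures):
            code = futures[future]
            try:
                rows = future.result()
            except Exception as exc:
                print(f'{code} failed: {exc}')
                continue
            for row in rows:
                row['zip_code'] = code  # keep track of which query produced the row
            all_results.extend(rows)

    pd.DataFrame(all_results).to_csv('care_credit_parallel.csv', index=False)

For the distributed option (1), you would instead slice zip_codes into one sub-list per machine or per Lambda invocation and run the same worker against each slice.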


bgtovc5b3#

While RJ Adriaansen's approach will work in this particular case, I want to highlight that Scrapy is the preferred framework for this kind of task. I will therefore post a solution built with Scrapy, along with some additional options for using headless browsers as an API to improve the speed and scale of JS-heavy scraping jobs.
This is how the code looks for this particular example, using Scrapy with direct API calls.


    # -*- coding: utf-8 -*-

    import json

    import scrapy
    from scrapy import Request

    class CodeSpider(scrapy.Spider):
        name = "code"

        def start_requests(self):
            zip_codes = ['00704', '00705', '00706']
            radius = 75

            for zip in zip_codes:
                url = 'https://www.carecredit.com/sites/ContentServer?d=&pagename=CCGetLocatorService&Zip={}&City=&State=&Lat=&Long=&Sort=D&Radius={}&PracticePhone=&Profession=&location={}&Page=1'.format(zip, radius, zip)
                yield Request(url)

        def parse(self, response):
            d = json.loads(response.text)

            # Check if the response returns any results
            if 'results' in d:
                data = d['results']

                for i in data:
                    yield {
                        'Practise_name': i['name'],
                        'address': i['address1'],
                        'phone_no': i['phone'],
                    }

            # Check if the response returns a pagination key, then request every remaining page
            if 'maxPage' in d:
                for page in range(2, d['maxPage'] + 1):
                    url = response.url.replace('Page=1', 'Page={}'.format(page))
                    yield Request(url, callback=self.parse)

The code returns results like this:

    {
      'Practise_name': 'Walgreens 00973',
      'address': 'Eljibaro Ave And Pr 172',
      'phone_no': '(787) 739-4386'
    }

Of course, you can save the output as JSON or CSV, whichever you need; a short example follows the quote below. From the Scrapy documentation:

> ## [Serialization formats](https://docs.scrapy.org/en/latest/topics/feed-exports.html#serialization-formats)
> 
> For serializing the scraped data, the feed exports use the  [Item
> exporters](https://docs.scrapy.org/en/latest/topics/exporters.html#topics-exporters).
> These formats are supported out of the box:
> 
> -   [JSON](https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-json)
>     
> -   [JSON lines](https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-jsonlines)
>     
> -   [CSV](https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-csv)
>     
> -   [XML](https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-xml)
>     
> 
> But you can also extend the supported format through the 
> [`FEED_EXPORTERS`](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEED_EXPORTERS)
> setting.
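
For example, a minimal sketch of driving the spider above from a script and exporting with the FEEDS setting (the output file name is just a placeholder); in recent Scrapy versions, running `scrapy crawl code -O results.csv` inside a Scrapy project does the same thing from the command line:

    from scrapy.crawler import CrawlerProcess

    # Assumes CodeSpider from the snippet above is importable from this script
    process = CrawlerProcess(settings={
        'FEEDS': {
            'care_credit_results.csv': {'format': 'csv'},
            # or: 'care_credit_results.json': {'format': 'json'},
        },
    })
    process.crawl(CodeSpider)
    process.start()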

Here are some API services that support headless browsers and can make your Scrapy spider run faster and scale better on JavaScript-heavy websites:

1. Web Unlocker - waterfall proxy management, JS rendering, CAPTCHA solving, browser-fingerprint optimization. A premium solution.
2. Proxies, JS rendering, offers a free trial and decent documentation.
3. This solution is hard to use and expensive.
4. PhantomJsCloud - proxies, JS rendering, screenshots and more. Offers a free trial; the documentation could be more user-friendly.
5. ScraperApi - proxies, JS rendering, offers a free trial and decent documentation.
6. Scrapy selenium middleware - a free, native middleware that adds a headless browser to Scrapy (a configuration sketch follows this list).
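
For that last option, here is a minimal sketch of how scrapy-selenium is typically wired up. The setting names follow the scrapy-selenium package README, the chromedriver path is a placeholder, and the wait condition reuses the dl-result-item class from the original spider:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy_selenium import SeleniumRequest
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    class RenderedSpider(scrapy.Spider):
        name = 'rendered'

        def start_requests(self):
            # The page is rendered in the headless browser before parse() runs
            yield SeleniumRequest(
                url='https://www.carecredit.com/doctor-locator/results/Any-Profession/Any-Specialty/00704/?Sort=D&Radius=75&Page=1',
                callback=self.parse,
                wait_time=10,
                wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'dl-result-item')),
            )

        def parse(self, response):
            # response.text now contains the JS-rendered HTML
            self.logger.info('Rendered %d results', len(response.css('.dl-result-item')))

    process = CrawlerProcess(settings={
        'SELENIUM_DRIVER_NAME': 'chrome',
        'SELENIUM_DRIVER_EXECUTABLE_PATH': '/path/to/chromedriver',  # placeholder path
        'SELENIUM_DRIVER_ARGUMENTS': ['--headless'],
        'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},
    })
    process.crawl(RenderedSpider)
    process.start()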
