Scrapy Link Extractor ScraperApi Integration

Asked by vbopmzt1 on 2022-11-09

I am trying to extract links from a web page, but I have to use a proxy service. When I use the proxy service, the links are not extracted correctly: the extracted links are missing the https://www.homeadvisor.com part. They use api.scraperapi.com as the domain instead of the site's own domain. How can I fix this?

ykejflvf (answer 1):

It looks like ScraperAPIClient requires you to use its specific syntax, client.scrapyGet(url=...), for every request. However, since you are using a CrawlSpider with a link extractor, Scrapy sends the follow-up requests automatically in its usual way, so those requests get blocked. You are probably better off extracting all the links yourself and then filtering the ones you want to follow.
For example:

import scrapy
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(scrapy.Spider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    # scrapyGet() wraps the target URL in a proxied URL on api.scraperapi.com
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    def parse(self, response):
        print(response)
        # re-attach the site domain to relative hrefs; absolute URLs pass through unchanged
        links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
        yield {"links": list(set(links))}

This will produce:

[
  {
    "links": [
      "https://www.homeadvisor.com/rated.TapConstructionLLC.42214874.html",
      "https://www.homeadvisor.com#quote=42214874",
      "https://www.homeadvisor.com/emc.Drywall-Plaster-directory.-12025.html",
      "https://www.linkedin.com/company/homeadvisor/",
      "https://www.homeadvisor.com/c.Additions-Remodeling.Philadelphia.PA.-12001.html",
      "https://www.homeadvisor.com/login",
      "https://www.homeadvisor.com/task.Major-Home-Repairs-General-Contractor.40062.html",
      "https://www.homeadvisor.com/near-me/home-addition-builders/",
      "https://www.homeadvisor.com/c.Additions-Remodeling.Lawrenceville.GA.-12001.html",
      "https://www.homeadvisor.com/near-me/carpentry-contractors/",
      "https://www.homeadvisor.com/emc.Roofing-directory.-12061.html",
      "https://www.homeadvisor.com/c.Doors.Atlanta.GA.-12024.html",
      "https://www.homeadvisor.com#quote=20057351",
      "https://www.homeadvisor.com/near-me/deck-companies/",
      "https://www.homeadvisor.com/tloc/Atlanta-GA/Bathroom-Remodel/",
      "https://www.homeadvisor.com/c.Additions-Remodeling.Knoxville.TN.-12001.html",
      "https://www.homeadvisor.com/xm/35317287/task-selection/-12001?postalCode=30301",
      "https://www.homeadvisor.com/category.Additions-Remodeling.12001.html",
      "https://www.homeadvisor.comtel:4042672949",
      "https://www.homeadvisor.com/rated.DCEnclosuresInc.16852798.html",
      "https://www.homeadvisor.com#quote=16721785",
      "https://www.homeadvisor.com/near-me/bathroom-remodeling/",
      "https://www.homeadvisor.com/near-me",
      "https://www.homeadvisor.com/emc.Heating-Furnace-Systems-directory.-12040.html",
      "https://pro.homeadvisor.com/r/?m=sp_pro_center&entry_point_id=33522463",
      "https://www.homeadvisor.com/r/hiring-a-home-architect/",
      "https://www.homeadvisor.com#quote=119074241",
      "https://www.homeadvisor.comtel:8669030759",
      "https://www.homeadvisor.com/rated.SilverOakRemodel.78475581.html#ratings-reviews",
      "https://www.homeadvisor.com/emc.Tree-Service-directory.-12074.html",
      "https://www.homeadvisor.com/task.Bathroom-Remodel.40129.html",
      "https://www.homeadvisor.com/rated.G3BuildersLLC.71091804.html",
      "https://www.homeadvisor.com/sp/horizon-remodeling-construction",
      "https://www.homeadvisor.com/near-me/fence-companies/",
      "https://www.homeadvisor.com/emc.Gutters-directory.-12038.html",
      "https://www.homeadvisor.com/c.GA.html#topcontractors",
      ...
      ...
      ...
    ]
  }
]

The actual output is close to 400 links...
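The links end up on the wrong domain in the first place because client.scrapyGet() returns a proxy URL hosted on api.scraperapi.com, so the response URL (and every relative href resolved against it) points at the proxy host rather than homeadvisor.com. Printing the proxied URL makes this visible (a sketch; the exact query-string layout produced by the SDK may differ):

from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")
# prints something roughly like:
# http://api.scraperapi.com/?api_key=...&url=https%3A%2F%2Fwww.homeadvisor.com%2F
print(client.scrapyGet(url='https://www.homeadvisor.com/'))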
然后你可以使用某种过滤来决定你想跟随哪些链接,并使用相同的api sdk语法来跟随它们。应用某种过滤系统也将减少发送的请求数量,这将节省api调用,这也将保存你的钱。
For example:

def parse(self, response):
    print(response)
    links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
    yield {"links": list(set(links))}
    # some filtering process
    for link in links:
        # wrap each followed link in a proxy URL so it is not blocked
        yield scrapy.Request(client.scrapyGet(url=link))
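As a concrete version of that filtering step (a sketch; keeping only the '/rated.' profile URLs is an assumption that mirrors the allow="/rated" rule in the update below):

def parse(self, response):
    # re-attach the site domain to relative hrefs, as before
    links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
    # hypothetical filter: only follow company profile ("rated") pages
    profile_links = [link for link in set(links) if '/rated.' in link]
    for link in profile_links:
        yield scrapy.Request(client.scrapyGet(url=link))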

UPDATE:
Try this...

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlencode

APIKEY = "67e5e7755771b9abf8062e595dd5cc2a"  # <- your api key
APIDOMAIN = "http://api.scraperapi.com/"
DOMAIN = 'https://www.homeadvisor.com/'

def get_scraperapi_url(url):
    # wrap a target URL in a ScraperAPI proxy URL
    payload = {'api_key': APIKEY, 'url': url}
    proxy_url = APIDOMAIN + '?' + urlencode(payload)
    return proxy_url

def process_links(links):
    # the extracted hrefs are resolved against api.scraperapi.com, so cut off
    # everything before 'rated', re-attach the real domain, then wrap the
    # result in a proxy URL again before it is requested
    for link in links:
        i = link.url.index('rated')
        link.url = DOMAIN + link.url[i:]
        link.url = get_scraperapi_url(link.url)
    return links

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [get_scraperapi_url('https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True, process_links=process_links)
    ]

    def parse_page(self, response):
        company_name = response.xpath("//h1[contains(@class,'@w-full @text-3xl')]/text()").get()
        yield {
            "company_name": company_name
        }
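Assuming the spider is saved as sip2.py (the file and output names here are just placeholders), you can run it directly with Scrapy's runspider command and export the results to JSON:

scrapy runspider sip2.py -o results.json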
