Scrapy - 400 response (with headers)

Asked 2022-11-09 by inn6fuwd

Here is an example of a link I am trying to scrape: https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227
My scraper had been working fine until today, so I'm guessing Lowe's added more bot protection. :(
After some research, I found that I would have to add headers to my scraper so that it mimics a real user.
I opened the developer console -> Network -> XHR/Fetch and found the JSON request there, then copied its headers.
Here is my script:


# -*- coding: utf-8 -*-

import json
from datetime import date

import scrapy
from scrapy.http import Request

from ..items import LowesItem


class LowesSpider(scrapy.Spider):
    name = 'Lowes'

    def start_requests(self):
        # Headers copied from the browser's developer console (Network tab)
        HEADERS = {
            'method': 'GET',
            'scheme': 'https',
            'authority': 'content.syndigo.com',
            'Accept': '*/*',
            'Content-Type': 'text/plain',
            'Origin': 'https://lowes.com',
            'Accept-Language': 'en-US,en;q=0.9',
            'Host': 'content.syndigo.com',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15',
            'Referer': 'https://www.lowes.com/',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Cookie': 'sn=0321'
        }

        start_urls = ['https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227']

        for url in start_urls:
            yield Request(url,
                          headers=HEADERS,
                          meta={'dont_merge_cookies': True,
                                'url': url})

    def parse(self, response):
        for item in self.parseLowes(response):
            yield item

    def parseLowes(self, response):
        item = LowesItem()  # item fields are defined in items.py

        # The price sits in the JSON-LD structured-data block embedded in the page
        script_tag = response.xpath('//script[@type="application/ld+json"]/text()').get()
        productPrice = json.loads(script_tag)[2]["offers"]["price"]

        productURL = response.url
        url = response.meta['url']
        productSKU = url.split("/")[-1]  # product id is the last path segment of the URL

        scrapedDate = date.today()

        item['productOMS'] = productSKU
        item['productPrice'] = productPrice
        item['productURL'] = productURL
        item['scrapedDate'] = scrapedDate
        yield item

When I run the spider, I get a 400 back as the response on the command line.
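
One quick way to confirm that the 400 comes from the headers/bot detection rather than from anything Scrapy-specific is to replay the same request with the requests library. This is only a minimal sketch; the variable names and the header subset shown are illustrative, not the exact set used in the spider above.

import requests

URL = 'https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227'

# Minimal subset of the headers from the spider above (illustrative only)
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
                   '(KHTML, like Gecko) Version/15.4 Safari/605.1.15'),
    'Referer': 'https://www.lowes.com/',
    'Accept-Language': 'en-US,en;q=0.9',
}

resp = requests.get(URL, headers=HEADERS)
print(resp.status_code)  # a 400/403 here points at header/bot checks, not at Scrapy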

Answer 1, by mrzz3bfm:

From what I can see of the network traffic, the problem is related to their CDN (Akamai), which is blocking the request.
I was able to open your link and see the product in Microsoft Edge (version 107). In my request, the User-Agent was:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.26

So try changing the 'User-Agent' value in your code to that string.
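
Applied to the spider in the question, that amounts to replacing only the value of the 'User-Agent' key in the HEADERS dict built in start_requests(). The sketch below shows just that change, with the other copied headers left as they were; whether they are needed at all is a separate question, and Akamai may also check other signals.

# In start_requests(), keep the HEADERS dict as before but swap in the Edge
# User-Agent string quoted above:
HEADERS = {
    # ... other headers unchanged ...
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 '
                   'Edg/107.0.1418.26'),
}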
