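Here is an example link that I am trying to scrape: https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227
My scraper worked fine until today, so I guess Lowe's has added more protection against bots :(
After some research, I found that I would have to add headers to my web scraper so that it simulates a real user.
Open the developer console -> Network -> XHR/Fetch -> find the JSON file.
Here is my script: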
# -*- coding: utf-8 -*-
import scrapy
from ..items import LowesItem
import re
import pandas as pd
import requests
import json
from scrapy.http import Request
from datetime import date

class LowesSpider(scrapy.Spider):
    name = 'Lowes'

    def start_requests(self):
        HEADERS = {
            'method': 'GET',
            'scheme': 'https',
            'authority': 'content.syndigo.com',
            'Accept': '*/*',
            'Content-Type': 'text/plain',
            'Origin': 'https://lowes.com',
            'Accept-Language': 'en-US,en;q=0.9',
            'Host': 'content.syndigo.com',
            'User-Agent': ' Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15',
            'Referer': 'https://www.lowes.com/',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Cookie': 'sn=0321'
        }
        start_urls = ['https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227']
        for url in start_urls:
            yield Request(url,
                          headers=HEADERS,
                          meta={'dont_merge_cookies': True,
                                'url': url})

    def parse(self, response):
        for item in self.parseLowes(response):
            yield item
        pass

    def parseLowes(self, response):
        item = LowesItem()  # items from items.py
        script_tag = response.xpath('//script[@type="application/ld+json"]/text()').get()  # get js container
        productPrice = json.loads(script_tag)[2]["offers"]["price"]
        productURL = response.url
        url = response.meta['url']
        productSKU = url.split("=")[-1]
        scrapedDate = date.today()
        #item['productName'] = productName  # display product name
        item['productOMS'] = productSKU
        item['productPrice'] = productPrice  # display price and assign to variable
        item['productURL'] = productURL  # display URL
        item['scrapedDate'] = scrapedDate
        yield item
When I run the spider with the scrapy command, I get a 400 response.
1 Answer
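From what I can see of the network traffic, the problem is related to their CDN (Akamai), which is blocking access.
I was able to open your link and see the product in Microsoft Edge (version 107). In my request, the User-Agent was:
So try changing the "User-Agent" key in your code to that value.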
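As a rough illustration of that suggestion, here is a minimal sketch of swapping the User-Agent into an existing header dictionary before it is passed to `Request`. The helper name `build_headers`, the `DROPPED_KEYS` set, and the placeholder `BROWSER_USER_AGENT` are hypothetical and not part of the question or of Scrapy. Dropping the `method`, `scheme`, `authority`, and `Host` keys is an additional assumption on my part: in the question they point at `content.syndigo.com` while the request goes to `www.lowes.com`, and the HTTP client normally derives those values from the URL. Whether this is enough to get past the CDN is not something I can confirm.

```python
# Hypothetical helper, not part of Scrapy: prepares headers for Request(...).

# Keys removed before sending. Assumption: the HTTP client fills these in
# from the URL, so hard-coded values for another host are not needed.
# "user-agent" is listed so that any existing variant of the key is replaced.
DROPPED_KEYS = {"method", "scheme", "authority", "host", "user-agent"}

# Placeholder only: paste the value copied from your own browser's
# network tab here. No real User-Agent string is assumed.
BROWSER_USER_AGENT = "<User-Agent copied from your browser>"


def build_headers(base_headers, user_agent):
    """Return a copy of base_headers with the User-Agent replaced.

    The input dictionary is left unchanged. Header names are compared
    case-insensitively, and surrounding whitespace in user_agent is removed.
    """
    cleaned = {
        name: value
        for name, value in base_headers.items()
        if name.lower() not in DROPPED_KEYS
    }
    cleaned["User-Agent"] = user_agent.strip()
    return cleaned
```

Inside `start_requests`, this could be used as `yield Request(url, headers=build_headers(HEADERS, BROWSER_USER_AGENT), meta=...)`, keeping the rest of the spider as it is.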