Why does my spider scrape 700+ items when there are actually only 245?
There are no more than 245 items, yet my spider scrapes 700+, even though I use a for loop to request only selected pages and set CLOSESPIDER_ITEMCOUNT = 244 in settings.py.
Is there a possible fix?
Here is my code:
import scrapy
import json
from ..items import HmsItem
from scrapy.loader import ItemLoader


class HmSpider(scrapy.Spider):
    name = 'hm'
    allowed_domains = ['hm.com']

    def start_requests(self):
        for i in range(36, 252, 36):  # there is a diff of 36 on each next url
            yield scrapy.Request(
                url=f"https://www2.hm.com/en_us/men/new-arrivals/view-all/_jcr_content/main/productlisting.display.json?sort=stock&image-size=small&image=model&offset=0&page-size={i}",
                method='GET',
                callback=self.parse,
            )

    def parse(self, response):
        # with open('initial.json', 'wb') as f:
        #     f.write(response.body)
        json_resp = json.loads(response.body)
        products = json_resp.get('products')
        for product in products:
            loader = ItemLoader(item=HmsItem())
            loader.add_value('title', product.get('title'))
            loader.add_value('articleCode', product.get('articleCode'))
            loader.add_value('category', product.get('category'))
            loader.add_value('src', product.get('image')[0].get('src'))
            loader.add_value('price', product.get('price'))
            loader.add_value('swatchesTotal', product.get('swatchesTotal'))
            loader.add_value('brandName', product.get('brandName'))
            yield loader.load_item()
Here is my settings.py:
BOT_NAME = 'hms'
SPIDER_MODULES = ['hms.spiders']
NEWSPIDER_MODULE = 'hms.spiders'
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 8
FEED = 'json'
FEED_EXPORT_ENCODING = 'utf-8'
CLOSESPIDER_ITEMCOUNT = 244
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
Here is my items.py:
import scrapy


class HmsItem(scrapy.Item):
    title = scrapy.Field()
    articleCode = scrapy.Field()
    category = scrapy.Field()
    src = scrapy.Field()
    price = scrapy.Field()
    swatchesTotal = scrapy.Field()
    brandName = scrapy.Field()
1 Answer
f2uvfpb91#
Since scrapy.Request does not accept params (query-string key/value pairs) as an argument, you cannot paginate this endpoint with a range() for loop the way you did. Your pagination method keeps offset=0 in every request and only grows page-size, so each response re-returns the items of all the smaller responses and the spider yields duplicates: 36 + 72 + 108 + 144 + 180 + 216 = 756 items in total, which is where your "700+" comes from. The way to get the correct output is to send a single request with the total item count (243) injected into the page-size= parameter of the URL.
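The duplication arithmetic above can be checked directly (a quick sanity check, not part of the original answer):

```python
# Page sizes requested by the original loop; offset is always 0,
# so every response repeats the items of all smaller responses.
page_sizes = list(range(36, 252, 36))
print(page_sizes)       # [36, 72, 108, 144, 180, 216]

total_yielded = sum(page_sizes)
print(total_yielded)    # 756 -- the "700+" items the spider emits
```

Note also that the largest page-size requested is 216, so items 217-245 are never fetched at all; the 700+ output is overlapping copies of the first 216 items.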
Working code:
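A minimal sketch of the single-request fix, rebuilding the URL with the same query parameters as the question (the page size of 243 is the value the answer suggests; in practice you would take the real total from the API response):

```python
from urllib.parse import urlencode

BASE = ("https://www2.hm.com/en_us/men/new-arrivals/view-all/"
        "_jcr_content/main/productlisting.display.json")

def listing_url(total_items):
    # One request covering the whole result set: offset stays 0 and
    # page-size is the total item count, so no page overlaps another.
    params = {
        "sort": "stock",
        "image-size": "small",
        "image": "model",
        "offset": 0,
        "page-size": total_items,
    }
    return f"{BASE}?{urlencode(params)}"

# In the spider, start_requests would then yield just one request:
#     yield scrapy.Request(url=listing_url(243), callback=self.parse)
```

With a single request there are no duplicates to deduplicate, so CLOSESPIDER_ITEMCOUNT is no longer needed as a workaround.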
Output: