Here is the code I am using to scrape the website:
import scrapy

class DellLatitudeSpider(scrapy.Spider):
    name = 'dell_latitude'
    allowed_domains = ['www.dell.com/community']

    def start_request(self):
        yield scrapy.Request(
            url='https://www.dell.com/community/Latitude/bd-p/Latitude?ref=lithium_menu',
            callback=self.parse,
            headers='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36'
        )

    def parse(self, response):
        for activity in response.xpath("//table[@class='lia-list-wide']"):
            yield {
                'board': 'Laptops',
                'sub board': 'Latitude',
                'title': activity.xpath("//a[@class='page-link lia-link-navigation lia-custom-event']/text()").get(),
                'url': activity.xpath("//a[@class='page-link lia-link-navigation lia-custom-event']/@href").get()
            }
Initially I got an error, which I fixed by adding a User-Agent, but now when I run the crawl it reports 0 pages crawled and nothing else.
Here is the output I get:
2022-07-08 17:38:01 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-07-08 17:38:01 [scrapy.core.engine] INFO: Spider opened
2022-07-08 17:38:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-07-08 17:38:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-07-08 17:38:01 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-08 17:38:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.005837,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 7, 8, 12, 8, 1, 807968),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'start_time': datetime.datetime(2022, 7, 8, 12, 8, 1, 802131)}
2022-07-08 17:38:01 [scrapy.core.engine] INFO: Spider closed (finished)
1 Answer
Your headers argument should be a dict mapping header names to header values, not a string:

headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36'}
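For completeness, below is a minimal sketch of the spider with that fix applied. Two further points are worth checking, and both are standard Scrapy behavior rather than anything specific to this site: Scrapy calls start_requests (plural), so a method named start_request is never invoked, which by itself would produce the "Crawled 0 pages" log; and allowed_domains expects bare domain names, not URLs with paths. The XPath expressions below are the original poster's, only made relative so each table row yields its own link; the selectors themselves are untested here.

import scrapy

class DellLatitudeSpider(scrapy.Spider):
    name = 'dell_latitude'
    # allowed_domains takes bare domains; entries with paths are ignored
    # by the offsite middleware with a warning
    allowed_domains = ['www.dell.com']

    # Scrapy looks for start_requests (plural); a method named
    # start_request is never called, so no requests get scheduled
    def start_requests(self):
        yield scrapy.Request(
            url='https://www.dell.com/community/Latitude/bd-p/Latitude?ref=lithium_menu',
            callback=self.parse,
            # headers must be a dict mapping header names to values
            headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                                   'Chrome/103.0.5060.114 Safari/537.36'},
        )

    def parse(self, response):
        for activity in response.xpath("//table[@class='lia-list-wide']"):
            yield {
                'board': 'Laptops',
                'sub board': 'Latitude',
                # relative XPaths (leading ".") so each row yields its own
                # link instead of the first match in the whole document
                'title': activity.xpath(".//a[@class='page-link lia-link-navigation lia-custom-event']/text()").get(),
                'url': activity.xpath(".//a[@class='page-link lia-link-navigation lia-custom-event']/@href").get(),
            }

You can then run scrapy crawl dell_latitude -o results.json to confirm that items are actually being yielded.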