I wrote a simple spider, and I want it to follow every link within a single domain (in this example, amazon.com). Here is the code I have so far:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse
from scrapy.utils.response import open_in_browser


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['amazon.com']
    rules = (
        Rule(
            LinkExtractor(
                allow='',
                deny_extensions=['7z', '7zip', 'apk', 'bz2', 'cdr', 'dmg', 'ico',
                                 'iso', 'tar', 'tar.gz', 'pdf', 'docx'],
            ),
            callback='parse_item',
            follow=True,
        ),
    )
    custom_settings = {'LOG_ENABLED': True}

    def start_requests(self):
        #print(self.website)
        url = 'https://www.amazon.com/s?k=water+balloons'
        yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        #open_in_browser(response)
        print(response.url)
I looked at this question, but the answer there did not work for me: scrapy follow all the links and get status. I also tried replacing allow='' with restrict_xpaths='\\a', but that did not fix it either. Any help is appreciated.
Note: the crawler must stay within the amazon.com domain.
1 Answer
You have specified the rules correctly; the problem with your code is that you are not calling the right method from start_requests. For the rules to be triggered, the first request has to be handled by CrawlSpider's built-in parse method, roughly like this:
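A minimal sketch of that idea, assuming the rest of the spider stays exactly as posted: drop the explicit callback from the initial request, so the response falls through to CrawlSpider's default parse method, which applies the Rule/LinkExtractor configuration and schedules the extracted links.

    def start_requests(self):
        url = 'https://www.amazon.com/s?k=water+balloons'
        # No callback here: with no callback given, Scrapy routes the response
        # to the spider's parse() method. For a CrawlSpider that is the built-in
        # rule-processing method, which follows the extracted links and calls
        # parse_item for every matching page.
        yield scrapy.Request(url)

With this change, parse_item is only reached through the rules, so the pages it sees have already been filtered by allowed_domains and deny_extensions. Note that the very first response (the search page itself) is not passed to parse_item; if you also need to process it, CrawlSpider provides the parse_start_url hook for that purpose.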