The following Scrapy CrawlSpider class is meant to follow the pagination links on data.ok.gov and scrape the dataset links from each page.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
# SgmlLinkExtractor is deprecated in later Scrapy versions
# in favor of scrapy.linkextractors.LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class OklahomaFinanceSpider(CrawlSpider):
    name = "OklahomaFinanceSpider"
    allowed_domains = ["data.ok.gov"]
    start_urls = [
        "http://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"
    ]
    rules = (
        Rule(
            SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]',)),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_page(self, response):
        for href in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "),"search-results apachesolr_search-results")]/h3/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
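As an aside on the loop above: response.urljoin() resolves each extracted href against the page's own URL, which matters because links on the site may be site-relative. The resolution behavior can be sketched with the standard library alone (the href path below is illustrative, not taken from the actual page):

```python
from urllib.parse import urljoin

# Scrapy's response.urljoin(href) resolves href against the response URL,
# following standard URL-joining rules as in urllib.parse.urljoin.
base = "http://data.ok.gov/browse?f[0]=bundle_name%3ADataset"
href = "/dataset/example-dataset"  # hypothetical site-relative link

print(urljoin(base, href))  # http://data.ok.gov/dataset/example-dataset
```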
However, the first page is never scraped. What mistake am I making in the rules?
1 Answer
Posting @paul trmbrth's comment here as the answer: to have the pages fetched from
start_urls
parsed as well, set
parse_start_url = parse_page
after the parse_page(self, response) definition.
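This works because CrawlSpider routes responses for start_urls through parse_start_url(), whose default implementation yields nothing, so rule callbacks never see the first page. Assigning parse_start_url = parse_page is plain class-attribute aliasing. A minimal pure-Python sketch of that mechanism (the classes below only mimic CrawlSpider's dispatch, they are not Scrapy's actual code):

```python
class FakeCrawlSpider:
    """Stand-in for CrawlSpider: start-URL responses go to parse_start_url,
    which by default produces no items."""

    def parse_start_url(self, response):
        return []


class MySpider(FakeCrawlSpider):
    def parse_page(self, response):
        return ["item from " + response]

    # The fix: alias the attribute, so first-page responses
    # are handled by parse_page as well.
    parse_start_url = parse_page


print(MySpider().parse_start_url("page-1"))  # ['item from page-1']
```

Without the alias, FakeCrawlSpider's default would run and the first page would yield nothing, which is exactly the symptom in the question.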