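I'm trying to scrape job vacancies from Indeed. Everything in my scraper works, except that it only scrapes the first page. Does anyone know what the problem might be?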
import scrapy


class IndeedSpider(scrapy.Spider):
    name = 'indeed'
    allowed_domains = ['nl.indeed.com']
    start_urls = ['https://nl.indeed.com/vacatures?l=Woerden&limit=50&lang=en&start=0']

    def parse(self, response):
        # Follow every job link on the results page
        urls = response.xpath('//h2[contains(@class, "jobTitle")]/a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

        # Follow the link to the next results page
        next_page_url = response.css('ul.pagination-list li:nth-child(7) a::attr(href)').get()
        if next_page_url is not None:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        Page = response.url
        Title = response.css('h1.icl-u-xs-mb--xs.icl-u-xs-mt--none.jobsearch-JobInfoHeader-title ::text').extract_first()
        Company = response.css('div.icl-u-lg-mr--sm.icl-u-xs-mr--xs ::text').extract_first()
        Location = response.css('.jobsearch-DesktopStickyContainer-companyrating+ div div ::text').extract_first()
        Description = response.xpath('normalize-space(//div[contains(@class, "jobsearch-jobDescriptionText")])').extract_first()
        Date = response.css('span.jobsearch-HiringInsights-entry--text ::text').extract_first()
        yield {
            'Page': Page,
            'Title': Title,
            'Company': Company,
            'Location': Location,
            'Description': Description,
            'Date': Date,
        }
Can anyone help me?
1 Answer
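The CSS selector is wrong, so "next_page_url" is None. The next-page link is the 6th child, not the 7th. Instead of "nth-child" I used "last-child", which matches the last pagination item regardless of its position.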
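
A minimal sketch of that fix, assuming the "next" link is always the last item in ul.pagination-list. The helper name extract_next_page_url is hypothetical and not part of Scrapy; it only relies on the response.css(...).get() and response.urljoin(...) calls already used in the question.

```python
def extract_next_page_url(response):
    """Return the absolute URL of the next results page, or None if there is none.

    Hypothetical helper: `response` is expected to be a Scrapy response
    (anything offering .css() and .urljoin() works).
    """
    # li:last-child matches the final pagination item however many items
    # the list holds, so the selector keeps working when the count changes.
    href = response.css('ul.pagination-list li:last-child a::attr(href)').get()
    if href is None:
        return None
    # Pagination hrefs are relative, so resolve them against the page URL.
    return response.urljoin(href)
```

Inside parse, this would replace the nth-child(7) lookup: call next_page_url = extract_next_page_url(response), and if the result is not None, yield scrapy.Request(url=next_page_url, callback=self.parse). If the last item ever points back to a page that was already requested, Scrapy's default duplicate filter should drop the repeated request.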