I'm having some trouble with a Scrapy spider.
The parse() function isn't working as expected. It receives the response for a url built from a search keyword, then follows the url of every listing on the page to populate a Scrapy Data item.
It has a second yield that recursively calls parse with the next_page url until we reach max_pages, in order to collect the listings from the following pages.
That second yield returns no output in the output.json file when running scrapy crawl example -o output.json.
Here is a simplified, working version of the spider code that should reproduce the problem when added to a Scrapy project.
import scrapy

class Data(scrapy.Item):
    page: int = scrapy.Field()
    url: str = scrapy.Field()
    description: str = scrapy.Field()
    user: str = scrapy.Field()
    images: list = scrapy.Field()

class Example(scrapy.Spider):
    name = 'example'
    search = '/search?category=&keyword='
    keywords = ['terrains', 'maison', 'land']
    max_pages = 2
    current_page = 1

    def gen_requests(self, url):
        for keyword in self.keywords:
            build_url = url + self.search
            kws = keyword.split(' ')
            if len(kws) > 1:
                for (i, val) in enumerate(kws):
                    if i == 0:
                        build_url += val
                    else:
                        build_url += f'+{val}'
            else:
                build_url += kws[0]
            yield scrapy.Request(build_url, meta={'main_url': url, 'current_page': 1}, callback=self.parse)

    def start_requests(self):
        urls = ['https://ci.coinafrique.com', 'https://sn.coinafrique.com', 'https://bj.coinafrique.com']
        for url in urls:
            for request in self.gen_requests(url):
                yield request

    def parse(self, response):
        current_page = response.meta['current_page']
        main_url = response.meta['main_url']
        for listing in response.css('div.col.s6.m4'):
            href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
            yield scrapy.Request(response.urljoin(href), meta={'current_page': current_page}, callback=self.followListing)
        try:
            next_page_url = response.css('li.pagination-indicator.direction a::attr(href)')[1].get()
            if next_page_url is not None and current_page < self.max_pages:
                next_page = main_url + '/search' + next_page_url
                current_page += 1
                yield scrapy.Request(next_page, meta={'main_url': main_url, 'current_page': 1}, callback=self.parse)
        except:
            print('No next page found')

    def followListing(self, response):
        url = response.url
        current_page = response.meta['current_page']
        description = response.xpath('//div[@class="ad__info__box ad__info__box-descriptions"]//text()').getall()[1]
        profile = response.css('div.profile-card__content')
        user = profile.xpath('.//p[@class="username"]//text()').get()
        images = []
        for image in response.xpath('//div[contains(@class,"slide-clickable")]/@style').re(r'url\((.*)\)'):
            images.append(image)
        yield Data(
            page=current_page,
            url=url,
            description=description,
            user=user,
            images=images
        )
If I swap the two yields in the parse() function, it only returns the listings from max_pages (e.g. page 2). It looks like in both orderings only the results of the first yield come through:
def parse(self, response):
    current_page = response.meta['current_page']
    main_url = response.meta['main_url']
    try:
        next_page_url = response.css('li.pagination-indicator.direction a::attr(href)')[1].get()
        if next_page_url is not None and current_page < self.max_pages:
            next_page = main_url + '/search' + next_page_url
            current_page += 1
            yield scrapy.Request(next_page, meta={'main_url': main_url, 'current_page': 1}, callback=self.parse)
    except:
        print('No next page found')
    for listing in response.css('div.col.s6.m4'):
        href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
        yield scrapy.Request(response.urljoin(href), meta={'current_page': current_page}, callback=self.followListing)
1 Answer
Scrapy's recommended way to pass variables between request callbacks is the cb_kwargs parameter rather than the request's meta dictionary, although in this case neither of them is actually needed.
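For reference, a minimal sketch of cb_kwargs usage (the spider name, start url, and a.next selector here are hypothetical, only to show the mechanism):

import scrapy

class PageSpider(scrapy.Spider):
    name = 'pages'
    start_urls = ['https://example.com/search']
    max_pages = 2

    def parse(self, response, current_page=1):
        # entries in cb_kwargs arrive as keyword arguments in the callback
        next_page = response.css('a.next::attr(href)').get()
        if next_page and current_page < self.max_pages:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse,
                cb_kwargs={'current_page': current_page + 1},
            )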
The reason your version doesn't work is that your construction of the next page url fails. So instead of passing the main_url and current_page variables around, you can get the current page from the pagination element at the bottom of the page by looking for the page link with the class name active, take that element's following sibling to find the next page, and then use response.urljoin to rebuild the relative link. For example:
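(A sketch, assuming the pagination bar marks the current page with an active class on an li element.)

def parse(self, response):
    # page number shown on the highlighted pagination link
    current_page = int(response.xpath(
        '//li[contains(@class, "active")]/a/text()').get('1'))
    # the element right after the active one links to the next page
    next_page = response.xpath(
        '//li[contains(@class, "active")]/following-sibling::li[1]/a/@href').get()
    if next_page and current_page < self.max_pages:
        # urljoin resolves the relative href against the page just parsed
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)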
You can do the same thing in the followListing method to get the current page. Altogether, your spider would look something like this:
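(Again a sketch: the pagination selectors are assumptions about the site's markup, and reading the current page inside followListing assumes the listing page also renders a pagination bar; everything else follows your original code.)

import scrapy

class Data(scrapy.Item):
    page: int = scrapy.Field()
    url: str = scrapy.Field()
    description: str = scrapy.Field()
    user: str = scrapy.Field()
    images: list = scrapy.Field()

class Example(scrapy.Spider):
    name = 'example'
    search = '/search?category=&keyword='
    keywords = ['terrains', 'maison', 'land']
    max_pages = 2

    def start_requests(self):
        urls = ['https://ci.coinafrique.com', 'https://sn.coinafrique.com', 'https://bj.coinafrique.com']
        for url in urls:
            for keyword in self.keywords:
                # '+'.join handles multi-word keywords in one step
                yield scrapy.Request(url + self.search + '+'.join(keyword.split()), callback=self.parse)

    def get_current_page(self, response):
        # page number on the highlighted pagination link; falls back to 1
        # if the page doesn't render a pagination bar
        page = response.xpath('//li[contains(@class, "active")]/a/text()').get()
        return int(page) if page and page.isdigit() else 1

    def parse(self, response):
        for listing in response.css('div.col.s6.m4'):
            href = listing.xpath('.//p[@class="ad__card-description"]/a/@href').get()
            yield scrapy.Request(response.urljoin(href), callback=self.followListing)
        # follow the sibling of the active pagination element to the next page
        next_page = response.xpath('//li[contains(@class, "active")]/following-sibling::li[1]/a/@href').get()
        if next_page and self.get_current_page(response) < self.max_pages:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def followListing(self, response):
        description = response.xpath('//div[@class="ad__info__box ad__info__box-descriptions"]//text()').getall()[1]
        user = response.css('div.profile-card__content').xpath('.//p[@class="username"]//text()').get()
        images = response.xpath('//div[contains(@class, "slide-clickable")]/@style').re(r'url\((.*)\)')
        yield Data(
            page=self.get_current_page(response),
            url=response.url,
            description=description,
            user=user,
            images=images,
        )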
Partial output from running the above with scrapy crawl example -o results.json.