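I'm trying to scrape the information shown below, but my spider produces the wrong output. What am I doing wrong? This is the page link: https://www.thegrommet.com/products/the-vintage-pearlmini-peas-in-the-pod-necklace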
from scrapy import Spider
from scrapy.http import Request


class AuthorSpider(Spider):
    name = 'book'
    start_urls = ['https://www.thegrommet.com/gifts/by-type/personalized-gifts']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//div[@class='flex-grow | p-t-s']//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        title = response.xpath("//div[@class='f-heading-xl']//text()").get()
        title = title.strip()
        d3 = response.xpath("//div[@class='accordion-section | p-t-s p-b-m']")
        for pro in d3:
            data = [tup for tup in pro.xpath('//div//text()')]
            try:
                trip = data[1].get()
            except:
                trip = ''
            trip = trip.strip()
            try:
                tuck = data[2].get()
            except:
                tuck = ''
            tuck = tuck.strip()
            try:
                tup = data[3].get()
            except:
                tup = ''
            tup = tup.strip()
            yield {
                'title': title,
                'd1': trip,
                'd2': tuck,
                'd3': tup,
            }
1 answer
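You can select d1, d2, and d3 with XPath expressions in the following way. There is no need for try/except, because Scrapy itself handles missing values: .get() returns None rather than raising when nothing matches. You can also use the XPath function normalize-space() to remove leading and trailing spaces and newlines. Full working code: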