With the following code, all of the values themselves are retrieved correctly. However, one of the retrieved values (the conversion date) is not processed by the ItemLoader and is output as-is.
After various checks, it appears that the values obtained in parse_firstpage_item are not passed through the ItemLoader's processors, while every field retrieved by parse_productpage_item is processed correctly.
I have also verified that the processor definitions in the ItemLoader are correct, because whenever a value does go through the ItemLoader, the output comes out in the desired form.
Therefore, I suspect there is a problem in the spider code itself.
I am a beginner, so it is really hard for me to understand how data is processed in Scrapy...
import csv

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.loader import ItemLoader

from buyma_researchtool.items import BuymaResearchtoolItem  # assumed project module


class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy'
    allowed_domains = ['www.buyma.com']

    # Read from the shopper URL list saved in a csv file
    def start_requests(self):
        with open('/Users/morni/BUYMA/buyma_researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                for n in range(1, 3):  # 300
                    url = str((row[2])[:-5] + '/sales_' + str(n) + '.html')
                    # f'{self.base_page}{row}/sales_{n}.html'
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        # errback=self.errback_httpbin,
                        dont_filter=True
                    )

    # Obtain the 30 product links on the order history page, and the date of
    # sign-up (obtained here, since it is not listed on the individual product
    # pages), store these two pieces of information in the item, and pass the
    # request on to the next parsing method.
    def parse_firstpage_item(self, response):
        conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_url = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()
        for i in range(2):  # only the first two of the 30 entries
            loader = ItemLoader(item=BuymaResearchtoolItem(), response=response)
            loader.add_value("conversion_date", conversion_date[i])
            loader.add_value("product_url", product_url[i])
            item = loader.load_item()
            yield scrapy.Request(
                url=response.urljoin(item["product_url"][-1]),
                callback=self.parse_productpage_item,
                cb_kwargs={'item': item},
            )

    # Retrieve the content of the product detail page and merge it with the
    # previously retrieved information
    def parse_productpage_item(self, response, item):
        loader = ItemLoader(item=item, response=response)
        loader.add_xpath("product_name", 'normalize-space(//li[@class="page_ttl"]/span[1]/text())')
        loader.add_xpath("brand_name", 'normalize-space(//*[@id="s_brand"]/dd/a/text())')
        # 〜〜 (remaining add_xpath calls omitted)
        yield loader.load_item()
The ItemLoader processing is defined as follows:
import datetime

import scrapy
from itemloaders.processors import MapCompose, TakeFirst


def strip_n(element):
    if element:
        return element.replace('\t', '').replace('\n', '')
    return element


def conversion_dates(element):
    if element:
        str = element.replace('成約日:', '')
        dte = datetime.datetime.strptime(str, '%Y/%m/%d')
        return dte
    return element


class BuymaResearchtoolItem(scrapy.Item):
    # first_page
    conversion_date = scrapy.Field(
        input_processors=MapCompose(conversion_dates),
        output_processors=TakeFirst()
    )
    product_url = scrapy.Field(
        output_processors=TakeFirst()
    )
    # product_page
    product_name = scrapy.Field(
        input_processors=MapCompose(strip_n),
        output_processors=TakeFirst()
    )
    brand_name = scrapy.Field(
        input_processors=MapCompose(strip_n),
        output_processors=TakeFirst()
    )
1 Answer
I noticed a few problems. The first is the excessive indentation, which would certainly raise an error; I will assume that is just a copy-and-paste issue. Next, in your conversion_dates function you use str as a variable name, but str is also a type name, so the assignment shadows the built-in. The last problem is that you are using incorrect keyword arguments for each field in your Item class: the metadata keys that ItemLoader looks up are the singular input_processor and output_processor, so your plural spellings are silently ignored and the values pass through unprocessed. Fixing these small issues should make your spider run as expected.
For example:
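Here is a minimal corrected sketch of the item module. It assumes the processors are imported from itemloaders.processors (exposed as scrapy.loader.processors in older Scrapy releases), renames the shadowing variable, and switches the Field keyword arguments to the singular names:

import datetime

import scrapy
from itemloaders.processors import MapCompose, TakeFirst  # scrapy.loader.processors in older releases


def strip_n(element):
    if element:
        return element.replace('\t', '').replace('\n', '')
    return element


def conversion_dates(element):
    if element:
        # renamed from `str` so the built-in type is no longer shadowed
        date_text = element.replace('成約日:', '')
        return datetime.datetime.strptime(date_text, '%Y/%m/%d')
    return element


class BuymaResearchtoolItem(scrapy.Item):
    # first_page
    conversion_date = scrapy.Field(
        input_processor=MapCompose(conversion_dates),  # singular, not input_processors
        output_processor=TakeFirst()                   # singular, not output_processors
    )
    product_url = scrapy.Field(
        output_processor=TakeFirst()
    )
    # product_page
    product_name = scrapy.Field(
        input_processor=MapCompose(strip_n),
        output_processor=TakeFirst()
    )
    brand_name = scrapy.Field(
        input_processor=MapCompose(strip_n),
        output_processor=TakeFirst()
    )

With the singular keyword names, ItemLoader finds the processors in the Field metadata, so values added with add_value in parse_firstpage_item are run through MapCompose(conversion_dates) just like the fields added with add_xpath later.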