Scrapy: how do I export an empty value to CSV when the element is not found? [python, scrapy, web-scraping]

1mrurvl1  posted on 2022-12-18  in Python
Follow (0) | Answers (2) | Views (117)

I'm writing my first web-scraping project, and I want to scrape booking.com.
I want to scrape whether a hotel's rate includes breakfast.
The problem is that I want each value to be either ["Breakfast included"] or an empty value [""] when there is no such information. When I run my code (below), I only get a handful of ["Breakfast included"] values.
I don't know how to solve this, because when breakfast is not included in the rate, the hotel's property card simply has no element with the class "e05969d63d" (the class that carries the breakfast information when the rate includes it).
So if hotel 1 and hotel 3 have "Breakfast included" and hotel 2 does not,
I want to export something like ["Breakfast included", "", "Breakfast included"],
but all I get is ["Breakfast included", "Breakfast included"].

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


import scrapy
import logging
from scrapy.crawler import CrawlerProcess
from scrapy.exporters import CsvItemExporter

class CsvPipeline(object):
    def __init__(self):
        self.file = open ('hotel.tmp','wb')
        self.exporter = CsvItemExporter(self.file,str)
        self.exporter.start_exporting()
    def close_spider(self,spider):
        self.exporter.finish_exporting()
        self.file.close()
    def process_items(self,item,spider):
        self.exporter.export_item(item)
        return item
class hotelsNY(scrapy.Spider):
    name = "hotelsNY"
    start_urls =[]
    #start_urls = ['https://www.booking.com/searchresults.pl.html?label=gen173nr-1BCAEoggI46AdIM1gEaLYBiAEBmAEeuAEXyAEM2AEB6AEBiAIBqAIDuALX3uicBsACAdICJGRlODkzYmJmLTIyZjQtNDYwNi04YzYwLWIxOWRlMGU0MmM0MdgCBeACAQ&sid=7ab6fb8585341629f1a790546e37a1c5&aid=304142&ss=Nowy+Jork&ssne=Nowy+Jork&ssne_untouched=Nowy+Jork&lang=pl&sb=1&src_elem=sb&src=index&dest_id=20088325&dest_type=city&checkin=2022-12-30&checkout=2023-01-03&group_adults=2&no_rooms=1&group_children=0&sb_travel_purpose=leisure&offset=0']
    for i in range (0, 10):
        start_urls.append('https://www.booking.com/searchresults.pl.html?label=gen173nr-1BCAEoggI46AdIM1gEaLYBiAEBmAEeuAEXyAEM2AEB6AEBiAIBqAIDuALX3uicBsACAdICJGRlODkzYmJmLTIyZjQtNDYwNi04YzYwLWIxOWRlMGU0MmM0MdgCBeACAQ&sid=7ab6fb8585341629f1a790546e37a1c5&aid=304142&ss=Nowy+Jork&ssne=Nowy+Jork&ssne_untouched=Nowy+Jork&lang=pl&sb=1&src_elem=sb&src=index&dest_id=20088325&dest_type=city&checkin=2022-12-30&checkout=2023-01-03&group_adults=2&no_rooms=1&group_children=0&sb_travel_purpose=leisure&offset=' + str(i*25))
        
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'hotels_tmp1.csv'
    }
   

    def parse(self, response):
        nexturl = 'https://www.booking.com/searchresults.pl.html?label=gen173nr-1BCAEoggI46AdIM1gEaLYBiAEBmAEeuAEXyAEM2AEB6AEBiAIBqAIDuALX3uicBsACAdICJGRlODkzYmJmLTIyZjQtNDYwNi04YzYwLWIxOWRlMGU0MmM0MdgCBeACAQ&sid=7ab6fb8585341629f1a790546e37a1c5&aid=304142&ss=Nowy+Jork&ssne=Nowy+Jork&ssne_untouched=Nowy+Jork&lang=pl&sb=1&src_elem=sb&src=index&dest_id=20088325&dest_type=city&checkin=2022-12-30&checkout=2023-01-03&group_adults=2&no_rooms=1&group_children=0&sb_travel_purpose=leisure&offset=0'
        #all_names = response.xpath('//*[@data-testid="title"]')
        alH = response.xpath('//*[@data-testid="property-card"]').getall()
        for name in allH:
            hotelName = response.xpath('//*[@data-testid="title"]/text()').extract(),
            address = response.xpath('//*[@data-testid="address"]/text()').extract(),
            price = response.xpath('//*[@data-testid="price-and-discounted-price"]/text()').extract(),
            dist = response.xpath('//span[@data-testid="distance"]/text()').extract(),
            breakfast = response.xpath('//span[@class="e05969d63d"]/text()').extract(),
            yield {'hotelName': hotelName, 'address': address, 'price': price, 'dist': dist, 'breakfast': breakfast}
process = CrawlerProcess(
    {
     'USER_AGENT':'Mozilla/4.0 (comatible;MSIE 7.0;Window NT 5.1)'
     })
process.crawl(hotelsNY)
process.start()

6ljaweal #1

There are a few issues with your spider:
1. Once you call getall() on the allH XPath, you have extracted the text matched by that XPath expression, and you can no longer use the result as a selector you can chain.
2. Use relative XPath expressions with chained selectors, so that instead of extracting lists of all matched elements you iterate over the page card by card, which I believe was your original intent.
3. To make sure "breakfast" ends up as an empty string, you can test whether it is None and explicitly set it to an empty string when needed.
Here is an example.
Note the './/' at the start of the XPath expressions inside the for loop: those are relative XPath expressions. Also note how the selectors are chained by calling i.xpath instead of response.xpath inside the loop.

allH = response.xpath('//*[@data-testid="property-card"]')
for i in allH:
    hotelName = i.xpath('.//*[@data-testid="title"]//text()').get()
    address = i.xpath('.//*[@data-testid="address"]//text()').get()
    price = i.xpath('.//*[@data-testid="price-and-discounted-price"]//text()').get()
    dist = i.xpath('.//span[@data-testid="distance"]//text()').get()
    # './/' keeps the search inside this card; a bare '//' would match the whole page
    breakfast = i.xpath('.//span[@class="e05969d63d"]//text()').get()
    if breakfast is None:
        breakfast = ""
    yield {'hotelName': hotelName, 'address': address, 'price': price,
           'dist': dist, 'breakfast': breakfast}
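The card-by-card pattern above can be sketched with the standard library alone (xml.etree instead of Scrapy selectors, and a made-up <card>/<breakfast> markup purely for illustration): a relative find inside each card returns None when the element is missing, which is then normalized to an empty string.

```python
import xml.etree.ElementTree as ET

# Toy markup standing in for the property cards; only hotels 1 and 3
# carry a <breakfast> element, mirroring the situation in the question.
html = """
<results>
  <card><title>Hotel 1</title><breakfast>Breakfast included</breakfast></card>
  <card><title>Hotel 2</title></card>
  <card><title>Hotel 3</title><breakfast>Breakfast included</breakfast></card>
</results>
"""

root = ET.fromstring(html)
rows = []
for card in root.findall("card"):   # iterate one card at a time
    b = card.find("breakfast")      # relative search: this card only
    rows.append(b.text if b is not None else "")
print(rows)  # ['Breakfast included', '', 'Breakfast included']
```

Scrapy's own selectors also offer this normalization in one step: `.get(default="")` returns an empty string instead of None when nothing matches.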

os8fio9y #2

You are currently not using the for name in allH loop at all, and on the line above you defined the variable as alH rather than allH.
I would suggest importing BeautifulSoup (from bs4 import BeautifulSoup) and changing your for loop to the following:

for name in alH:
    hotel = BeautifulSoup(name.extract(), features="lxml")
    hotelName = hotel.find(attrs={"data-testid":"title"}).get_text()
    print(hotelName)
    address = hotel.find(attrs={"data-testid":"address"}).get_text()
    price = hotel.find(attrs={"data-testid": "price-and-discounted-price"}).get_text()
    dist = hotel.find(attrs={"data-testid": "distance"}).get_text()
    breakfast = hotel.find(class_="e05969d63d")

    if breakfast:
        breakfast = breakfast.get_text()
    else:
        breakfast = ""
    print(breakfast)
    yield {'hotelName': hotelName, 'address': address, 'price': price, 'dist': dist, 'breakfast': breakfast}

BeautifulSoup makes it easier to extract data from HTML and XML files, and you could use it in your code to replace any of the XPath calls. This is just a quick example of how to use it, but I recommend looking into the tool further.
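Whichever approach you take, once missing values are normalized to empty strings the CSV columns stay aligned. A quick standard-library sketch (using the plain csv module rather than Scrapy's CsvItemExporter) shows the empty cell surviving the export:

```python
import csv
import io

# Three rows mirroring the question: hotel 2 has no breakfast info,
# normalized to an empty string rather than dropped.
rows = [
    {"hotelName": "Hotel 1", "breakfast": "Breakfast included"},
    {"hotelName": "Hotel 2", "breakfast": ""},
    {"hotelName": "Hotel 3", "breakfast": "Breakfast included"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["hotelName", "breakfast"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())  # 'Hotel 2,' row keeps its empty breakfast column
```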
