Scrapy Conditional HTML Values

ddrv8njm  posted 12 months ago in Other

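The code below locates most of the elements I'm looking for. However, the temperature and wind speed use tags that change depending on weather severity. How can I make the code below consistently get the correct TempProb and wind speed values from the page?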

import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Extracting the content using XPath selectors
        Datetimes = response.xpath(
            '//div[@class="fw-bold text-wrap"]/text()').extract()
        awayTeams = response.xpath('//span[@class="fw-bold"]/text()').extract()
        homeTeams = response.xpath(
            '//span[@class="fw-bold ms-1"]/text()').extract()
        TempProbs = response.xpath(
            '//div[@class="mx-2"]/span/text()').extract()
        windspeeds = response.xpath(
            '//div[@class="text-break col-md-4 mb-1 px-1 flex-centered"]/span/text()').extract()
        # winddirection =

        # Give the extracted content row wise
        for item in zip(Datetimes, awayTeams, homeTeams, TempProbs, windspeeds):
            # create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': item[0],
                'awayTeam': item[1],
                'homeTeam': item[2],
                'TempProb': item[3],
                'windspeeds': item[4]
            }

            # yield or give the scraped info to scrapy
            yield scraped_info


72qzrwbm  #1

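Sure! Below is the modified Scrapy code. I made some changes so that the temperature, probability, and wind speed are extracted more consistently. I also included comments explaining each part of the code: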

import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    start_urls = ['http://nflweather.com/']

    def parse(self, response):
        # Extracting the content using css selectors
        game_boxes = response.css('div.game-box')

        for game_box in game_boxes:
            # Extracting date and time information
            # (default='' keeps .strip() below from failing when nothing matches)
            Datetimes = game_box.css('.col-12 .fw-bold::text').get(default='')

            # Extracting team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()

            # Extracting temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()

            # Extracting wind speed information
            windspeeds = game_box.css('.icon-weather + span::text').get(default='')

            # Create a dictionary to store the scraped info
            scraped_info = {
                'Datetime': Datetimes.strip(),
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds.strip()
            }

            # Yield or give the scraped info to Scrapy
            yield scraped_info

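I made the team selectors more specific. Instead of page-wide team-name selectors, they are scoped to the .team-game-box elements inside each game box, so each game's teams are read from its own container.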
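Extracting per game box also keeps each row's fields together. The following is a minimal sketch, using made-up values rather than real nflweather.com data, of how pairing page-wide lists by position with zip can go wrong when one game has no match for a field. The names teams, winds, and games are hypothetical.

```python
# Hypothetical values for three games; the second game has no wind reading.
teams = ["Team A", "Team B", "Team C"]
winds = ["12 mph", "7 mph"]  # page-wide list: the gap is invisible here

# Position-based pairing: Team C's wind is attached to Team B,
# and Team C is dropped because zip stops at the shortest list.
zipped = list(zip(teams, winds))

# Per-game extraction: each record carries its own (possibly missing) value.
games = [
    {"team": "Team A", "wind": "12 mph"},
    {"team": "Team B", "wind": None},
    {"team": "Team C", "wind": "7 mph"},
]
per_game = [(g["team"], g["wind"]) for g in games]

print(zipped)
print(per_game)
```

Looping over one container per game, as the answer's code does, corresponds to the second form.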
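For temperature and probability, I kept the selector essentially unchanged (it still targets the span inside .mx-2), assuming it still matches the updated HTML. If the page structure changes, you may need to adjust this selector.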
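For wind speed, I changed the selector to target the span that directly follows the weather icon (.icon-weather + span) within each game box, rather than depending on a class such as "text-danger" that changes with severity. This should make the extraction more consistent.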
