Why do my Scrapy selectors work in the Scrapy shell but return empty items in the spider?

wb1gzix0 · published 2022-11-09 in Shell

I tested my selectors in the Scrapy shell and they all work. They also work when I run the spider against a single player URL alone. But when the spider crawls the site from the code below, the selectors return None. This is how I collect the player URLs:

import scrapy


class TransfersSpider(scrapy.Spider):
    name = "transfers"
    start_urls = []
    for year in range(1970, 2022)[29:31]:  # for each transfer year
        url = f"https://www.transfermarkt.com/transfers/saisontransfers/statistik/top/ajax/yw0/saison_id/{year}/transferfenster/alle/land_id//ausrichtung//spielerposition_id//altersklasse//leihe//plus/0/galerie/0/"
        start_urls.append(url)

    def parse(self, response):
        for page_num in range(1, 2):
            page_url = response.url + f"page/{page_num}?/ajax=yw0"
            yield scrapy.Request(page_url, callback=self.parse_page)

    def parse_page(self, response):
        players_urls_ = response.css('table[class="items"] > tbody > tr > td>table>tr>td').css('a[href*=profil]::attr(href)').getall()[0:1]
        players_urls = ["https://www.transfermarkt.com" + url for url in players_urls_]

        for player_url in players_urls:
            yield scrapy.Request(player_url, callback=self.parse_info)

    def parse_info(self, response):
        item = dict()
        info_table = response.css('div[class="large-6 large-pull-6 small-12 columns spielerdatenundfakten"]')
        name = response.xpath('//h1//text()').getall()
        name = ' '.join([i.strip() for i in name if (i.strip() and i.strip().isalpha())])
        item["name"] = name
        item["date_of_birth"] = info_table.xpath('//span[contains(text(), "Date of birth:")]/following-sibling::span/a/text()').get(default="").strip()
        item["place_of_birth"] = info_table.xpath('//span[contains(text(), "Place of birth:")]/following-sibling::span/span/text()').get(default="").strip()
        item["height"] = info_table.xpath('//span[contains(text(), "Height:")]/following-sibling::span/text()').get(default="").strip()
        item["citizenship"] = info_table.xpath('//span[contains(text(), "Citizenship:")]/following-sibling::span/img/@title').get().strip()
        item["foot"] = info_table.xpath('//span[contains(text(), "Foot:")]/following-sibling::span/text()').get(default="").strip()
        player_agent = info_table.xpath('//span[contains(text(), "Player agent:")]/following-sibling::span/span/text()').get(default="").strip()  # player agent has two possible xpaths
        item["player_agent"] = info_table.xpath('//span[contains(text(), "Player agent:")]/following-sibling::span/span/a/text()').get(default="").strip() or player_agent
        item["main_position"] = response.css('div[class="detail-position__inner-box"] > dl> dd::text').get(default="").strip()
        item["other_position"] = response.css('div[class="detail-position__position"] > dl>  dd::text').getall() or ["none"]
        item["outfitter"] = info_table.xpath('//span[contains(text(), "Outfitter:")]/following-sibling::span/text()').get(default="").strip()

        yield item
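As a side note, the `range(1970, 2022)[29:31]` slice in `start_urls` limits the crawl to just two seasons. A quick standalone check of what that slice yields:

```python
# Slicing a range object returns another range; listing it shows
# which season years the spider actually requests.
years = list(range(1970, 2022)[29:31])
print(years)  # [1999, 2000]
```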

Output when running the spider normally:

2022-09-29 19:15:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.transfermarkt.com/christian-vieri/profil/spieler/5797>
{'citizenship': '',
 'date_of_birth': '',
 'foot': '',
 'height': '',
 'main_position': 'Centre-Forward',
 'name': 'Christian Vieri',
 'other_position': ['none'],
 'outfitter': '',
 'place_of_birth': '',
 'player_agent': ''}
2022-09-29 19:15:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.transfermarkt.com/luis-figo/profil/spieler/3446>
{'citizenship': '',
 'date_of_birth': '',
 'foot': '',
 'height': '',
 'main_position': 'Right Winger',
 'name': 'Luís Figo',
 'other_position': ['none'],
 'outfitter': '',
 'place_of_birth': '',
 'player_agent': ''}

Output when I use only the player URLs:

2022-09-29 19:12:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.transfermarkt.com/christian-vieri/profil/spieler/5797> (referer: None)
2022-09-29 19:12:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.transfermarkt.com/christian-vieri/profil/spieler/5797>
{'name': 'Christian Vieri', 'date_of_birth': 'Jul 12, 1973', 'place_of_birth': 'Bologna', 'height': '1,85\xa0m', 'citizenship': 'Italy', 'foot': 'left', 'player_agent': '', 'main_position': '', 'other_position': ['none'], 'outfitter': ''}
2022-09-29 19:12:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.transfermarkt.com/luis-figo/profil/spieler/3446> (referer: None)
2022-09-29 19:12:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.transfermarkt.com/luis-figo/profil/spieler/3446>
{'name': 'Luís Figo', 'date_of_birth': 'Nov 4, 1972', 'place_of_birth': 'Almada', 'height': '1,80\xa0m', 'citizenship': 'Portugal', 'foot': 'right', 'player_agent': '', 'main_position': 'Right Winger', 'other_position': ['Right Midfield', 'Attacking Midfield'], 'outfitter': ''}

Why do the selectors work in the Scrapy shell, and in the spider when I start from the player URLs directly, but not in the full spider? And why do only the name and main_position selectors work while all the others come back empty?

Answer 1 (ktecyv1j):

To diagnose the response that parse_info actually receives, I added the following at the top of the method, before any parsing:

def parse_info(self, response):
    # Drop into an interactive Scrapy shell with this exact response
    from scrapy.shell import inspect_response
    inspect_response(response, self)

This opens a Scrapy shell that lets you view the response exactly as Scrapy sees it and test selectors against it. It turns out the markup Scrapy receives differs from what the site shows in a browser: the info_table selector should instead be response.css('div[class="large-12 small-12 columns spielerdatenundfakten"]'), and with that change everything works.
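The underlying pitfall is that a CSS attribute selector like `div[class="…"]` only matches when the element's class attribute equals that string exactly, while a class selector like `div.spielerdatenundfakten` matches any element whose class list merely contains the token, so the latter would survive a change in the surrounding layout classes. A minimal stdlib sketch of the two matching rules, using the class strings from the question and the answer (the helper functions are illustrative, not part of Scrapy):

```python
# The browser-rendered page and the page served to Scrapy apparently
# carry different layout classes on the same div (per the answer above).
browser_classes = "large-6 large-pull-6 small-12 columns spielerdatenundfakten"
scrapy_classes = "large-12 small-12 columns spielerdatenundfakten"

def exact_match(attr, wanted):
    # semantics of  div[class="wanted"]  -- whole attribute must be equal
    return attr == wanted

def token_match(attr, token):
    # semantics of  div.token  -- class list must contain the token
    return token in attr.split()

print(exact_match(scrapy_classes, browser_classes))          # False: selector finds nothing
print(token_match(scrapy_classes, "spielerdatenundfakten"))  # True: token selector still matches
```

In the spider this would mean writing something like `response.css('div.spielerdatenundfakten')` instead of pinning the full class string, though the exact corrected selector from the answer works as well.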
