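I tested the selectors in the Scrapy shell and they all work. They also work when I give the spider only the player URLs. But when the spider reaches the player pages by crawling from the transfer listings, most selectors return nothing.
This is how I get the player URLs: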
import scrapy


class TransfersSpider(scrapy.Spider):
    name = "transfers"
    start_urls = []
    for year in range(1970, 2022)[29:31]:  # for each transfer year
        url = f"https://www.transfermarkt.com/transfers/saisontransfers/statistik/top/ajax/yw0/saison_id/{year}/transferfenster/alle/land_id//ausrichtung//spielerposition_id//altersklasse//leihe//plus/0/galerie/0/"
        start_urls.append(url)

    def parse(self, response):
        for page_num in range(1, 2):
            page_url = response.url + f"page/{page_num}?/ajax=yw0"
            yield scrapy.Request(page_url, callback=self.parse_page)

    def parse_page(self, response):
        players_urls_ = response.css('table[class="items"] > tbody > tr > td>table>tr>td').css('a[href*=profil]::attr(href)').getall()[0:1]
        players_urls = ["https://www.transfermarkt.com" + url for url in players_urls_]
        for player_url in players_urls:
            yield scrapy.Request(player_url, callback=self.parse_info)

    def parse_info(self, response):
        item = dict()
        info_table = response.css('div[class="large-6 large-pull-6 small-12 columns spielerdatenundfakten"]')
        name = response.xpath('//h1//text()').getall()
        name = ' '.join([i.strip() for i in name if (i.strip() and i.strip().isalpha())])
        item["name"] = name
        item["date_of_birth"] = info_table.xpath('//span[contains(text(), "Date of birth:")]/following-sibling::span/a/text()').get(default="").strip()
        item["place_of_birth"] = info_table.xpath('//span[contains(text(), "Place of birth:")]/following-sibling::span/span/text()').get(default="").strip()
        item["height"] = info_table.xpath('//span[contains(text(), "Height:")]/following-sibling::span/text()').get(default="").strip()
        item["citizenship"] = info_table.xpath('//span[contains(text(), "Citizenship:")]/following-sibling::span/img/@title').get().strip()
        item["foot"] = info_table.xpath('//span[contains(text(), "Foot:")]/following-sibling::span/text()').get(default="").strip()
        # player agent has two possible xpaths
        player_agent = info_table.xpath('//span[contains(text(), "Player agent:")]/following-sibling::span/span/text()').get(default="").strip()
        item["player_agent"] = info_table.xpath('//span[contains(text(), "Player agent:")]/following-sibling::span/span/a/text()').get(default="").strip() or player_agent
        item["main_position"] = response.css('div[class="detail-position__inner-box"] > dl> dd::text').get(default="").strip()
        item["other_position"] = response.css('div[class="detail-position__position"] > dl> dd::text').getall() or ["none"]
        item["outfitter"] = info_table.xpath('//span[contains(text(), "Outfitter:")]/following-sibling::span/text()').get(default="").strip()
        yield item
Output when running the regular spider:
2022-09-29 19:15:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.transfermarkt.com/christian-vieri/profil/spieler/5797>
{'citizenship': '',
'date_of_birth': '',
'foot': '',
'height': '',
'main_position': 'Centre-Forward',
'name': 'Christian Vieri',
'other_position': ['none'],
'outfitter': '',
'place_of_birth': '',
'player_agent': ''}
2022-09-29 19:15:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.transfermarkt.com/luis-figo/profil/spieler/3446>
{'citizenship': '',
'date_of_birth': '',
'foot': '',
'height': '',
'main_position': 'Right Winger',
'name': 'Luís Figo',
'other_position': ['none'],
'outfitter': '',
'place_of_birth': '',
'player_agent': ''}
Output when I use only the player URLs:
2022-09-29 19:12:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.transfermarkt.com/christian-vieri/profil/spieler/5797> (referer: None)
2022-09-29 19:12:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.transfermarkt.com/christian-vieri/profil/spieler/5797>
{'name': 'Christian Vieri', 'date_of_birth': 'Jul 12, 1973', 'place_of_birth': 'Bologna', 'height': '1,85\xa0m', 'citizenship': 'Italy', 'foot': 'left', 'player_agent': '', 'main_position': '', 'other_position': ['none'], 'outfitter': ''}
2022-09-29 19:12:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.transfermarkt.com/luis-figo/profil/spieler/3446> (referer: None)
2022-09-29 19:12:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.transfermarkt.com/luis-figo/profil/spieler/3446>
{'name': 'Luís Figo', 'date_of_birth': 'Nov 4, 1972', 'place_of_birth': 'Almada', 'height': '1,80\xa0m', 'citizenship': 'Portugal', 'foot': 'right', 'player_agent': '', 'main_position': 'Right Winger', 'other_position': ['Right Midfield', 'Attacking Midfield'], 'outfitter': ''}
Why do the selectors work in the Scrapy shell, and in a spider that is given only the player URLs, but not in the full spider? And why do only the name and main_position selectors work while the others do not?
1 Answer
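I diagnosed the response that reaches parse_info by opening a Scrapy shell from inside the callback, before any of the parsing code runs. That let me look at the response exactly as Scrapy sees it and test the selectors against it. Apparently the markup Scrapy receives during the crawl is not the same as the markup on the site. The info_table selector should be changed to
    response.css('div[class="large-12 small-12 columns spielerdatenundfakten"]')
and then it works.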
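The snippet the answer refers to is not shown above. The sketch below is one way to get the same effect, assuming Scrapy's inspect_response helper from scrapy.shell was what the answer used; that is a guess, not something the answer confirms. The class name PlayerInfoDebugMixin is hypothetical (in practice the method would sit on TransfersSpider), and the selector div.spielerdatenundfakten is also an assumption: it matches on the single class token that names the block, so it should not matter whether the layout classes are large-6 or large-12. Only one field is extracted to keep the example short.

```python
class PlayerInfoDebugMixin:
    """Hypothetical holder for parse_info; not part of the original spider."""

    def parse_info(self, response):
        # Assumption: the token "spielerdatenundfakten" is present on the
        # info block in every layout, so match on it alone instead of the
        # full, layout-dependent class string.
        info_table = response.css("div.spielerdatenundfakten")

        if not info_table:
            # Nothing matched: pause the crawl and open a shell bound to
            # this exact response so the selectors can be tested by hand.
            from scrapy.shell import inspect_response

            inspect_response(response, self)
            return

        # The leading "." keeps the XPath relative to info_table.
        height = info_table.xpath(
            './/span[contains(text(), "Height:")]/following-sibling::span/text()'
        ).get(default="")
        yield {"height": height.strip()}
```

This may also explain why only name and main_position were filled in the crawl output: those two are selected from response directly, while every other field is selected from info_table. When info_table matches nothing, any .xpath() call on it returns nothing as well, so get(default="") falls back to the empty string.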