我如何刮动态搜索结果页面与scrapy?[duplicate]

gopyfrb3  于 2022-11-09  发布在  其他
关注(0)|答案(1)|浏览(116)

此问题在此处已有答案

Can scrapy be used to scrape dynamic content from websites that are using AJAX?(9个答案)
6个月前关闭。
我试着从网站www.example.com上刮出结果https://howlongtobeat.com/#search。然而,当我刮的时候,20个结果中只有前6个结果。
我的代码:

import scrapy

cards =  response.css('div[class="search_list_details"]')

for card in cards: 
    game_name = card.css('a[class=text_white]::attr(title)').get()
    print(game_name)

输出:

'Elden Ring'
'Cyberpunk 2077'
'Kirby and the Forgotten Land'
'LEGO Star Wars The Skywalker Saga'
'Tomb Raider'
'Hollow Knight'
'Eiyuden Chronicle Rising' #This is not displayed on the page
'This War of Mine' #This is also not displayed on the page

我尝试使用其他选择器的卡,如response.css('li[class=back_darkish]'),但无济于事。
此外,我如何得到其他数据,如小时跳动,使我得到一个法令的名称,类型的完成和小时?:

<div>
    <div class="search_list_tidbit text_white shadow_text">Main Story</div>
    <div class="search_list_tidbit center time_100">50½ Hours </div>
    <div class="search_list_tidbit text_white shadow_text">Main + Extra</div>
    <div class="search_list_tidbit center time_100">94 Hours </div>
    <div class="search_list_tidbit text_white shadow_text">Completionist</div>
    <div class="search_list_tidbit center time_100">127 Hours </div>
</div>
vd2z7a6w

vd2z7a6w1#

实际上,数据是从外部url生成的,这是API调用HTML响应作为POST方法。

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'  
    def start_requests(self):
        url = 'https://howlongtobeat.com/search_results?page=1'
        payload = "queryString=&t=games&sorthead=popular&sortd=0&plat=&length_type=main&length_min=&length_max=&v=&f=&g=&detail=&randomize=0"
        headers = {
            "content-type":"application/x-www-form-urlencoded",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
        }

        yield scrapy.Request(url,method='POST', body=payload,headers=headers,callback=self.parse)

    def parse(self, response):
        cards = response.css('div[class="search_list_details"]')

        for card in cards: 
            game_name = card.css('a[class=text_white]::attr(title)').get()
            yield {
                "game_name":game_name
            }

if __name__ == "__main__":
    process =CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

输出:

{'game_name': 'Elden Ring'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Cyberpunk 2077'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Kirby and the Forgotten Land'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'LEGO Star Wars The Skywalker Saga'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hollow Knight'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Tomb Raider'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hades'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'The Witcher 3 Wild Hunt'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Red Dead Redemption 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Forbidden West'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Trek to Yomi'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Grand Theft Auto V'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'God of War'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Marvels Guardians of the Galaxy'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock Infinite'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Pokmon Legends Arceus'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Zero Dawn  Complete Edition'}
2022-05-12 13:37:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-12 13:37:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 2754,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.49537,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 5, 12, 7, 37, 12, 172047),
 'httpcompression/response_bytes': 23986,
 'httpcompression/response_count': 1,
 'item_scraped_count': 20,

相关问题