I'm building a Scrapy bot that scrapes ETFs from a website, but I can't get it to handle pagination. I want it to scrape the second page, but when I try, it scrapes the base URL instead.
Code:
import scrapy


class EtfsSpider(scrapy.Spider):
    name = "etfs"
    start_urls = ['https://etfdb.com/etfs/asset-class/bond/#etfs&sort_name=assets_under_management&sort_order=desc&page=2']

    def parse(self, response):
        etf_table = response.css('table#etfs tbody')
        for etf in etf_table.css('tr'):
            symbol = etf.css('td[data-th="Symbol"] a::text').get()
            name = etf.css('td[data-th="ETF Name"] a::text').get()
            total_assets = etf.css('td[data-th="Total Assets ($MM)"]::text').get()
            avg_daily_vol = etf.css('td[data-th="Avg. Daily Volume"]::text').get()
            closing_price = etf.css('td[data-th="Previous Closing Price"]::text').get()
            yield {
                "symbol": symbol,
                "name": name,
                "total assets": total_assets,
                "average daily volume": avg_daily_vol,
                "last closing price": closing_price
            }
The way I see it, this should request the URL in start_urls, which in this case is the second page of the ETFs table, but this is the output I get in the console:
2022-08-13 22:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://etfdb.com/robots.txt> (referer: None)
2022-08-13 22:36:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://etfdb.com/etfs/asset-class/bond/#etfs&sort_name=assets_under_management&sort_order=desc&page=2> (referer: None)
2022-08-13 22:36:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://etfdb.com/etfs/asset-class/bond/>
{'symbol': 'BND', 'name': 'Vanguard Total Bond Market ETF', 'total assets': '$84,446.60', 'average daily volume': None, 'last closing price': '$75.95'}
So it says it crawled the correct URL, but when it actually scrapes the items/data, it does so from the base URL, which is really just the first page. I don't know how to fix this.
1 Answer
The table is generated with JavaScript, so the HTML Scrapy receives only ever contains the first page. Also, everything after the # in your start URL is a fragment, which is handled client-side and never sent to the server, so the page=2 part has no effect on the response. The data itself can be fetched from the JSON endpoint the page calls.
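Here is a minimal sketch of that approach. The endpoint URL, request payload, and JSON field names below are assumptions for illustration; open the browser dev tools (Network tab), click to page 2 of the table, and copy the actual request and response structure the site uses.

import json

import scrapy


class EtfsApiSpider(scrapy.Spider):
    name = "etfs_api"

    # Hypothetical endpoint: confirm the real XHR URL in the Network tab.
    api_url = "https://etfdb.com/api/screener/"

    def start_requests(self):
        # Hypothetical payload: page number and sort options are sent in the
        # request body instead of the URL fragment.
        payload = {
            "page": 2,
            "sort_by": "assets_under_management",
            "sort_direction": "desc",
        }
        yield scrapy.Request(
            self.api_url,
            method="POST",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json"},
            callback=self.parse,
        )

    def parse(self, response):
        # The field names here are guesses; adjust them to match the real JSON response.
        for row in response.json().get("data", []):
            yield {
                "symbol": row.get("symbol"),
                "name": row.get("name"),
                "total assets": row.get("assets_under_management"),
                "average daily volume": row.get("average_volume"),
                "last closing price": row.get("price"),
            }

With this approach, pagination is just a matter of incrementing "page" in the payload and yielding another request, rather than changing the URL fragment.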