Scrapy not scraping from the correct URL

waxmsbnn · posted 2022-11-09

I'm building a Scrapy bot that scrapes ETFs from a website, but I can't get it to handle pagination. I want it to scrape the second page, but when I try, it scrapes the base URL instead.
Code:

import scrapy

class EtfsSpider(scrapy.Spider):
    name = "etfs"
    start_urls = ['https://etfdb.com/etfs/asset-class/bond/#etfs&sort_name=assets_under_management&sort_order=desc&page=2']

    def parse(self, response):
        etf_table = response.css('table#etfs tbody')

        for etf in etf_table.css('tr'):
            symbol = etf.css('td[data-th="Symbol"] a::text').get()
            name = etf.css('td[data-th="ETF Name"] a::text').get()
            total_assets = etf.css('td[data-th="Total Assets ($MM)"]::text').get()
            avg_daily_vol = etf.css('td[data-th="Avg. Daily Volume"]::text').get()
            closing_price = etf.css('td[data-th="Previous Closing Price"]::text').get()

            yield {
                "symbol": symbol,
                "name": name,
                "total assets": total_assets,
                "average daily volume": avg_daily_vol,
                "last closing price": closing_price
            }

As I understand it, this should request the URL in start_urls, which in this case is the second page of the ETF table, but this is the output I get in the console:

2022-08-13 22:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://etfdb.com/robots.txt> (referer: None)
2022-08-13 22:36:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://etfdb.com/etfs/asset-class/bond/#etfs&sort_name=assets_under_management&sort_order=desc&page=2> (referer: None)
2022-08-13 22:36:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://etfdb.com/etfs/asset-class/bond/>
{'symbol': 'BND', 'name': 'Vanguard Total Bond Market ETF', 'total assets': '$84,446.60', 'average daily volume': None, 'last closing price': '$75.95'}

So it says it crawled the correct URL, but when it actually scrapes the items/data it does so from the base URL, which is really just the first page. I can't figure out how to fix this.

ffscu2ro1#

Everything after the # in your URL is a fragment: it is handled client-side by the browser and is never sent to the server, so Scrapy fetches the same first page regardless of the page=2 value. The table itself is generated with JavaScript, and the underlying data can be fetched directly from a JSON endpoint.
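
A quick way to confirm the fragment behaviour with nothing but the standard library (the URL is the one from your start_urls):

from urllib.parse import urldefrag

url = 'https://etfdb.com/etfs/asset-class/bond/#etfs&sort_name=assets_under_management&sort_order=desc&page=2'
base, fragment = urldefrag(url)
print(base)      # https://etfdb.com/etfs/asset-class/bond/  <- what the server actually receives
print(fragment)  # etfs&sort_name=assets_under_management&sort_order=desc&page=2  <- kept client-side only

The spider below skips the HTML page entirely and requests the JSON endpoint directly, paging through it by incrementing an offset: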

import scrapy

class EtfsSpider(scrapy.Spider):
    name = "etfs"
    # I got the JSON url from the "network" tab in the browser's devtools
    start_urls = ['https://etfdb.com/data_set/?tm=92960&cond={%22by_type%22:[%22Etfdb::EtfType%22,374,null,false,false]}&no_null_sort=true&count_by_id=&limit=25&sort=assets_under_management&order=desc&limit=25&offset=']

    # the offset start with zero and incremented with "limit" value every page
    offset = 0

    # I got the headers from the "network" tab in the browser's devtools
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/json",
        "DNT": "1",
        "Host": "etfdb.com",
        "Pragma": "no-cache",
        "Referer": "https://etfdb.com/etfs/asset-class/bond/",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "Sec-GPC": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }

    # use delay to reduce the chance of getting blocked
    custom_settings = {'DOWNLOAD_DELAY': 0.3}

    def start_requests(self):
        url = self.start_urls[0] + str(self.offset)
        yield scrapy.Request(url=url, headers=self.headers)

    def parse(self, response):
        json_data = response.json()
        # stopping condition
        if not json_data['rows']:
            self.logger.info("Finished scraping")
            return

        for item in json_data['rows']:
            # "symbol" and "name" are both html values so we need to use selectors in order to get the text
            symbol = item['symbol']
            selector = scrapy.Selector(text=symbol, type="html")
            symbol = selector.xpath('//text()').get()

            name = item['name']
            selector = scrapy.Selector(text=name, type="html")
            name = selector.xpath('//text()').get()

            total_assets = item['assets_under_management']
            avg_daily_vol = item['three_month_average_volume']
            closing_price = item['price']

            yield {
                "symbol": symbol,
                "name": name,
                "total assets": total_assets,
                "average daily volume": avg_daily_vol,
                "last closing price": closing_price
            }

        # next page
        self.offset += 25
        url = self.start_urls[0] + str(self.offset)
        yield scrapy.Request(url=url, headers=self.headers)
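
To try this outside a full Scrapy project you can either save it as a file and use scrapy runspider, or drive it from the same script with CrawlerProcess (a minimal sketch; the output filename etfs.jl is just an example):

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess(settings={
        # write one JSON object per line to the example output file
        "FEEDS": {"etfs.jl": {"format": "jsonlines"}},
    })
    process.crawl(EtfsSpider)
    process.start()  # blocks until the crawl finishes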
