python Why does CoinMarketCap data only return the first 10 results, and not the remaining 90?

jvidinwx posted on 2023-09-29 in Python

I have no trouble scraping it, even for however many pages I specify, but each page only shows the first 10 results.

import requests
from bs4 import BeautifulSoup

def scrape_pages(page_num):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'
    }
    for page in range(1, page_num + 1):
        url = "https://coinmarketcap.com/?page={}".format(page)
        page_tree = requests.get(url, headers=headers)
        pageSoup = BeautifulSoup(page_tree.content, 'html.parser')

        print("Page {} Parsed successfully!".format(url))

ybzsozfc1#

That is because only the first ten results are present in the HTML you get back. The rest are added dynamically by JavaScript, so BeautifulSoup never sees them; they simply aren't there.
However, you can use an API to fetch the table data (for all the pages too, if you like).
Here's how:

from urllib.parse import urlencode

import requests
from tabulate import tabulate

# Query parameters for the listing endpoint: the first 100 coins,
# sorted by market cap (descending), with prices converted to USD.
query_string = [
    ('start', '1'),
    ('limit', '100'),
    ('sortBy', 'market_cap'),
    ('sortType', 'desc'),
    ('convert', 'USD'),
    ('cryptoType', 'all'),
    ('tagType', 'all'),
]

base = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?"
response = requests.get(f"{base}{urlencode(query_string)}").json()

# Keep just the name and the rounded price for each currency.
results = [
    [
        currency["name"],
        round(currency["quotes"][0]["price"], 4),
    ]
    for currency in response["data"]["cryptoCurrencyList"]
]

print(tabulate(results, headers=["Currency", "Price"], tablefmt="pretty"))

Output:

+-----------------------+------------+
|       Currency        |   Price    |
+-----------------------+------------+
|        Bitcoin        | 46204.9211 |
|       Ethereum        | 1488.0481  |
|        Tether         |   0.9995   |
|     Binance Coin      |  212.8729  |
|        Cardano        |    0.93    |
|       Polkadot        |  31.1603   |
|          XRP          |   0.4464   |
|       Litecoin        |  167.2676  |
|       Chainlink       |  25.1752   |
|     Bitcoin Cash      |  488.9875  |
|        Stellar        |   0.3724   |
|       USD Coin        |   0.9998   |
|                       |            |
|     and many more     |   values   |
+-----------------------+------------+
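If you would rather keep the data than just print it, a minimal sketch along these lines writes the same results list to a CSV file (the file name is just an example, not something from the original answer):

import csv

# Assumes `results` is the list of [name, price] pairs built in the snippet above.
with open("coinmarketcap_top100.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Currency", "Price"])
    writer.writerows(results)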

Edit: to loop over the pages, you might want to try the following:

from urllib.parse import urlencode

import requests

query_string = [
    ('start', '1'),
    ('limit', '100'),
    ('sortBy', 'market_cap'),
    ('sortType', 'desc'),
    ('convert', 'USD'),
    ('cryptoType', 'all'),
    ('tagType', 'all'),
]

base = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?"

with requests.Session() as session:
    response = session.get(f"{base}{urlencode(query_string)}").json()
    # Work out how many 100-item pages there are and the 'start' offset
    # of each one (1, 101, 201, ...).
    total_count = int(response["data"]["totalCount"])
    last_page = (total_count + 99) // 100  # number of pages, rounded up
    all_pages = [(page - 1) * 100 + 1 for page in range(1, last_page + 1)]

    for page in all_pages[:2]:  # Get the first two pages; remove the slice to get all pages.
        query_string = [
            ('start', str(page)),
            ('limit', '100'),
            ('sortBy', 'market_cap'),
            ('sortType', 'desc'),
            ('convert', 'USD'),
            ('cryptoType', 'all'),
            ('tagType', 'all'),
        ]
        response = session.get(f"{base}{urlencode(query_string)}").json()
        results = [
            [
                currency["name"],
                round(currency["quotes"][0]["price"], 4),
            ]
            for currency in response["data"]["cryptoCurrencyList"]
        ]
        print(results)
Note: I limited this example by adding [:2] to the for loop, but if you want to process all the pages, just remove the [:2] so the loop looks like this:
for page in all_pages:
    #  the rest of the body ...
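
If you do go through every page, one way to collect everything into a single list looks roughly like the sketch below. It reuses session, base and all_pages from the snippet above (so it has to run inside the same with requests.Session() block), and the one-second pause is just an assumed courtesy delay, not something the endpoint requires:

import time

all_results = []
for page in all_pages:
    query_string = [
        ('start', str(page)),
        ('limit', '100'),
        ('sortBy', 'market_cap'),
        ('sortType', 'desc'),
        ('convert', 'USD'),
        ('cryptoType', 'all'),
        ('tagType', 'all'),
    ]
    response = session.get(f"{base}{urlencode(query_string)}").json()
    all_results.extend(
        [currency["name"], round(currency["quotes"][0]["price"], 4)]
        for currency in response["data"]["cryptoCurrencyList"]
    )
    time.sleep(1)  # assumed courtesy delay between requests

print(len(all_results))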
