Data is empty when scraping a website with axios (cheerio.js)

7dl7o3gd  posted on 2023-10-18  in  iOS

I am trying to extract data from the CDC website.
I'm using cheerio.js to get the data, and I copied the HTML selector into my code, like this:

const listItems = $('#tab1_content > div > table > tbody > tr:nth-child(1) > td:nth-child(3)');

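**However, when I run the program, I only get an empty array. How is that possible? I copied the HTML selector verbatim into my code, so why doesn't this work?** Here is a short video showing the problem: https://youtu.be/a3lqnO_D4pM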

Here is my full code, along with a link where you can run it:

const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

// URL of the page we want to scrape
const url = "https://nccd.cdc.gov/DHDSPAtlas/reports.aspx?geographyType=county&state=CO&themeId=2&filterIds=5,1,3,6,7&filterOptions=1,1,1,1,1";

// Async function which scrapes the data
async function scrapeData() {
  try {
    // Fetch HTML of the page we want to scrape
    const { data } = await axios.get(url);
    // Load HTML we fetched in the previous line
    const $ = cheerio.load(data);
    // Select the table cell using the selector copied from the browser
    const listItems = $('#tab1_content > div > table > tbody > tr:nth-child(1) > td:nth-child(3)');
    // Stores data in array
    const dataArray = [];
    // Use .each method to loop through the elements
    listItems.each((idx, el) => {
      // Object holding data
      const dataObject = { name: ""};
      // Store the textcontent in the above object
      dataObject.name = $(el).text();
      // Populate array with data
      dataArray.push(dataObject);
    });
    // Log array to the console
    console.dir(dataArray);
  } catch (err) {
    console.error(err);
  }
}
// Invoke the above function
scrapeData();

Run the code here: https://replit.com/@STCollier/Web-Scraping#index.js
Thanks for your help.

nc1teljy


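The data is added dynamically after the page loads, so the content returned by axios does not contain it. axios only downloads the initial HTML and cheerio only parses it; neither runs the page's JavaScript, so the selector matches nothing.
One approach that currently works is to use Puppeteer to intercept the network request.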
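A quick way to confirm this is to check whether the raw HTML string contains any of the markup the selector depends on. The sketch below uses two made-up HTML snippets (stand-ins, not the actual CDC page): one mimics what a plain HTTP request might receive, the other mimics the DOM after scripts have run. The helper name `hasTableRows` is hypothetical.

```javascript
// Hypothetical stand-in for the HTML a plain HTTP request receives:
// the container exists, but no script has filled it in yet.
const staticHtml = '<div id="tab1_content"></div>';

// Hypothetical stand-in for the DOM after the browser has run the scripts.
const renderedHtml =
  '<div id="tab1_content"><div><table><tbody>' +
  '<tr><td>a</td><td>b</td><td>27.6</td></tr>' +
  '</tbody></table></div></div>';

// Rough check: does the markup contain any table row at all?
// This is a plain string test, not a real HTML parser.
function hasTableRows(html) {
  return /<tr[\s>]/i.test(html);
}

console.log(hasTableRows(staticHtml));   // false
console.log(hasTableRows(renderedHtml)); // true
```

In the original script, the same check could be applied to the `data` string returned by `axios.get(url)`; if it reports `false`, the rows are not in the downloaded HTML and no selector will find them.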

const puppeteer = require("puppeteer"); // ^21.0.2

const url =
  "https://nccd.cdc.gov/DHDSPAtlas/reports.aspx?geographyType=county&state=CO&themeId=2&filterIds=5,1,3,6,7&filterOptions=1,1,1,1,1";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const urlPrefix =
    "https://nccd-proxy.services.cdc.gov/DHDSP_ATLAS/report/state";
  const responseP = page.waitForResponse(res =>
    res.url().startsWith(urlPrefix) &&
    res.request().method() === "POST"
  );
  await page.goto(url);
  const response = await responseP;
  console.log(await response.json());
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

{
  TitleLong: 'Heart Disease Hospitalization Rate per 1,000 Medicare Beneficiaries, All Races/Ethnicities, All Genders, Ages 65+, 2018-2020',
  TitleShort: 'Heart Disease Hospitalization Rate per 1,000 Medicare Beneficiaries',
  TitleLegend: 'Age-Standardized Rate per 1,000 Beneficiaries',
  ReportText: 'heart disease hospitalization rate for All Races/Ethnicities, All Genders, Ages 65+ for  is ',
  Data: [
    {
      StateValue: 27.6,
      NationalValue: 41.6,
      RaceName: 'All Races/Ethnicities'
    },
    { StateValue: 36, NationalValue: 51.6, RaceName: 'Black' },
    { StateValue: 27.6, NationalValue: 41.5, RaceName: 'White' },
    { StateValue: 24.9, NationalValue: 32.6, RaceName: 'Hispanic' }
  ]
}
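Once the JSON is available, individual values can be read directly from the `Data` array instead of from the HTML table. This is a minimal sketch, assuming the response keeps the shape shown in the output above; the `report` object is a trimmed copy of that output, and `stateValueByRace` is a hypothetical helper name.

```javascript
// Trimmed copy of the response shown above; in the Puppeteer script this
// would be the result of `await response.json()`.
const report = {
  TitleShort: "Heart Disease Hospitalization Rate per 1,000 Medicare Beneficiaries",
  Data: [
    { StateValue: 27.6, NationalValue: 41.6, RaceName: "All Races/Ethnicities" },
    { StateValue: 36, NationalValue: 51.6, RaceName: "Black" },
    { StateValue: 27.6, NationalValue: 41.5, RaceName: "White" },
    { StateValue: 24.9, NationalValue: 32.6, RaceName: "Hispanic" },
  ],
};

// Build a lookup from race/ethnicity name to the state-level value.
function stateValueByRace(rep) {
  return Object.fromEntries(rep.Data.map(row => [row.RaceName, row.StateValue]));
}

const byRace = stateValueByRace(report);
console.log(byRace["All Races/Ethnicities"]); // 27.6
console.log(byRace.Black);                    // 36
```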

If you want to click buttons on the page to adjust the filters, Puppeteer can do that too.
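As a rough sketch of that idea, the helper below starts waiting for the matching response, clicks an element, and returns the parsed JSON. It relies on Puppeteer's `page.waitForResponse` and `page.click`. The function name `clickAndCapture` and the selector in the usage comment are hypothetical, since the selectors of the filter controls on the CDC page are not given here.

```javascript
// Clicks `selector` on a Puppeteer page and resolves with the JSON body of
// the first response whose URL starts with `urlPrefix`.
// `page` is expected to be a Puppeteer Page (e.g. from browser.pages()).
async function clickAndCapture(page, selector, urlPrefix) {
  // Start waiting before clicking so the response cannot be missed.
  const responseP = page.waitForResponse(res => res.url().startsWith(urlPrefix));
  await page.click(selector);
  const response = await responseP;
  return response.json();
}

// Usage inside the script above, after `await page.goto(url)`.
// "#some-filter-button" is a placeholder, not a real selector from the page:
// const data = await clickAndCapture(page, "#some-filter-button", urlPrefix);
```

Creating the response promise before the click mirrors the pattern in the answer's script, where `waitForResponse` is set up before `page.goto`.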
