Extracting array values from JavaScript with Beautiful Soup

omqzjyyz  posted on 2022-12-17  in  Java

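I'm trying to build a scraper in Python that reads a variable from the JavaScript code embedded in a web page's HTML. The value of this variable changes over time. I need the first number in the yValues variable: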

jQuery(document).ready(function() {
  var draw = true;
  
  if ("Biblioteca di Ingegneria" == "") {
    draw = false;
  }
  
  if (draw) {
    var yValues = [
        "28",
        "100"
      ];
    var Titolo = "Biblioteca di Ingegneria";
    var sottoTitolo = "Posti Totali: 128";
    var barColors = [
        "#167d21",
        "#ed2135"
      ];
    var xValues = [
        "Liberi (28)",
        "Occupati (100)"
      ];
    
    new Chart("InOutChart", {
      type: "pie",
      data: {
        labels: xValues,
        datasets: [
          {
            backgroundColor: barColors,
            data: yValues
          }
        ]
      },
      options: {
        plugins: {
          title: {
            display: true,
            text: Titolo,
            font: {
              size: 25,
              style: 'normal',
              lineHeight: 1.2
            },
            // padding: {
            //   top: 10,
            //   bottom: 30
            // }
          },
          subtitle: {
            display: true,
            text: sottoTitolo,
            font: {
              size: 20,
              style: 'normal',
              lineHeight: 1.2
            },
            padding: {
              bottom: 30
            }
          },
          legend: {
            display: true,
            position: "bottom",
            labels: {
              font: {
                size: 20,
                style: 'normal',
                lineHeight: 1.2
              }
            }
          }
        },
        responsive: true,
        maintainAspectRatio: false,
        scales: {
          yAxes: [
            {
              display: true,
              ticks: {
                beginAtZero: true
              }
            }
          ]
        }
      }
    });
  }
});

This is the best I could do:

from bs4 import BeautifulSoup
import requests

# Make a GET request to the URL of the web page.
base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)

# Parse the HTML content of the page.
soup = BeautifulSoup(response.text, "html.parser")

# Find all the `<script>` elements on the page.
scripts = soup.find_all("script")

# Get the 8th `<script>` element.
script8 = scripts[7]

# Transform the 8th `<script>` into a string.
script8_txt = "".join(script8)

# Get the useful substring from the 8th `<script>` (hard-coded character offsets).
useful_txt = script8_txt[248:251]

# Keep only the digits and convert them to an int.
pl = int("".join(filter(str.isdigit, useful_txt)))

print(pl)

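This works, but I would like to parse the JavaScript code automatically to find the variable and get its value, because, as you can see, I manually checked the position of the characters I need. I'm looking for a better solution because I plan to reuse this code on other similar web pages, where the position of the variable changes every time. I want to put this Python code in an Alexa skill, so I don't know whether the Selenium package would work well there.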

ss2ws0br  #1

Try this:

import ast

import requests
from bs4 import BeautifulSoup

base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)

script = (
    BeautifulSoup(response.text, "html.parser")
    .find_all("script")[7]
    .string
)
numbers = ast.literal_eval(
    script.strip().split("var yValues = ")[1].split(";")[0]
)
print(numbers)
print(numbers[0])

Output:

['130', '0']
130
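
The snippet above still relies on the script being the 8th `<script>` element, which the question says may differ on other pages. One possible way around that, sketched here as an assumption rather than as part of the original answer, is to search the text for the `var yValues = [...]` assignment with a regular expression and parse the array as JSON. The helper name `extract_js_array` and the inline `sample_html` are hypothetical; the sample is trimmed from the question's page so the sketch runs without a network request. It only handles flat arrays of double-quoted strings or numbers, like the ones shown in the question.

```python
import json
import re


def extract_js_array(source, name):
    """Return the array literal assigned to `var <name>` in `source` as a Python list."""
    # Match `var <name> = [ ... ]`. The non-greedy group stops at the first `]`,
    # so this only works for flat arrays (no nested brackets, no "]" inside strings).
    pattern = re.compile(
        r"var\s+" + re.escape(name) + r"\s*=\s*(\[.*?\])", re.DOTALL
    )
    match = pattern.search(source)
    if match is None:
        raise ValueError(f"variable {name!r} not found")
    # The arrays in the question use double-quoted strings, which is valid JSON.
    return json.loads(match.group(1))


# Hypothetical inline sample standing in for response.text.
sample_html = """
<script>
jQuery(document).ready(function() {
  if (draw) {
    var yValues = [
        "28",
        "100"
      ];
    var xValues = [
        "Liberi (28)",
        "Occupati (100)"
      ];
  }
});
</script>
"""

y_values = extract_js_array(sample_html, "yValues")
free_seats = int(y_values[0])
print(free_seats)
```

With a live page, the same helper could be given `response.text` directly, or the `.string` of each element returned by `soup.find_all("script")` until one contains the variable. Either way no script index or character offset is hard-coded, and nothing beyond `requests` is needed, which may matter for the Alexa skill mentioned in the question.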
