How to retrieve the values of dynamic HTML content using Python

fkaflof6 posted on 2021-07-13 in Java


I am using Python 3 and trying to retrieve data from a website. However, the data is loaded dynamically, and my current code does not work:

url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url);

response = request.urlopen(url)
data = str(response.read(10000))

data = data.replace("\\n", "\n")
print(data)

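When I look for a specific value, I find a template placeholder such as "{{formatprice median}}" instead of "4.48".
How can I retrieve the value rather than the placeholder text?
Edit: This is the specific page I am trying to extract information from. I am trying to get the "median" value, which uses the template {{formatprice median}}.
Edit 2: I have installed and set up my program to use Selenium and BeautifulSoup.
My current code is: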

from bs4 import BeautifulSoup
from selenium import webdriver

# ...

driver = webdriver.Firefox()
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html)

print("Finding...")

for tag in soup.find_all('formatPrice median'):
    print(tag.text)

This is a screenshot of the program as it runs. Unfortunately, it does not seem to find anything with "formatprice median" specified.

python Html templates urllib

Source: https://stackoverflow.com/questions/67290612/how-to-webscrape-baseball-reference-page-to-get-monthly-performance

3 Answers


ztmd8pv51#

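Assuming you are trying to get values from a page that is rendered with JavaScript templates (for example, Handlebars), the placeholder text is what you will get from any of the standard solutions (i.e. beautifulsoup or requests).
This is because the browser uses JavaScript to alter the content it receives and create new DOM elements. urllib does the request part like a browser, but not the template-rendering part. A detailed description of these issues can be found here. That article discusses three main solutions:
1. Parse the AJAX JSON directly
2. Use an offline JavaScript interpreter to process the request: SpiderMonkey, crowbar
3. Use a browser automation tool: splinter
This answer gives more suggestions for option 3, such as selenium or watir. I have used selenium for automated web testing, and it is very handy.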
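Option 1 can be sketched roughly as follows, assuming the page fills its template from a JSON endpoint that you can spot in the browser's network tab. The helper names, the example URL in the comment, and the `{"sell": {"median": ...}}` layout are all hypothetical; the real endpoint and payload shape have to be read off the site itself.

```python
import json
from urllib import request


def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON."""
    with request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def extract_median(payload):
    """Pull the median price out of a decoded payload.

    The {"sell": {"median": ...}} layout is an assumption for illustration.
    """
    return float(payload["sell"]["median"])


# Hypothetical usage; replace the URL with the request the page really makes:
# data = fetch_json("https://example.com/api/quicklook?typeid=34")
# print(extract_median(data))
```

If such an endpoint exists, this avoids running a browser at all, since the JSON already holds the values that the template would otherwise render.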
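Edit
From your comments it looks like this is a Handlebars-driven site. I recommend selenium and Beautiful Soup. This answer provides a good code example that may be useful: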

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

# page_source holds the HTML after the browser has run the page's JavaScript
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

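Basically, selenium gets the rendered HTML from the browser, which you can read from the page_source property and then parse with BeautifulSoup. Good luck :)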
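A quick way to tell whether the HTML you are holding was actually rendered is to look for leftover template expressions. The following is a minimal sketch, assuming Handlebars-style double braces; `find_placeholders` is a hypothetical helper name, not part of any library.

```python
import re

# Matches Handlebars-style expressions such as {{formatPrice median}}.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def find_placeholders(html):
    """Return the expressions of any template placeholders left in the HTML."""
    return PLACEHOLDER.findall(html)
```

An empty list for driver.page_source suggests the template was filled in; a non-empty list (as with the raw urllib response) suggests the JavaScript has not run yet.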

2021-07-13

dw1jzc5e2#

I use selenium + Chrome

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.sitetotarget.com"  # placeholder; driver.get() needs the scheme
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

赞(0）回复(0）举报 2021-07-13

llycmphe3#

Building on the other answer, but more complete.

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.sitetotarget.com"  # placeholder; replace with the page to scrape
options = Options()
options.add_argument('--headless')  # background task; don't open a window
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')  # I copied this, so IDK?
options.add_argument('--disable-dev-shm-usage')  # this too
driver = webdriver.Chrome(options=options)  # start Chrome with these options
driver.get(url)  # set browser to use this page
time.sleep(6)  # let the scripts load
html = driver.page_source  # copy from chrome process to your python instance
driver.quit()
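The fixed time.sleep(6) either waits longer than needed or not long enough on a slow connection. One alternative is to poll until the page looks rendered. The sketch below is a generic, standard-library polling helper; `wait_until` and its parameters are hypothetical names for illustration. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui, which serves the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

With a driver in hand, the condition could be a small function that returns driver.page_source once it no longer contains "{{" and None otherwise, so that wait_until hands back the rendered HTML as soon as it is ready.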

Mac + Chrome installation:

pip install selenium
brew install --cask chromedriver
brew install --cask google-chrome

2021-07-13
