Selenium: unable to select the correct div from a webpage

Posted by p1iqtdky on 2022-11-10 in Other


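I'm trying to parse the song titles from a website, but I can't figure out how to grab the specific div that contains them. I've tried a dozen different approaches and always end up with an empty list.
If you open the URL and inspect one of the YouTube videos, you'll find a div with the class single-post-oembed-youtube-wrapper. That element also contains the artist and title of the song.
This is my first attempt at scraping data from a webpage. Can anyone help?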

import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pprint
from webdriver_manager.chrome import ChromeDriverManager
import sys

html = None
url = 'https://ultimateclassicrock.com/best-rock-songs-2018/'

browser = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
browser.get(url)

soup = BeautifulSoup(browser.page_source, 'html.parser')
divs = soup.find_all("div", {"class":"single-post-oembed-youtube-wrapper'"})

# all_songs = browser.find_elements(By.CLASS_NAME, 'single-post-oembed-youtube-wrapper')

# html = all_songs.get_attribute("outerHTML")

pprint.pprint(divs)
browser.close()

selenium

Source: https://stackoverflow.com/questions/74331208/unable-to-select-the-correct-div-from-a-webpage

3 Answers


t2a7ltrp1#

Try this:

soup = BeautifulSoup(browser.page_source, 'html.parser')
# select() accepts a CSS selector; find_all() would treat this string as a tag name
titles = soup.select(".single-post-oembed-youtube-wrapper+div p strong")

This will give you all the titles there.
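As a quick offline check, the sketch below runs the selector against a small hand-written snippet. The markup in `SAMPLE_HTML` is hypothetical and only mirrors the structure the selector implies (a wrapper div followed by a sibling div holding the title); the real page may differ. It illustrates two points: a CSS selector string needs `select()`, because `find_all()` takes its first argument as a tag name, and the class string in the question ends with a stray apostrophe, which by itself prevents a match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the structure the selector implies.
SAMPLE_HTML = """
<div class="single-post-oembed-youtube-wrapper"><iframe></iframe></div>
<div><p><strong>Steve Perry, 'Sun Shines Gray'</strong></p></div>
<div class="single-post-oembed-youtube-wrapper"><iframe></iframe></div>
<div><p><strong>Judas Priest, 'Flamethrower'</strong></p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The class string from the question has a trailing apostrophe, so nothing matches.
with_stray_quote = soup.find_all("div", {"class": "single-post-oembed-youtube-wrapper'"})
# Without the stray character, both wrapper divs are found.
wrappers = soup.find_all("div", {"class": "single-post-oembed-youtube-wrapper"})

# find_all() looks for a tag literally named like the selector string: no results.
by_find_all = soup.find_all(".single-post-oembed-youtube-wrapper+div p strong")
# select() parses the string as a CSS selector.
by_select = soup.select(".single-post-oembed-youtube-wrapper + div p strong")
titles = [tag.get_text(strip=True) for tag in by_select]

print(len(with_stray_quote), len(wrappers), len(by_find_all))
print(titles)
```

`select()` returns `Tag` objects, so `get_text()` is used here to pull out the plain strings.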

Posted 2022-11-10

njthzxwz2#

You can also try retrieving the data directly from the HTML source, which avoids Selenium altogether.

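import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://ultimateclassicrock.com/best-rock-songs-2018/"
res = requests.get(url)
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(res.content, "html.parser")

results = []
for elem in soup.find_all("strong"):
    # Entries look like "Artist, 'Song'"; split only on the first ", "
    if ", " in elem.text:
        results.append(elem.text.split(", ", 1))

df = pd.DataFrame(results, columns=["artist", "song"])
print(df)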

Output:

   artist          song
0  Steve Perry     'Sun Shines Gray'
1  Paul McCartney  'I Don't Know'
2  Judas Priest    'Flamethrower'
3  Ace Frehley     'Rocking With the Boys'
4  Paul Simon      'Questions for the Angels'
...

This is a bit hacky, but it works for your example.
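If the comma test picks up unrelated `<strong>` text, a slightly stricter parse may help. The sketch below is one possible approach, not part of the original answer: it assumes every entry has the form `Artist, 'Song'` with straight single quotes, as in the output above, and `parse_artist_song` is a hypothetical helper name.

```python
import re

# Assumed format: Artist, 'Song Title' (straight quotes, as in the sample output).
PATTERN = re.compile(r"^(?P<artist>.+?),\s*'(?P<song>.+)'$")


def parse_artist_song(text):
    """Return (artist, song) if text looks like "Artist, 'Song'", else None."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    return match.group("artist"), match.group("song")


samples = [
    "Paul McCartney, 'I Don't Know'",   # apostrophe inside the title
    "Read More: Top 10 Songs, Ranked",  # has a comma but is not a song entry
]
for sample in samples:
    print(parse_artist_song(sample))
```

Entries that return `None` can simply be skipped before building the DataFrame. If the live page uses curly quotes instead, the pattern would need to be adjusted.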

Posted 2022-11-10

rdrgkggo3#

You can get all the data you need from the API.

import requests

api_url= 'https://ultimateclassicrock.com/rest/carbon/api/menu/category/album-reviews/'
headers={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
data=[]

res=requests.get(api_url,headers=headers)

# print(res)

for item in res.json()['widgets']['dataDetails'].values():
    title = item['data']['mainData']['title']
    print(title)

Output:

Reissue Roundup: Summer Sets From Blondie, Lou Reed and More
Todd Rundgren, &apos;Space Force&apos;: Album Review
Pink Floyd, &apos;Animals (2018 Remix)&apos;: Album Review
Sammy Hagar and the Circle, &apos;Crazy Times&apos;: Album Review
Ringo Starr, &apos;EP3&apos;: Album Review
Billy Idol, &apos;The Cage EP&apos;: Album Review
Beatles, &apos;Revolver Special Edition (Super Deluxe)&apos;: Album Review
Richard Marx, &apos;Songwriter&apos;: Album Review
The Cult, &apos;Under the Midnight Sun&apos;: Album Review
Various, &apos;Here It Is: A Tribute to Leonard Cohen&apos;: Album Review
Red Hot Chili Peppers, &apos;Return of the Dream Canteen&apos;: Review
Skid Row, &apos;The Gang&apos;s All Here&apos;: Album Review
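The titles above still contain the HTML entity `&apos;`. As a small follow-up sketch (not part of the original answer), the standard library's `html.unescape` can decode them; the two strings in `raw_titles` are copied from the output above.

```python
import html

# Two titles copied from the output above, still carrying HTML entities.
raw_titles = [
    "Todd Rundgren, &apos;Space Force&apos;: Album Review",
    "Skid Row, &apos;The Gang&apos;s All Here&apos;: Album Review",
]

# html.unescape() converts named and numeric character references to plain text.
clean_titles = [html.unescape(title) for title in raw_titles]

for title in clean_titles:
    print(title)
```

In the loop from the answer, the same call could be applied to each `title` before printing it.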

Posted 2022-11-10
