csv: script does not return the correct output when trying to retrieve data from a newsletter

euoag5mw · posted 2022-12-15 · in: Other
Follow (0) | Answers (2) | Views (139)

I am trying to write a script that retrieves the album titles and band names from a music store's newsletter. The band names and album titles are nested inside h3 and h4 elements. When I run the script, the csv file comes out blank.

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)

# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all 'a' elements with the class 'row'
albums = soup.find_all('a', attrs={'class': 'row'})

# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
  album_title_element = album.find('td', attrs={'td_class': 'h3 class'})
  band_name_element = album.find('td', attrs={'td_class': 'h4 class'})
  album_title.append(album_title_element.text)
  band_name.append(band_name_element.text)

# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')

I think the error is in the attrs part, but I don't know how to fix it properly. Thanks in advance!

inkz8wg9


from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)

# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all 'td' cells with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})

# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
  album_title_element = album.find('h3', attrs={'class': 'header'})
  band_name_element = album.find('h4', attrs={'class': 'header'})
  album_title.append(album_title_element.text)
  band_name.append(band_name_element.text)

# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')

Thanks for the help, unsung hero!

fumotvh3


Looking at your code, I agree the error is in the attrs part. The problem you are facing is that the site you are trying to scrape contains no 'a' elements with the class 'row', so find_all returns an empty list. There are plenty of 'div' elements with the class 'row'; perhaps you meant to look for those?
Looking for 'td' elements and extracting their 'h3' and 'h4' elements is the right idea, but since albums is an empty list, there is nothing to iterate over.
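To see why the original loop silently produced nothing, note that find_all simply returns an empty list when no element matches; looping over that list does zero iterations. A minimal illustration on a made-up HTML snippet (not the real newsletter markup):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the 'row' class sits on a <div>, not an <a>
html = '<div class="row"><h3>Album</h3><h4>Band</h4></div>'
soup = BeautifulSoup(html, 'html.parser')

# No <a class="row"> exists, so find_all returns an empty list
print(soup.find_all('a', attrs={'class': 'row'}))         # []

# The class is on a <div>, so searching for divs does match
print(len(soup.find_all('div', attrs={'class': 'row'})))  # 1
```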
I modified the code slightly to look for the 'td' elements directly and extract their 'h3' and 'h4' elements.

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)

# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all 'td' cells with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})

# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
  album_title_element = album.find('h3')
  band_name_element = album.find('h4')
  album_title.append(album_title_element.text)
  band_name.append(band_name_element.text)

# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv', index=False)

I also took the liberty of adding index=False to the last line, so that each row doesn't start with a leading comma (the row index).
Hope this helps.
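The effect of index=False can be seen without writing a file, since to_csv returns a string when no path is given (the one-row frame below is made-up data, just matching the script's columns):

```python
import pandas as pd

# Hypothetical single-row frame with the same columns as the script
df = pd.DataFrame({'album_title': ['Abbey Road'], 'band_name': ['The Beatles']})

# Default: pandas writes the row index as an unnamed first column
print(df.to_csv())             # ",album_title,band_name" then "0,Abbey Road,The Beatles"

# index=False drops it, so each row starts with the actual data
print(df.to_csv(index=False))  # "album_title,band_name" then "Abbey Road,The Beatles"
```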
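One related pitfall neither answer guards against: if a 'td' cell lacks an 'h3' or 'h4', find returns None and .text raises AttributeError. A defensive sketch on made-up HTML (not the real newsletter), skipping incomplete cells so the two lists stay aligned:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure; the second cell has no h4
html = """
<table>
  <td class="block__cell"><h3>Album A</h3><h4>Band A</h4></td>
  <td class="block__cell"><h3>Album B</h3></td>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

album_title = []
band_name = []
for cell in soup.find_all('td', attrs={'class': 'block__cell'}):
    h3 = cell.find('h3')
    h4 = cell.find('h4')
    if h3 is None or h4 is None:
        continue  # skip incomplete cells instead of crashing on .text
    album_title.append(h3.text.strip())
    band_name.append(h4.text.strip())

print(album_title)  # ['Album A']
print(band_name)    # ['Band A']
```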
