Extracting specific data from BeautifulSoup output

nuypyhwy  posted on 2021-08-20  in Java

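I am using the script below. It returns the data I want, but all I need is the "Updated" date part. I am trying to get rid of the text that follows it.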


# import library

from bs4 import BeautifulSoup
import requests

# Request to website and download HTML contents

url='https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content)
raw=soup.findAll(class_="module-content")[3].text
print(raw.strip())

This is the output I get:

Updated 1-19-2021

There are no views created for this resource yet.

The bold, italic line of the output (Updated 1-19-2021) is what I want to get, not the rest.

zujrkrfu #1

You can use the find_next() method, which returns the first match that follows the tag:

raw=soup.findAll(class_="module-content")[3].find_next(text=True)

Full example:

from bs4 import BeautifulSoup
import requests

# Request to website and download HTML contents

url='https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, "html.parser")
raw=soup.findAll(class_="module-content")[3].find_next(text=True)
print(raw.strip())

Output:

Updated 1-19-2021
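
To see why find_next() drops the trailing text, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the real page (its actual markup is an assumption), and string=True is the current spelling of the text=True argument used above. The .text attribute joins every text node under the tag, while find_next(string=True) stops at the first one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the live structure may differ.
html = """
<div class="module-content">
  Updated 1-19-2021
  <p>There are no views created for this resource yet.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find(class_="module-content")

# .text concatenates all text nodes inside the tag, including the <p> content.
full_text = block.text.strip()

# find_next(string=True) returns only the first text node after the opening tag.
first_text = block.find_next(string=True).strip()

print(first_text)
```

If the date were not the first text node inside the block, this approach would return something else, so it relies on the layout assumed here.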

bogh5gae #2

Try:

import requests
from bs4 import BeautifulSoup

url = "https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

print(soup.select_one(".inner-primary .module-content").contents[0].strip())

Prints:

Updated 1-19-2021
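
The one-liner above raises an error if the selector matches nothing or the block is empty. A slightly more defensive variant might look like the sketch below. The helper name first_text_of and the sample HTML are hypothetical, used only for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup standing in for the real page.
html = """
<div class="inner-primary">
  <div class="module-content">
    Updated 1-19-2021
    <p>There are no views created for this resource yet.</p>
  </div>
</div>
"""

def first_text_of(soup, selector):
    """Return the stripped first child of the first match if it is text, else None."""
    tag = soup.select_one(selector)
    if tag is None or not tag.contents:
        return None
    first = tag.contents[0]
    # Only plain text nodes are returned; a leading child tag yields None.
    return first.strip() if isinstance(first, NavigableString) else None

soup = BeautifulSoup(html, "html.parser")
print(first_text_of(soup, ".inner-primary .module-content"))
```

Returning None instead of raising lets the caller decide how to handle a page whose layout has changed.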
