Beautiful Soup is a Python library for extracting data from HTML and XML files.
Installing Beautiful Soup
Install with pip: pip install beautifulsoup4
Install a parser:
lxml parser: pip install lxml
html5lib parser: pip install html5lib
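A quick way to confirm the installs worked is to ask Python which parser modules it can find. The sketch below is only an illustration: the helper name available_parsers is made up for this post and is not part of Beautiful Soup.

```python
import importlib.util


def available_parsers():
    """Return the parser names Beautiful Soup could use in this environment."""
    # html.parser ships with Python, so it is always available.
    names = ["html.parser"]
    # lxml and html5lib are optional third-party packages.
    for module in ("lxml", "html5lib"):
        if importlib.util.find_spec(module) is not None:
            names.append(module)
    return names


print(available_parsers())
```

Any name in the returned list can be passed as the second argument to BeautifulSoup().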
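Parser trade-offs: Python's built-in html.parser needs no extra install; lxml is very fast but depends on an external C library; html5lib is the most lenient and parses pages the way a browser does, but it is very slow.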
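from bs4 import BeautifulSoup
with open("index.html") as fp:  # pass an open file handle (assumes index.html exists)
    soup = BeautifulSoup(fp, "html.parser")
soup = BeautifulSoup("<html>data</html>", "html.parser")  # or pass a string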
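First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
Then Beautiful Soup picks the most suitable parser to parse the document; if a parser is specified explicitly, Beautiful Soup uses that parser instead.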
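The Unicode conversion described above can be seen in a small sketch. It assumes the built-in html.parser; the variable names are made up for illustration, and from_encoding is passed so that Beautiful Soup does not have to guess the encoding of the byte input.

```python
from bs4 import BeautifulSoup

# An HTML entity in the input becomes the matching Unicode character.
entity_soup = BeautifulSoup("<p>Sacr&eacute; bleu!</p>", "html.parser")
entity_text = entity_soup.p.get_text()

# Raw bytes are decoded to Unicode; from_encoding names the encoding
# explicitly instead of leaving it to auto-detection.
raw_bytes = "<p>café</p>".encode("utf-8")
byte_soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
byte_text = byte_soup.p.get_text()

print(entity_text)
print(byte_text)
```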
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>", "lxml")
print(soup.prettify())  # print the tree with standard indentation
<html>
 <body>
  <p>
   data
  </p>
 </body>
</html>
Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into 4 kinds:
Tag;
NavigableString;
BeautifulSoup;
Comment;
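As a rough illustration of the four kinds, the sketch below parses one tiny document that happens to contain all of them. The sample markup and variable names are invented for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello<!-- note --></p>", "html.parser")

# The document itself, the <p> tag, and the two children of <p>.
nodes = [soup, soup.p] + list(soup.p.contents)
kinds = [type(node).__name__ for node in nodes]

print(kinds)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']
```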
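To get a Tag object from the BeautifulSoup object, use attribute access: the name after the dot is the tag name, so .b returns the first <b> tag and .p returns the first <p> tag.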
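A small sketch of this dot access, assuming the built-in html.parser and a made-up document. It also shows what happens when the tag does not exist.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><b>bold</b><p>two</p>", "html.parser")

first_p = soup.p      # first <p> in the document
first_b = soup.b      # first <b> in the document
missing = soup.table  # there is no <table>, so this is None

print(first_p, first_b, missing)
```

Because a missing tag gives None rather than an exception, it is worth checking the result before chaining further attribute access onto it.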
A Tag object corresponds to a tag in the original XML or HTML document:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(type(tag))
--- Output ---
<class 'bs4.element.Tag'>
Every tag has a name, which is available through .name:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag.name)
Output:
b
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
tag.name = "blockquote"  # renaming the tag changes the generated markup
print(tag)
Result:
<blockquote class="boldest">Extremely bold</blockquote>
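The tag <b class="boldest"> has a "class" attribute whose value is "boldest". Attributes are read the same way as dictionary entries, for example tag['class'], and the full attribute dictionary is available as .attrs:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
tag.name = "blockquote"
print(tag.attrs)  # read the tag's attributes
print(tag)
tag['class'] = 'verybold'  # modify an attribute
tag['id'] = 123  # add an attribute
print(tag)
del tag['class']  # delete an attribute
print(tag)
Output:
{'class': ['boldest']}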
<blockquote class="boldest">Extremely bold</blockquote>
<blockquote class="verybold" id="123">Extremely bold</blockquote>
<blockquote id="123">Extremely bold</blockquote>
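Since attributes behave like dictionary entries, reading one that is absent with square brackets raises KeyError. A minimal sketch of the safer lookups follows; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

has_class = tag.has_attr("class")     # the attribute is present
missing_id = tag.get("id")            # absent attribute: None, no exception
fallback_id = tag.get("id", "no-id")  # absent attribute: use a default

try:
    tag["id"]                         # square brackets on an absent attribute
    raised = False
except KeyError:
    raised = True

print(has_class, missing_id, fallback_id, raised)
```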
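Some attributes, most commonly class, can hold several values; Beautiful Soup returns such a multi-valued attribute as a list:
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
tag = css_soup.p
print(tag['class'])
Output:
['body', 'strikeout']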
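from bs4 import BeautifulSoup
# id is not defined as a multi-valued attribute, so its value stays a single string
css_soup = BeautifulSoup('<p id="body strikeout"></p>', "html.parser")
tag = css_soup.p
print(tag['id'])
Output: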
body strikeout
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p rel="body"></p>', "html.parser")
tag = css_soup.p
print(tag['rel'])
tag['rel'] = ['body', 'strikeout']  # a list assigned to an attribute is joined with spaces on output
print(tag['rel'])
print(tag)
Output:
body
['body', 'strikeout']
<p rel="body strikeout"></p>
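from bs4 import BeautifulSoup
# XML has no multi-valued attributes, so the "xml" parser (requires lxml) returns a plain string
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
tag = css_soup.p
print(tag['class'])
Output:
body strikeout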
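When code has to handle both cases, one option is get_attribute_list(), which always returns a list. This is a sketch that assumes a reasonably recent Beautiful Soup 4 release providing that method; the sample markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
tag = soup.p

class_values = tag.get_attribute_list("class")  # multi-valued: already a list
id_values = tag.get_attribute_list("id")        # single string wrapped in a list

print(class_values)
print(id_values)
```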
Beautiful Soup uses the NavigableString class to wrap the strings inside a tag:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag.string)
print(type(tag.string))
Output:
Extremely bold
<class 'bs4.element.NavigableString'>
In Python 2, the unicode() function converts a NavigableString object directly into a Unicode string:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag.string)
print(type(tag.string))
unicode_string = unicode(tag.string)  # Python 2 only: unicode() does not exist in Python 3
print(unicode_string)
print(type(unicode_string))
--- Output ---
Extremely bold
<class 'bs4.element.NavigableString'>
Extremely bold
<type 'unicode'>
In Python 3, str() does the same job:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag.string)
print(type(tag.string))
unicode_string = str(tag.string)
print(unicode_string)
print(type(unicode_string))
--- Output ---
Extremely bold
<class 'bs4.element.NavigableString'>
Extremely bold
<class 'str'>
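Why convert at all? A NavigableString is already a kind of str, but it keeps a link back into the parse tree, while the str() copy does not. The sketch below illustrates this under Python 3 with made-up variable names.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
nav = soup.b.string

is_text = isinstance(nav, str)  # NavigableString is a subclass of str
parent_name = nav.parent.name   # it still knows which tag contains it

plain = str(nav)                # plain copy with no link to the tree
plain_has_parent = hasattr(plain, "parent")

print(is_text, parent_name, type(plain), plain_has_parent)
```

Converting is mainly useful when the text is kept around after the soup itself is no longer needed.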
The string inside a tag can be swapped for another string with the replace_with() method:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag)
tag.string.replace_with("change string")
print(tag)
Output:
<b class="boldest">Extremely bold</b>
<b class="boldest">change string</b>
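replace_with() also hands back the string it removed, which can be handy for logging or undoing a change. A small sketch, with variable names invented for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# The return value is the old string, now detached from the tree.
old = tag.string.replace_with("change string")

print(old)
print(old.parent)
print(tag)
```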
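The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object: it supports most of the methods described for navigating and searching the tree. It is not a real HTML or XML tag, so it has no tag name of its own, but checking .name is convenient, so the BeautifulSoup object carries a special .name attribute whose value is "[document]".
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
print(soup.name)
Output: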
[document]
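To make "can be treated as a Tag" concrete, here is a sketch that calls Tag methods directly on the soup. The sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

acts_like_tag = isinstance(soup, Tag)                 # BeautifulSoup subclasses Tag
paragraphs = [p.string for p in soup.find_all("p")]   # searching works on the soup
top_level = [child.name for child in soup.children]   # so does navigating

print(acts_like_tag, paragraphs, top_level)
```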
Tag, NavigableString, and BeautifulSoup cover almost everything in HTML and XML, but a few special objects remain, such as the comments in a document.
Example:
from bs4 import BeautifulSoup
annotate = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"  # the content is a comment
text = "<b>This is text</b>"
soup_annotate = BeautifulSoup(annotate, "html.parser")
soup_text = BeautifulSoup(text, "html.parser")
comment_annotate = soup_annotate.b.string
comment_text = soup_text.b.string
print(type(comment_annotate))
print(type(comment_text))
Result:
<class 'bs4.element.Comment'>
<class 'bs4.element.NavigableString'>
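Because Comment is a special kind of NavigableString, comments can be picked out of a document by type. The following is a sketch with an invented document; it assumes a Beautiful Soup version whose find_all() accepts the string argument.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = "<div><!-- first --><p>text</p><!-- second --></div>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the strings that are Comment objects.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
texts = [c.strip() for c in comments]

# get_text() collects ordinary strings and leaves comments out.
visible = soup.get_text()

print(texts)
print(visible)
```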
Copyright notice: this article is a repost; copyright belongs to the original author.
Original link: https://blog.csdn.net/S_numb/article/details/120200992