html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
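The simplest way to navigate the parse tree is to name the tag you want.
To get the <head> tag, use soup.head.
You can chain these attribute lookups to move deeper into the tree. For example, this gets the first <b> tag inside <body>:
soup.body.b
Accessing a tag name as an attribute returns only the first tag with that name.
To get every <a> tag, or anything beyond the first match for a name, use one of the methods described in Searching the tree, such as find_all():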
print(soup.p)
print(soup.find_all('a'))
Output:
<p class="title"><b>The Dormouse's story</b></p>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
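The difference between attribute access and find_all() can be sketched on a smaller document. The two-link markup below is a hypothetical snippet made up for this illustration; it is not part of the article's html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet, used only for this illustration.
markup = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

first_link = soup.a             # attribute access: the first <a> only
all_links = soup.find_all("a")  # find_all(): every <a>, in document order

print(first_link["id"])                     # link1
print([link["id"] for link in all_links])   # ['link1', 'link2']
print(soup.table)                           # None: the document has no <table>
```

Attribute access is convenient when only the first match is needed; it returns None rather than raising when no tag with that name exists.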
The .contents attribute returns a tag's direct children as a list:
tag_head = soup.head
print(tag_head)
print(tag_head.contents)
tag_title = tag_head.contents[0]
print(tag_title)
print(tag_title.contents)
Output:
<head><title>The Dormouse's story</title></head>
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
["The Dormouse's story"]
The BeautifulSoup object has children of its own, and the <html> tag is one of them:
print(len(soup.contents))
print(soup.contents[0].name)
print(soup.contents[1].name)
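Output:
2
None
html
Why does this differ from the official tutorial? There the BeautifulSoup object has a single child, <html>, but here it has two and the first one's name is None.
Answer: html_doc starts with whitespace (a blank line or space) right after the opening triple quotes. The parser keeps that whitespace as a separate string node in front of <html>, and a string has no tag name. Remove the leading whitespace and the count becomes 1.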
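A minimal sketch of this effect, assuming the built-in html.parser and two short hypothetical documents that differ only in a leading newline:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the only difference is the leading "\n".
with_newline = BeautifulSoup("\n<html><body><p>hi</p></body></html>", "html.parser")
without_newline = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

print(len(with_newline.contents))                # 2
print(repr(with_newline.contents[0]))            # '\n'
print(type(with_newline.contents[0]).__name__)   # NavigableString
print(len(without_newline.contents))             # 1
```

The extra child is an ordinary string node, so code that expects soup.contents[0] to be the <html> tag should either strip the input first or use soup.html.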
The .children attribute lets you iterate over a tag's direct children:
tag_title = soup.title
for child in tag_title.children:
    print(child)
Output:
The Dormouse's story
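The .contents and .children attributes cover only a tag's direct children.
For example, the <head> tag has a single direct child, the <title> tag.
The <title> tag has a child of its own: the string "The Dormouse's story".
That string is not a child of <head>, but it is one of its descendants.
The .descendants attribute iterates recursively over all of a tag's descendants: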
tag_head = soup.head
for child in tag_head.descendants:
    print(child)
Output:
<title>The Dormouse's story</title>
The Dormouse's story
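To make the difference between direct children and descendants concrete, the sketch below counts both for a <head> tag. It parses only the <head> fragment of the sample document so that the example stays self-contained.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

children = list(head.children)        # direct children only
descendants = list(head.descendants)  # children, grandchildren, and so on

print(len(children))               # 1: the <title> tag
print(len(descendants))            # 2: the <title> tag and the string inside it
print(head.contents == children)   # True: same nodes, list versus iterator
```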
If a tag has exactly one child and that child is a string, .string returns it:
tag_title = soup.title
print(tag_title.string)
Output:
The Dormouse's story
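The behaviour of .string depends on how many children a tag has. The sketch below uses a short hypothetical document (the <p> element is invented for the illustration) to show the three cases.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <head> with one child and a <p> with two children.
soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>"
    "<p>Hello <b>world</b></p>",
    "html.parser",
)

print(soup.title.string)  # The Dormouse's story
print(soup.head.string)   # The Dormouse's story (via its only child, <title>)
print(soup.p.string)      # None: <p> has two children, a string and a <b> tag
```

When .string is None because a tag holds several strings, iterating over the tag's strings is the usual alternative.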
If a tag contains more than one string, use .strings to loop over them:
soup = BeautifulSoup(html_doc, 'html.parser')
for string in soup.strings:
    print(repr(string))  # repr() prints each string as a Python literal, so whitespace is visible
Output:
'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'
The output above contains many strings that are only spaces or newlines. Use .stripped_strings to drop the extra whitespace:
strings made up entirely of whitespace are skipped, and leading and trailing whitespace is removed from the rest.
soup = BeautifulSoup(html_doc, 'html.parser')
for string in soup.stripped_strings:
    print(repr(string))  # repr() prints each string as a Python literal, so whitespace is visible
Output:
"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'
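One common use of .stripped_strings is to collect clean text fragments into a list or a single line. The sketch below assumes a short hypothetical paragraph rather than the full sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with stray spaces and newlines around the text.
markup = "<p>\n  Once upon a time  \n<a>Elsie</a>,\n<a>Lacie</a>\n</p>"
soup = BeautifulSoup(markup, "html.parser")

pieces = list(soup.stripped_strings)
print(pieces)            # ['Once upon a time', 'Elsie', ',', 'Lacie']
print(" ".join(pieces))  # Once upon a time Elsie , Lacie
```

Calling soup.get_text(" ", strip=True) should produce the same joined line in one step.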
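Every tag and every string has a parent: the tag that contains it.
Use the .parent attribute to get an element's parent. In the sample document, the <head> tag is the parent of the <title> tag:
tag_title = soup.title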
print(tag_title)
print(tag_title.parent)
Output:
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
The parent of a top-level tag such as <html> is the BeautifulSoup object itself:
tag_html = soup.html
print(type(tag_html.parent))
Output:
<class 'bs4.BeautifulSoup'>
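The same rule applies to strings and to the BeautifulSoup object. The sketch below parses only a <head> fragment, so in this hypothetical case <head> is itself a top-level tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")

title_text = soup.title.string
print(title_text.parent.name)    # title: a string's parent is the tag that contains it
print(soup.title.parent.name)    # head
print(soup.head.parent is soup)  # True: top-level tags hang off the BeautifulSoup object
print(soup.parent)               # None: the BeautifulSoup object has no parent
```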
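The .parents attribute iterates over all of an element's ancestors.
The example below uses .parents to walk from the first <a> tag all the way up to the root of the document:
tag_a = soup.a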
print(tag_a)
for parent in tag_a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Output:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
p
body
html
[document]
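Because .parents is iterable, it also works in comprehensions and membership tests. A minimal sketch, assuming a reduced hypothetical version of the sample document:

```python
from bs4 import BeautifulSoup

# Hypothetical reduced document with a single link.
soup = BeautifulSoup("<html><body><p><a>Elsie</a></p></body></html>", "html.parser")

ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# Check whether the link sits somewhere inside a <p> tag.
inside_paragraph = any(parent.name == "p" for parent in soup.a.parents)
print(inside_paragraph)  # True
```

The last name, [document], belongs to the BeautifulSoup object, which is always the final ancestor.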
Consider a simple document. The <b> and <c> tags are at the same level: both are direct children of the same tag, so <b> and <c> are siblings:
from bs4 import BeautifulSoup
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())
Output:
# <a>
#  <b>
#   text1
#  </b>
#  <c>
#   text2
#  </c>
# </a>
Use the .next_sibling and .previous_sibling attributes to move between siblings in the tree:
from bs4 import BeautifulSoup
brother_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(brother_soup.b.next_sibling)
print()
print(brother_soup.c.previous_sibling)
Output:
<c>text2</c>

<b>text1</b>
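At the edges of a sibling group these attributes return None. The sketch below reuses the same two-tag markup and also shows that the two strings are not siblings of each other, because they have different parents.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

print(soup.b.previous_sibling)     # None: <b> is the first child of <a>
print(soup.c.next_sibling)         # None: <c> is the last child of <a>

# "text1" lives inside <b> and "text2" inside <c>, so neither has a sibling.
print(soup.b.string.next_sibling)  # None
```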
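In a real document, .next_sibling and .previous_sibling are usually strings or whitespace. Look at the three links in the sample document:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
The .next_sibling of the first <a> tag is not the second <a> tag. It is the comma and newline that separate the first <a> tag from the second.
The .next_siblings and .previous_siblings attributes iterate over all the siblings that follow or precede the current node:
from bs4 import BeautifulSoup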
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for sibling in soup.a.next_siblings:
    print(repr(sibling))
Output:
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
Output:
' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
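When only sibling tags matter, the string nodes can be skipped. The sketch below shows two ways to do that, find_next_sibling() and an isinstance check against Tag, on a shortened hypothetical version of the story paragraph.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical shortened paragraph; the ids mirror the sample document.
markup = (
    '<p>Names: <a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>;</p>'
)
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

print(repr(first.next_sibling))             # ',\n' (a string, not a tag)
print(first.find_next_sibling("a")["id"])   # link2

# Keep only the siblings that are tags.
tag_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(tag_ids)                              # ['link2', 'link3']
```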
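The .next_element attribute points to whatever was parsed immediately afterwards, whether a string or a tag. The result can be the same as .next_sibling, but usually it is different.
tag_a_last = soup.find("a", id="link3")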
print(tag_a_last)
print("-------------")
print(tag_a_last.next_sibling)
print("-------------")
print(tag_a_last.next_element)
Output:
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
-------------
;
and they lived at the bottom of a well.
-------------
Tillie
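Here .next_sibling is a string: the rest of the sentence that was interrupted by the <a> tag.
The .next_element of the same <a> tag is whatever was parsed right after the opening <a> tag, which is not the rest of the sentence but the string "Tillie".
This is because in the original markup the string "Tillie" appears before the semicolon. The parser enters the <a> tag, then reads the string "Tillie", then the closing </a> tag, and only then the semicolon and the rest of the sentence. The semicolon is at the same level as the <a> tag, but the string "Tillie" is parsed first.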
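The parse order described above can be checked on a small fragment. The markup below is a hypothetical one-link sentence, not the article's html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link sentence.
soup = BeautifulSoup('<p><a id="link3">Tillie</a>; the end.</p>', "html.parser")
last_link = soup.a

print(repr(last_link.next_sibling))  # '; the end.' (same level as <a>)
print(repr(last_link.next_element))  # 'Tillie' (parsed right after the opening <a>)

# .next_elements keeps following the parse order until the document ends.
print(list(last_link.next_elements))  # ['Tillie', '; the end.']
```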
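The .previous_element attribute is the exact opposite of .next_element: it points to whatever was parsed immediately before the current object.
Copyright notice: this article is a repost, and copyright belongs to the original author.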
Original link: https://blog.csdn.net/S_numb/article/details/120201125