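Beautiful Soup is a flexible, convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.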
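Install it from the command line: pip install beautifulsoup4
Alternatively, in PyCharm, search for beautifulsoup4 on the third-party package installation page and install it from there.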
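To confirm that the installation worked, a quick check like the following can help. It is a minimal sketch that assumes the package was installed into the interpreter you are running; the one-line sample markup is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# The package is installed as "beautifulsoup4" but imported as "bs4"
print(bs4.__version__)

# html.parser ships with Python, so this needs no extra parser package
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)
```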
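Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python, decent speed, lenient with malformed documents | Less lenient in old Python versions (before 2.7.3 / 3.2.2) |
lxml HTML parser | BeautifulSoup(markup, 'lxml') | Very fast, lenient with malformed documents | Requires an external C library |
lxml XML parser | BeautifulSoup(markup, 'xml') | Very fast, the only XML parser supported | Requires an external C library |
html5lib | BeautifulSoup(markup, 'html5lib') | Most lenient, parses pages the way a browser does, produces valid HTML5 | Slow; requires the separate html5lib package (pure Python, no C extension) |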
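Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below shows one way to do that; `make_soup` and its `parsers` argument are hypothetical names introduced here, not part of Beautiful Soup. It relies on `FeatureNotFound`, the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is missing
            continue
    raise RuntimeError("none of the requested parsers is available")

# Deliberately unclosed tags: every parser in the table repairs them
soup, used = make_soup("<p>Hi <b>there")
print(used)
print(soup.b.string)
```

Keep in mind that different parsers can build different trees from the same broken markup, so code that must behave identically everywhere is safer naming one parser explicitly.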
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
# Get the title tag
print(soup.title)
print(type(soup.title))
# Get the head tag
print(soup.head)
# Get the first p tag
print(soup.p)
print(soup.title.name) # 'title'
print(soup.a.attrs['href']) # 'http://example.com/elsie'
print(soup.p.string) # The Dormouse's story
print(soup.head.title.string)
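The name, attribute, and string lookups above have a few details worth seeing side by side. The following is a small sketch on a made-up fragment (the `doc` string is an assumption, loosely modelled on the sample document) that contrasts `tag[...]` with `tag.get(...)` and shows what `.string` returns when the only child is an HTML comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment used only for this illustration
doc = '<p class="story intro"><a href="http://example.com/elsie" id="link1"><!-- Elsie --></a></p>'
soup = BeautifulSoup(doc, "html.parser")
link = soup.a

href = link["href"]          # same value as link.attrs["href"]; KeyError if absent
missing = link.get("title")  # .get() returns None instead of raising
classes = soup.p["class"]    # class is multi-valued, so this is a list

# The only child of <a> is a comment, so .string is a Comment object
inner = link.string
is_comment = isinstance(inner, Comment)

print(href, missing, classes, repr(inner), is_comment)
```

Checking for `Comment` before using `.string` avoids treating commented-out markup as visible text.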
print(soup.p.contents)
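# Direct children of <body>
for x in soup.body.children:
    print('x:', x)
# All descendants of <body>, in document order
for x in soup.body.descendants:
    print('x:', x)
# Immediate parent of the first <b> tag
print(soup.b.parent)
# Every ancestor of the first <b> tag, innermost first
for x in soup.b.parents:
    print('x:', x)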
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
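The navigation attributes used above are easier to compare on a tiny tree. The sketch below uses a made-up one-line document (an assumption, written without whitespace between tags so that no whitespace-only text nodes appear) and collects tag names rather than printing whole subtrees.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document used only for this illustration
doc = "<div><p>one</p><p>two <b>bold</b></p><span>three</span></div>"
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

# .children yields direct children only
child_names = [c.name for c in div.children]

# .descendants also yields text nodes, so keep only Tag objects here
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]

# .parents walks upward and ends at the BeautifulSoup object itself
ancestor_names = [p.name for p in soup.b.parents]

# Sibling generators move forward or backward from the starting node
after_first_p = [s.name for s in soup.p.next_siblings]
before_span = [s.name for s in soup.span.previous_siblings]

print(child_names, descendant_tags, ancestor_names, after_first_p, before_span)
```

In a document that does contain line breaks between tags, the sibling and children generators will also yield those whitespace strings, which is why real code often filters on `Tag`.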
find_all returns every match; use find instead to get only the first match.
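# These calls assume soup was built from a document that contains <ul>/<li> lists
# with ids such as list-1 and the class element
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
# string= is the current name of the older text= argument
print(soup.find_all(string='Foo'))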
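The find_all calls above refer to ids and classes such as list-1 and element. The sketch below supplies a sample document that matches those names so the calls can be run end to end; the markup is an assumption reconstructed from the identifiers used in the calls, not the author's original.

```python
from bs4 import BeautifulSoup

# Hypothetical document: ids and classes chosen to match the calls in the text
doc = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

lists = soup.find_all("ul")                           # every <ul>
first_list = soup.find("ul")                          # only the first <ul>
by_id = soup.find_all(id="list-1")
by_name = soup.find_all(attrs={"name": "elements"})   # name= is taken, so use attrs
by_class = soup.find_all(class_="element")            # class is a keyword, hence class_
foo_strings = soup.find_all(string="Foo")             # matches text nodes, not tags
nothing = soup.find("table")                          # find returns None on no match

print(len(lists), first_list["id"], len(by_class), foo_strings, nothing)
```

Note that `find_all` returns an empty list when nothing matches, while `find` returns `None`, so chained calls after `find` need a guard.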
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
select_one returns only the first tag matched by the selector.
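The difference between select and select_one can be seen in a short run. This is a minimal sketch on a made-up document (an assumption, with class names and ids chosen to match the selectors used above); it also assumes a Beautiful Soup version recent enough to bundle CSS selector support.

```python
from bs4 import BeautifulSoup

# Hypothetical document: ids and classes chosen to match the selectors in the text
doc = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

headings = soup.select(".panel .panel-heading")   # list of matching tags
items = soup.select("ul li")                      # every <li> inside any <ul>
second_list = soup.select("#list-2 .element")     # scoped by id, then class

first_item = soup.select_one("ul li")             # a single Tag, not a list
no_match = soup.select_one("#no-such-id")         # None when nothing matches

texts = [li.get_text() for li in second_list]
print(len(headings), len(items), texts, first_item.get_text(), no_match)
```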
Original link: https://blog.csdn.net/Lemon_Review/article/details/121755162