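Is there a way in Python to parse HTML documents the way jQuery does? That is, I'd like to use CSS selector syntax to grab an arbitrary set of nodes from a document and read their content, attributes, and so on.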
2j4z5cfb1#
If you are familiar with BeautifulSoup, you can add soupselect to your toolkit. Soupselect is a CSS selector extension for BeautifulSoup. Usage:
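    from urllib.request import urlopen

    from bs4 import BeautifulSoup as Soup
    from soupselect import select

    # Fetch the page and parse it (in Python 3, urlopen lives in urllib.request)
    soup = Soup(urlopen('http://slashdot.org/'), 'html.parser')

    # Returns a list of the matching elements
    select(soup, 'div.title h3')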
vom3gejh2#
Consider PyQuery: http://packages.python.org/pyquery/
    >>> from pyquery import PyQuery as pq
    >>> from lxml import etree
    >>> import urllib.request
    >>> d = pq("<html></html>")
    >>> d = pq(etree.fromstring("<html></html>"))
    >>> d = pq(url='http://google.com/')
    >>> d = pq(url='http://google.com/', opener=lambda url, **kw: urllib.request.urlopen(url).read())
    >>> d = pq(filename=path_to_html_file)
    >>> d("#hello")
    [<p#hello.hello>]
    >>> p = d("#hello")
    >>> p.html()
    'Hello world !'
    >>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
    [<p#hello.hello>]
    >>> p.html()
    'you know <a href="http://python.org/">Python</a> rocks'
    >>> p.text()
    'you know Python rocks'
tjjdgumg3#
The lxml library supports CSS selectors.
rvpgvaaj4#
BeautifulSoup supports CSS selectors:

    import requests
    from bs4 import BeautifulSoup as Soup

    html = requests.get('https://stackoverflow.com/questions/3051295').content
    soup = Soup(html, 'html.parser')

Question title:

    soup.select('h1.grid--cell :first-child')[0].text

Question upvote count:

    # first item
    soup.select_one('[itemprop="upvoteCount"]').text
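Because the selectors above depend on the live page's markup, which may change over time, the following sketch shows the same `select` / `select_one` calls against a small inline document instead. The sample HTML, class names, and variable names are hypothetical and exist only for this example.

```python
# Requires the third-party beautifulsoup4 package
# (assumption: installed with `pip install beautifulsoup4`)
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
SAMPLE = """
<div id="question">
  <h1 class="title"><a href="/q/1">How do I parse HTML?</a></h1>
  <span itemprop="upvoteCount">42</span>
  <ul class="tags"><li>python</li><li>jquery</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# select() returns every match as a list
tags = [li.get_text() for li in soup.select("ul.tags > li")]

# select_one() returns the first match, or None when nothing matches
title_link = soup.select_one("h1.title a")
title = title_link.get_text()
href = title_link["href"]
upvotes = int(soup.select_one('[itemprop="upvoteCount"]').get_text())
missing = soup.select_one("div.does-not-exist")

print(title, href, upvotes, tags, missing)
```

Checking the result of `select_one()` for `None` before reading `.text` avoids an `AttributeError` when a selector stops matching.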