BeautifulSoup innerhtml?

Posted by 5anewei6 on 2022-11-27 in Other


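Let's say I have a page with a div. I can easily get that div with soup.find().
Now that I have the result, I'd like to print the whole innerhtml of that div: I mean, I need a string with all the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?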

Html

Source: https://stackoverflow.com/questions/8112922/beautifulsoup-innerhtml

8 Answers


axzmvihb1#

TL;DR

In BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string.

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()
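As a quick illustration of the difference between the two methods, here is a minimal sketch. The sample markup and the choice of the built-in html.parser are assumptions made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, parsed with the standard-library parser
html = '<div id="box"><p>Hello <b>world</b></p> tail</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

inner_bytes = div.encode_contents()  # UTF-8 encoded bytes
inner_str = div.decode_contents()    # Python str

print(inner_bytes)
print(inner_str)
```

On Python 3 the first value is a bytes object and the second is a str, so decode_contents() is usually the closer match to JavaScript's innerHTML.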

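These functions aren't currently in the online documentation, so I'll quote the current function definitions and the docstrings from the code.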

`encode_contents` - since 4.0.4

def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param encoding: The bytestring will be in this encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

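See also the documentation on formatters; you'll most likely use either formatter="minimal" (the default) or formatter="html" (for HTML entities), unless you want to process the text manually in some way.
encode_contents returns an encoded bytestring. If you want a Python Unicode string, use decode_contents instead.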
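To make the formatter choice concrete, the sketch below renders the same tag with both settings. The sample paragraph is an assumption chosen because it contains both a non-ASCII character and an ampersand.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a non-ASCII letter and an escaped ampersand
soup = BeautifulSoup("<p>caf\u00e9 &amp; bar</p>", "html.parser")
p = soup.p

# "minimal" only escapes what is needed for valid markup (&, <, >)
minimal = p.decode_contents(formatter="minimal")

# "html" additionally turns characters into named HTML entities where it can
as_html = p.decode_contents(formatter="html")

print(minimal)
print(as_html)
```

With "minimal" the accented letter is left as a Unicode character, while "html" emits it as a named entity.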

`decode_contents` - since 4.0.1

decode_contents does the same thing as encode_contents, but returns a Python Unicode string instead of an encoded bytestring.

def decode_contents(self, indent_level=None,
                   eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                   formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param eventual_encoding: The tag is destined to be
       encoded into this encoding. This method is _not_
       responsible for performing that encoding. This information
       is passed in so that it can be substituted in if the
       document contains a <META> tag that mentions the document's
       encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

BeautifulSoup 3

BeautifulSoup 3 doesn't have the functions above; instead it has renderContents:

def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.


kxe2p93d2#

One option could be something like this:

innerhtml = "".join([str(x) for x in div_element.contents])
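The one-liner above assumes div_element already exists. A self-contained sketch follows; the markup is made up for illustration, and decode_contents() is included only as a cross-check.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
soup = BeautifulSoup('<div class="wrap"><p>one</p><p>two</p></div>', "html.parser")
div_element = soup.find("div")

# Serialize each direct child and concatenate the pieces
innerhtml = "".join(str(x) for x in div_element.contents)

print(innerhtml)
print(innerhtml == div_element.decode_contents())
```

For simple markup like this, the joined string and decode_contents() agree.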


8wtpewkr3#

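Given a BS4 soup element like <div id="outer"><div id="inner">foobar</div></div>, here are some methods and attributes that can be used to retrieve its HTML and text in different ways, along with an example of what each one returns.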

Inner HTML:

inner_html = element.encode_contents()

b'<div id="inner">foobar</div>'

Outer HTML:

outer_html = str(element)

'<div id="outer"><div id="inner">foobar</div></div>'

Outer HTML (prettified):

pretty_outer_html = element.prettify()

'''<div id="outer">
 <div id="inner">
  foobar
 </div>
</div>'''

Text only (using .text):

element_text = element.text

'foobar'

Text only (using .string):

element_string = element.string

'foobar'
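The snippets above can be put together into one runnable sketch. It uses decode_contents() so that the inner HTML comes back as a str, and it adds a second, made-up element to show where .text and .string differ; both of those choices are assumptions of this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="outer"><div id="inner">foobar</div></div>', "html.parser")
element = soup.find(id="outer")

inner_html = element.decode_contents()  # inner HTML as str
outer_html = str(element)               # outer HTML as str
element_text = element.text             # all descendant text, concatenated
element_string = element.string         # the single descendant string, if there is exactly one

# Hypothetical second element with two children: .string gives up, .text does not
mixed = BeautifulSoup("<div>a<b>b</b></div>", "html.parser").div
mixed_string = mixed.string
mixed_text = mixed.text

print(inner_html, outer_html, element_text, element_string, mixed_string, mixed_text)
```

.string returns None as soon as a tag has more than one child, so .text (or get_text()) is the safer choice when only the text is wanted.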


7uhlpewt4#

str(element) gives you the outerHTML; you can then strip the outer tag from that string.
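One way this could be done is with plain string slicing, as in the rough sketch below. The slicing approach is an assumption of this example rather than something the answer spelled out, and it is fragile: it breaks if an attribute value in the opening tag contains ">" or if the element has no closing tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="outer"><div id="inner">foobar</div></div>', "html.parser")
element = soup.find(id="outer")

outer_html = str(element)

# Drop everything up to the end of the opening tag ...
start = outer_html.index(">") + 1
# ... and everything from the start of the final closing tag
end = outer_html.rindex("</")

inner_html = outer_html[start:end]
print(inner_html)
```

Given those caveats, decode_contents() is likely the more robust route when it is available.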


iqih9akk5#

How about just unicode(x) (str(x) on Python 3)? It seems to work for me.

**Edit:** This will give you the outer HTML, not the inner HTML.


g6baxovj6#

The simplest way is to use the children attribute.

inner_html = soup.find('body').children

It returns an iterator rather than a list, so you can get the full markup with a simple for loop.

for html in inner_html:
    print(html)
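If a single string is needed rather than printed pieces, the children can be joined, as in this sketch. The sample document is an assumption; note that .children can only be consumed once.

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration
soup = BeautifulSoup("<body><h1>Title</h1><p>Text</p></body>", "html.parser")

children = soup.find("body").children

# Consume the iterator once, serializing each direct child
inner_html = "".join(str(child) for child in children)

# The iterator is now exhausted, so a second pass yields nothing
leftover = list(children)

print(inner_html)
print(leftover)
```

Use .contents instead of .children if the child nodes need to be walked more than once.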


htrmnn0y7#

If I haven't misunderstood, you mean that for an example like this:

<div class="test">
    text in body
    <p>Hello World!</p>
</div>

the output should look like this:

text in body
    <p>Hello World!</p>

Then this is your answer:

''.join(map(str,tag.contents))
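Running that expression against the sample markup from this answer might look like the sketch below. The parser choice and the final strip() are assumptions of this example; the raw result keeps the newlines and indentation that surround the children.

```python
from bs4 import BeautifulSoup

html = '''<div class="test">
    text in body
    <p>Hello World!</p>
</div>'''

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="test")

inner = "".join(map(str, tag.contents))

# Whitespace around the children is part of the inner HTML; strip it if unwanted
trimmed = inner.strip()

print(repr(inner))
print(trimmed)
```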


qyuhtwio8#

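BeautifulSoup 4 get_text()
If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string: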

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
'\nI linked to example.com\n'
soup.i.get_text()
'example.com'

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']

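As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be 'text', since those tags are not part of the human-visible content of the page.
See: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text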
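Note that get_text() answers a different question than innerHTML does. The sketch below, which assumes Beautiful Soup 4.9.0 or later and uses made-up markup, contrasts the two on an element that contains script and style tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing visible and non-visible content
markup = "<div><p>Visible</p><script>var hidden = 1;</script><style>p {}</style></div>"
soup = BeautifulSoup(markup, "html.parser")

# Human-readable text only; script and style contents are skipped (4.9.0+)
text = soup.div.get_text()

# Inner HTML keeps every tag, including script and style
inner = soup.div.decode_contents()

print(text)
print(inner)
```

So get_text() is the right tool when only the visible text matters, while decode_contents() is the one that behaves like innerHTML.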
