python Beautiful Soup - get all the text, but keep the link URLs from the HTML?

z4bn682m · posted 2023-03-07 in Python

I'm using Beautiful Soup to parse multiple HTML pages. For the most part it works fine, but I want to include each link's URL alongside the text. The current code is:

soup = MyBeautifulSoup(''.join(body), 'html.parser')
body_text = self.remove_newlines(soup.get_text())
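For context, plain `get_text()` keeps only the text nodes, so the `href` values vanish entirely; a minimal illustration (the sample HTML here is invented):

```python
from bs4 import BeautifulSoup

# Plain get_text() returns only the text nodes; href values are dropped.
soup = BeautifulSoup('<p>See <a href="https://example.com">here</a>!</p>',
                     'html.parser')
print(soup.get_text())  # See here!
```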

I found a suggestion online to override the `_all_strings` method:

from bs4 import BeautifulSoup, CData, NavigableString, Tag

class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it

            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str('<{}> '.format(descendant.get('href', '')))

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                    (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

However, this raises a runtime error:

in _all_strings
    (types is not None and type(descendant) not in types)):
TypeError: argument of type 'object' is not iterable

Is there a way around this error?
Thanks!

u5i3ibmn #1

  • Try changing `and type(descendant) not in types` to `and not isinstance(descendant, types)`

With the change you suggested, I get the same error:

in _all_strings
    (types is not None and type(descendant) not in types)):
TypeError: argument of type 'object' is not iterable

But after that change the error should look different: the line quoted in the traceback (`types is not None and type(descendant) not in types`) is exactly the line I suggested changing...
In any case, I dug into this further. The change I suggested would only have traded one error for another; the main problem is that `get_text` calls `_all_strings` with `types=default`, and that `default` sentinel has no meaning in this new subclass (which is why you see `<class 'object'>` in the message).
You can check `types` at the start of the method and make sure it is a class or a tuple of classes:
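To see why the sentinel blows up, note that bs4's `get_text` passes a plain `object()` instance as the `types` default, and the `in` operator on a bare object raises exactly this error (a standalone sketch of the failure mode, not bs4 itself):

```python
# bs4 uses a plain object() as a "use the default" sentinel for types;
# the overridden _all_strings never replaces it, so `in` fails on it.
types = object()
try:
    type('x') not in types
except TypeError as e:
    print(e)  # argument of type 'object' is not iterable
```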

from bs4 import BeautifulSoup, CData, NavigableString, Tag

class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        # verify types [ADDED]
        if hasattr(types,'__iter__'):
            types = tuple([t for t in types if isinstance(t,type)])
        if types is None: types = NavigableString
        if not types or not isinstance(types, (type, tuple)):
            types = (NavigableString, CData)

        for descendant in self.descendants:
            # return "a" string representation if we encounter it

            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str('<{}> '.format(descendant.get('href', '')))

            # skip an inner text node inside "a"
            if isinstance(descendant,NavigableString) and descendant.parent.name == 'a':
                continue
            
            if not isinstance(descendant, types): continue # default behavior [EDITED]

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
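A quick sanity check of the patched class (the subclass is repeated so the snippet runs on its own; the sample HTML is invented):

```python
from bs4 import BeautifulSoup, CData, NavigableString, Tag

# Same subclass as above, repeated so this snippet is self-contained.
class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        # normalize types: get_text may pass bs4's object() sentinel
        if hasattr(types, '__iter__'):
            types = tuple(t for t in types if isinstance(t, type))
        if types is None:
            types = NavigableString
        if not types or not isinstance(types, (type, tuple)):
            types = (NavigableString, CData)

        for descendant in self.descendants:
            # emit the href when we hit an "a" tag
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield '<{}> '.format(descendant.get('href', ''))
            # skip the text node directly inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue
            if not isinstance(descendant, types):
                continue
            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

soup = MyBeautifulSoup('<p>See <a href="https://example.com">here</a>!</p>',
                       'html.parser')
print(soup.get_text())  # See <https://example.com> !
```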

However, not honoring `default` as `types` also means that JavaScript and CSS are no longer excluded. So [personally] I would rather write the new class with two entirely new methods, and use those:

from urllib.parse import urljoin  # needed when srcUrl is passed to get_text_plus
from bs4 import BeautifulSoup, NavigableString, Tag

class MyBeautifulSoup(BeautifulSoup):
    def _all_strings_plus(  self, strip=True, types=NavigableString, 
                            aRef={'a': lambda a: f"<{a.get('href', '')}>"}, 
                            skipTags=['script', 'style']    ):
        # verify types
        if hasattr(types,'__iter__') and not isinstance(types,type):
            types = tuple([t for t in types if isinstance(t, type)])
        if not (types and isinstance(types,(type,tuple))): types = NavigableString
        
        # skip text in tags included in aRef
        # skipTags += list(aRef.keys())
        
        for descendant in self.descendants:
            # yield extra strings according to aRef
            if isinstance(descendant, Tag) and descendant.name in aRef:
                extraStr = aRef[descendant.name](descendant)
                if isinstance(extraStr, str): yield extraStr

            # skip text nodes DIRECTLY inside a Tag in aRef
            # if descendant.parent.name in aRef: continue

            # skip ALL text nodes inside skipTags 
            if skipTags and descendant.find_parent(skipTags): continue

            # default behavior
            if not isinstance(descendant, types): continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0: continue
            yield descendant
    
    def get_text_plus(self, separator=" ", srcUrl=None, **aspArgs):
        if srcUrl and isinstance(srcUrl, str):
            def hrefStr(aTag):
                href = aTag.get('href')
                if not (href is None or href.startswith('javascript:')):
                    return f"<{urljoin(srcUrl, href)}>"
            aspArgs.setdefault('aRef', {})
            aspArgs['aRef']['a'] = hrefStr
        
        return separator.join(self._all_strings_plus(**aspArgs))

This also lets you specify a source URL [as in `soup.get_text_plus(srcUrl='https://example.com/')`] so that relative paths can be converted to full links, and JavaScript links are skipped (if you don't mind keeping those, remove the `or href.startswith('javascript:')` part of the condition).
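The relative-to-absolute conversion relies entirely on the standard library's `urljoin`, which resolves an href against the page it came from (example URLs invented):

```python
from urllib.parse import urljoin

# A relative href is resolved against the source page's directory.
print(urljoin('https://example.com/news/page.html', 'story.html'))
# https://example.com/news/story.html

# A root-relative href is resolved against the site root.
print(urljoin('https://example.com/news/page.html', '/about'))
# https://example.com/about

# An absolute href passes through unchanged.
print(urljoin('https://example.com/news/page.html', 'https://other.org/x'))
# https://other.org/x
```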
By the way, I've commented out the lines that omit strings inside `a` tags [or rather, inside any tag included in `aRef`], but you can uncomment them as needed; note that if you uncomment the `skipTags += list(aRef.keys())` line, the `if descendant.parent.name in aRef: continue` line becomes redundant.
