python Beautiful Soup - get all the text, but keep the link URLs from the HTML?

z4bn682m · posted 2023-03-07 in Python

I'm using Beautiful Soup to parse multiple HTML pages. For the most part it works fine, but I want to include each link's URL alongside the text. The current code is:

soup = MyBeautifulSoup(''.join(body), 'html.parser')
body_text = self.remove_newlines(soup.get_text())
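For context, plain `get_text()` keeps only the text nodes, so the `href` values vanish entirely; a minimal illustration (the sample HTML here is invented):

```python
from bs4 import BeautifulSoup

# Plain get_text() returns only the text nodes; href values are dropped.
soup = BeautifulSoup('<p>See <a href="https://example.com">here</a>!</p>',
                     'html.parser')
print(soup.get_text())  # See here!
```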

I found a suggestion online to override the `_all_strings` method:

from bs4 import BeautifulSoup, CData, NavigableString, Tag

class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it

            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str('<{}> '.format(descendant.get('href', '')))

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                    (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

However, this raises a runtime error:

in _all_strings
    (types is not None and type(descendant) not in types)):
TypeError: argument of type 'object' is not iterable

Is there a way around this error?
Thanks!

u5i3ibmn #1

  • Try changing `and type(descendant) not in types` to `and not isinstance(descendant, types)`

With the change you suggested, I get the same error:

in _all_strings
    (types is not None and type(descendant) not in types)):
TypeError: argument of type 'object' is not iterable

But after that change the error should look different: the line quoted in the traceback (`types is not None and type(descendant) not in types`) is exactly the line I suggested changing...
In any case, I dug into this further. The change I suggested would only have traded one error for another; the main problem is that `get_text` calls `_all_strings` with `types=default`, and that `default` sentinel has no meaning in this new subclass (which is why you see `<class 'object'>` in the message).
You can check `types` at the start of the method and make sure it is a class or a tuple of classes:
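To see why the sentinel blows up, note that bs4's `get_text` passes a plain `object()` instance as the `types` default, and the `in` operator on a bare object raises exactly this error (a standalone sketch of the failure mode, not bs4 itself):

```python
# bs4 uses a plain object() as a "use the default" sentinel for types;
# the overridden _all_strings never replaces it, so `in` fails on it.
types = object()
try:
    type('x') not in types
except TypeError as e:
    print(e)  # argument of type 'object' is not iterable
```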

from bs4 import BeautifulSoup, CData, NavigableString, Tag

class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        # verify types [ADDED]
        if hasattr(types,'__iter__'):
            types = tuple([t for t in types if isinstance(t,type)])
        if types is None: types = NavigableString
        if not types or not isinstance(types, (type, tuple)):
            types = (NavigableString, CData)

        for descendant in self.descendants:
            # return "a" string representation if we encounter it

            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str('<{}> '.format(descendant.get('href', '')))

            # skip an inner text node inside "a"
            if isinstance(descendant,NavigableString) and descendant.parent.name == 'a':
                continue
            
            if not isinstance(descendant, types): continue # default behavior [EDITED]

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
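A quick sanity check of the patched class (the subclass is repeated so the snippet runs on its own; the sample HTML is invented):

```python
from bs4 import BeautifulSoup, CData, NavigableString, Tag

# Same subclass as above, repeated so this snippet is self-contained.
class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        # normalize types: get_text may pass bs4's object() sentinel
        if hasattr(types, '__iter__'):
            types = tuple(t for t in types if isinstance(t, type))
        if types is None:
            types = NavigableString
        if not types or not isinstance(types, (type, tuple)):
            types = (NavigableString, CData)

        for descendant in self.descendants:
            # emit the href when we hit an "a" tag
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield '<{}> '.format(descendant.get('href', ''))
            # skip the text node directly inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue
            if not isinstance(descendant, types):
                continue
            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

soup = MyBeautifulSoup('<p>See <a href="https://example.com">here</a>!</p>',
                       'html.parser')
print(soup.get_text())  # See <https://example.com> !
```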

However, not honoring `default` as `types` also means that JavaScript and CSS are no longer excluded. So [personally] I would rather write the new class with two entirely new methods, and use those:

from urllib.parse import urljoin  # needed when srcUrl is passed to get_text_plus
from bs4 import BeautifulSoup, NavigableString, Tag

class MyBeautifulSoup(BeautifulSoup):
    def _all_strings_plus(  self, strip=True, types=NavigableString, 
                            aRef={'a': lambda a: f"<{a.get('href', '')}>"}, 
                            skipTags=['script', 'style']    ):
        # verify types
        if hasattr(types,'__iter__') and not isinstance(types,type):
            types = tuple([t for t in types if isinstance(t, type)])
        if not (types and isinstance(types,(type,tuple))): types = NavigableString
        
        # skip text in tags included in aRef
        # skipTags += list(aRef.keys())
        
        for descendant in self.descendants:
            # yield extra strings according to aRef
            if isinstance(descendant, Tag) and descendant.name in aRef:
                extraStr = aRef[descendant.name](descendant)
                if isinstance(extraStr, str): yield extraStr

            # skip text nodes DIRECTLY inside a Tag in aRef
            # if descendant.parent.name in aRef: continue

            # skip ALL text nodes inside skipTags 
            if skipTags and descendant.find_parent(skipTags): continue

            # default behavior
            if not isinstance(descendant, types): continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0: continue
            yield descendant
    
    def get_text_plus(self, separator=" ", srcUrl=None, **aspArgs):
        if srcUrl and isinstance(srcUrl, str):
            def hrefStr(aTag):
                href = aTag.get('href')
                if not (href is None or href.startswith('javascript:')):
                    return f"<{urljoin(srcUrl, href)}>"
            aspArgs.setdefault('aRef', {})
            aspArgs['aRef']['a'] = hrefStr
        
        return separator.join(self._all_strings_plus(**aspArgs))

This also lets you specify a source URL [as in `soup.get_text_plus(srcUrl='https://example.com/')`] so that relative paths can be converted to full links, and JavaScript links are skipped (if you don't mind keeping those, remove the `or href.startswith('javascript:')` part of the condition).
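The relative-to-absolute conversion relies entirely on the standard library's `urljoin`, which resolves an href against the page it came from (example URLs invented):

```python
from urllib.parse import urljoin

# A relative href is resolved against the source page's directory.
print(urljoin('https://example.com/news/page.html', 'story.html'))
# https://example.com/news/story.html

# A root-relative href is resolved against the site root.
print(urljoin('https://example.com/news/page.html', '/about'))
# https://example.com/about

# An absolute href passes through unchanged.
print(urljoin('https://example.com/news/page.html', 'https://other.org/x'))
# https://other.org/x
```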
By the way, I've commented out the lines that omit strings inside `a` tags [or rather, inside any tag included in `aRef`], but you can uncomment them as needed; note that if you uncomment the `skipTags += list(aRef.keys())` line, the `if descendant.parent.name in aRef: continue` line becomes redundant.
