How to scrape an element out of a table with BeautifulSoup?

Posted by 6ojccjat on 2023-11-15 in Other

I am trying to extract the content on the right-hand side of this page:
https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=idn%3D1173921214

When we look at the HTML, the information is stored in a table.

With my code snippet, I cannot get to the text I want:

import requests
from bs4 import BeautifulSoup

def getDescriptionDNB():
    description = 'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
    response = requests.get(description)
    soupedDescription = BeautifulSoup(response.content, "html.parser")
    text = soupedDescription.find(class_="amount").text
    if text == "Treffer 1 von 1":
        autor = soupedDescription.find_all("tr")
        for i in autor:
            # Only reaches the first <td> after each <tr>, not the value cell
            test = i.findNext("td").text
            print(test)

The problem is that I don't know how to get down to the inner <td> tags to retrieve the information I want.
Do you have any idea how I can solve this?

Source: https://stackoverflow.com/questions/77432582/how-to-scrape-element-with-beautifulsoup-out-of-a-table

2 Answers

aiazj4mn1#

The main problem is that the page's HTML is broken: some tr elements contain no td, and some tags are never closed.

Try to select your elements more specifically, or store the information in a dict and pick values by key.
Creating a dict with CSS selectors:

...
dict(
    row.get_text(':',strip=True).split(':',1) 
    for row in soup.select('tr:has(td:not([colspan]))')
)
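
To make the selector approach easier to try offline, here is a minimal, self-contained sketch. The `SAMPLE_HTML` string and the `table_to_dict` helper are assumptions made up for illustration; the real DNB page has more rows and messier markup, so treat this as a starting point rather than a drop-in solution.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the DNB record table.
SAMPLE_HTML = """
<table id="fullRecordTable">
  <tr><td colspan="2">Treffer 1 von 1</td></tr>
  <tr><td><strong>Titel</strong></td><td>Learning English</td></tr>
  <tr><td><strong>Ausgabe</strong></td><td>1. Aufl., 1. Dr.</td></tr>
  <tr><td><strong>Zeitliche Einordnung</strong></td><td>Erscheinungsdatum: 1997</td></tr>
</table>
"""


def table_to_dict(html: str) -> dict:
    """Map the first cell of each row to the rest of the row's text."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    # Keep only rows that contain at least one <td> without a colspan
    # attribute, which skips header-like rows spanning the whole table.
    for row in soup.select("tr:has(td:not([colspan]))"):
        # Join the row's text fragments with ':' and split at the first one.
        parts = row.get_text(":", strip=True).split(":", 1)
        if len(parts) == 2:  # ignore rows that yield no key/value pair
            key, value = parts
            record[key] = value
    return record


record = table_to_dict(SAMPLE_HTML)
print(record["Titel"])
```

Splitting at the first ':' keeps colons inside the value intact (as in 'Erscheinungsdatum: 1997'), but it would cut a key that itself contains a colon, so check the keys you get back.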

Creating a dict with pandas.read_html():

import pandas as pd

url = 'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
# Take the first table, drop incomplete rows, use column 0 as keys and column 1 as values
pd.read_html(url)[0].dropna().set_index(0)[1].to_dict()

Output, based on the URL from your snippet:

{'Link zu diesem Datensatz': 'https://d-nb.info/94985462X',
 'Titel': 'Learning English - Password red:Teil: Reformierte Rechtschreibung / 3. / [Hauptw.].',
 'Ausgabe': '1. Aufl., 1. Dr.',
 'Verlag': 'Stuttgart ; Düsseldorf ; Leipzig : Klett',
 'Zeitliche Einordnung': 'Erscheinungsdatum: 1997',
 'Umfang/Format': '172 S. ; 25 cm',
 'ISBN/Einband/Preis': '978-3-12-546630-2 Pp. : DM 29.60:3-12-546630-X Pp. : DM 29.60:3-12-54663-0 (falsch) Pp. : DM 29.60',
 'Sprache(n)': 'Englisch (eng), Deutsch (ger)',
 'Frankfurt': 'Signatur: 1997 A 10551:Bereitstellung  in Frankfurt',
 'Leipzig': 'Signatur: 1997 A 10551:Bereitstellung  in Leipzig'}


o8x7eapl2#

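You need to separate the key/value pairs, as was already pointed out. Sticking with BeautifulSoup (your tool of choice), inside your loop over the rows: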

for i in autor:
    # Only rows with exactly two cells hold a key/value pair
    teilen = i.find_all('td')
    if len(teilen) == 2:
        print(teilen[0].text.strip(), ' : ', teilen[1].text.strip())

There are a few other things you can improve yourself. For instance, instead of selecting every 'tr' in the document, select the table first:

table id="fullRecordTable"

and then go on to select the rows ('tr') inside it.
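
A rough sketch of that idea, combining the table lookup with the two-cell check above. The `SAMPLE_HTML` string and the `record_pairs` helper are hypothetical and only stand in for the real page; the only detail taken from the answer is the table id `fullRecordTable`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the id "fullRecordTable" comes from the answer.
SAMPLE_HTML = """
<table id="otherTable">
  <tr><td>ignored</td><td>row</td></tr>
</table>
<table id="fullRecordTable">
  <tr><td colspan="2">Treffer 1 von 1</td></tr>
  <tr><th>no td here</th></tr>
  <tr><td>Titel</td><td>Learning English</td></tr>
  <tr><td>Verlag</td><td> Stuttgart : Klett </td></tr>
</table>
"""


def record_pairs(html, table_id="fullRecordTable"):
    """Return (key, value) tuples from the rows of one specific table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:  # the table is missing, e.g. no search hit
        return []
    pairs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Skip rows without exactly two cells (spanning rows, rows with no td).
        if len(cells) == 2:
            pairs.append((cells[0].get_text(strip=True),
                          cells[1].get_text(strip=True)))
    return pairs


pairs = record_pairs(SAMPLE_HTML)
for key, value in pairs:
    print(key, ":", value)
```

Restricting the search to one table keeps rows from unrelated tables on the page out of the result.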
