How to get the text of the latest post with BeautifulSoup select()

Posted by zvokhttg on 2021-09-08 in Java

I want to get the text of the latest post using BeautifulSoup and the select() method.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = "https:// " 
req = requests.get(url, headers=headers)
html = req.text       
soup = BeautifulSoup(html, 'html.parser')                
link = soup.select('#flagList > div.clear.ab-webzine > div > a')       
title = soup.select('#flagList > div.clear.ab-webzine > div > div.wz-item-header > a > span')         
latest_link = link[0]['href']  # URL of the latest post, read from the tag's href attribute
latest_title = title[0].text # title of latest post

# to get the text of latest post

t_url = latest_link
t_req = requests.get(t_url, headers=headers)
t_html = t_req.text
t_soup = BeautifulSoup(t_html, 'html.parser')  
maintext = t_soup.select ('#flagArticle > div.document_1234567_0.rhymix_content.xe_content')

print(maintext)
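
One thing to watch in the snippet above: select() returns Tag objects, so the URL has to be read from the tag's href attribute before it is handed to requests.get. The following is a minimal sketch of that step, assuming the links on the list page may be relative. The base URL https://example.com/board/, the sample markup, and the helper name latest_post_url are hypothetical stand-ins, since the real site is private.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and list markup; the real site is private.
BASE_URL = "https://example.com/board/"
LIST_HTML = '''
<div id="flagList">
  <div class="clear ab-webzine">
    <div><a href="/board/1234567">post</a></div>
    <div><a href="/board/1234566">post</a></div>
  </div>
</div>
'''


def latest_post_url(list_html, base_url):
    """Return the absolute URL of the first link in the list, or None."""
    soup = BeautifulSoup(list_html, "html.parser")
    first = soup.select_one("#flagList > div.clear.ab-webzine > div > a")
    if first is None or not first.has_attr("href"):
        return None
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return urljoin(base_url, first["href"])


print(latest_post_url(LIST_HTML, BASE_URL))
```

Returning None when nothing matches lets the caller notice a changed page layout instead of hitting an IndexError on link[0].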

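It returns [].
I copied #flagArticle > div.document_1234567_0.rhymix_content.xe_content from Chrome DevTools on one particular post.
So the selector contains that post's specific ID, "1234567".
But what I want is the text of the latest post, not of one particular post.
So I changed the selector to #flagArticle, and it returns the following.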

[<article id="flagArticle">
<!--BeforeDocument(1234567,0)-->
<div class="document_1234567_0 rhymix_content xe_content"><p>TEXTTEXTTEXT 1</p>
<p>TEXTTEXTTEXT 2</p>
<p>TEXTTEXTTEXT 3</p></div><!--AfterDocument(1234567,0)-->
<!--
        -- color class --
        vb-white
        vb-green
        vb-blue
        vb-skyblue
        vb-orange
        vb-red
-->
<div class="vote">
<button class="vb-btn vb-orange" onclick="vote_doVote('Up','1234567');return false;" type="button">
<span class="lang">
<i class="fas fa-star fa-spin fa-fw"></i>
                                recommended            </span>
<span class="num" id="vm_v_count">
                        4               </span>
</button> <button class="vb-btn vb-skyblue" onclick="vote_doVote('Declare','1234567');return false;" type="button">
<span class="lang">
<i class="fa fa-times-circle"></i>
                        report            </span>
<span class="num" id="vm_d_count">
</span>
</button></div> </article>]

But I want

TEXTTEXTTEXT 1
TEXTTEXTTEXT 2
TEXTTEXTTEXT 3

What should I change?
(I can't share the URL because it is a private site.)

vhmi4jdf #1

Just select the first div.

from bs4 import BeautifulSoup

data = '''\
<article id="flagArticle">
<!--BeforeDocument(1234567,0)-->
<div class="document_1234567_0 rhymix_content xe_content"><p>TEXTTEXTTEXT 1</p>
<p>TEXTTEXTTEXT 2</p>
<p>TEXTTEXTTEXT 3</p></div><!--AfterDocument(1234567,0)-->
<!--
        -- color class --
        vb-white
        vb-green
        vb-blue
        vb-skyblue
        vb-orange
        vb-red
-->
<div class="vote">
<button class="vb-btn vb-orange" onclick="vote_doVote('Up','1234567');return false;" type="button">
<span class="lang">
<i class="fas fa-star fa-spin fa-fw"></i>
                                recommended            </span>
<span class="num" id="vm_v_count">
                        4               </span>
</button> <button class="vb-btn vb-skyblue" onclick="vote_doVote('Declare','1234567');return false;" type="button">
<span class="lang">
<i class="fa fa-times-circle"></i>
                        report            </span>
<span class="num" id="vm_d_count">
</span>
</button></div> </article>
'''

soup = BeautifulSoup(data, 'html.parser')

div = soup.select_one('#flagArticle div.xe_content.rhymix_content')
for p in div.select('p'):
    print(p.text)
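
The same idea can be wrapped in a small helper that works for any post. This is only a sketch: the function name extract_post_text is made up for illustration, and it assumes every post body keeps the rhymix_content and xe_content classes while only the document_<id>_0 class changes from post to post, as the sample markup suggests.

```python
from bs4 import BeautifulSoup


def extract_post_text(html):
    """Return the post body text, one paragraph per line, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the classes shared by all posts, not on document_<id>_0.
    body = soup.select_one("#flagArticle div.rhymix_content.xe_content")
    if body is None:
        return None
    return "\n".join(p.get_text(strip=True) for p in body.select("p"))


# Sample markup with a different (hypothetical) post ID than the question's.
sample = '''
<article id="flagArticle">
<div class="document_7654321_0 rhymix_content xe_content"><p>TEXTTEXTTEXT 1</p>
<p>TEXTTEXTTEXT 2</p></div>
<div class="vote"><span class="lang">recommended</span></div>
</article>
'''

print(extract_post_text(sample))
```

Because the selector never mentions the post ID, the same call should work on whichever post the latest link points to, provided the site keeps this markup.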
