I need help finding the correct HTML tag for the headline link URL. My scraper is meant to scrape the headline, story text, and link

mec1mxoz  posted on 2021-08-20  in  Java

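When I run the scraper, the output on the Django home page looks fine, but the scraped URL returns a 404 error page that lists other articles, which suggests I am reading the wrong tag: https://www.coindesk.com/news/tag/crypto-lending. The correct link URL is https://www.coindesk.com/news/tag/crypto-lending. The correct tag holding the link is <a title= href<. How do I write the code to select this tag?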

from bs4 import BeautifulSoup
import requests

crypto_headlines = []

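def crypto_news():
    """Scrape headlines, summaries, and links from the CoinDesk news page."""

    # Identify the client with a browser-like User-Agent header.
    # requests expects a dict here, so the string needs a 'User-Agent' key.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
    }

    base_url ='https://www.coindesk.com/news'

    # Pass the headers along; otherwise they are defined but never sent.
    source = requests.get(base_url, headers=headers).text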

    soup = BeautifulSoup(source, "html.parser")       

    articles = soup.find_all(class_ = 'text-content')

    #print(len(articles))
    #print(articles) 

    for article in articles:

        try:

            headline = article.h4.text.strip()
            text = article.find(class_="card-text").text.strip()
            link = base_url + article.a['href']
            #img_url = base_url + article.image_src['src']

            crypto_dict = {}

            crypto_dict['Headline']= headline
            crypto_dict['Text'] = text
            crypto_dict['Link']= link

            crypto_headlines.append(crypto_dict)
        except AttributeError as ex:
            print('Error:', ex)

    print(crypto_headlines)

crypto_news()

vmdwslir1#

You are selecting the wrong <a>. article.a returns the first <a> in the card, but the link you need is in the second <a>.
Here is the code:

link = base_url + article.find_all("a")[1]["href"]

Changing just that one line should solve your problem!
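
To see the idea in isolation, here is a minimal sketch that runs against a small inline HTML snippet instead of the live site. The markup in SAMPLE_CARD, the SITE_ROOT constant, and the extract_link helper are all hypothetical and only for illustration; the real CoinDesk HTML may be structured differently. It also assumes the href values are site-relative paths (starting with "/"), in which case joining them to the site root with urllib.parse.urljoin may be safer than concatenating them onto a base_url that already ends in /news.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for one 'text-content' card.
# The first <a> is a category link; the second <a> carries the headline link.
SAMPLE_CARD = """
<div class="text-content">
  <a href="/tag/markets" title="Markets">Markets</a>
  <a href="/news/tag/crypto-lending" title="Crypto lending headline">
    <h4>Crypto lending headline</h4>
  </a>
  <p class="card-text">Short summary of the story.</p>
</div>
"""

# Assumed site root (hypothetical constant), used instead of base_url + '/news'.
SITE_ROOT = "https://www.coindesk.com"


def extract_link(article, site_root=SITE_ROOT):
    """Return the absolute URL of the second <a> in a card, or None."""
    anchors = article.find_all("a", href=True)
    if len(anchors) < 2:
        # Fewer than two links: nothing to pick, so skip instead of raising.
        return None
    # urljoin resolves a site-relative href against the site root.
    return urljoin(site_root, anchors[1]["href"])


soup = BeautifulSoup(SAMPLE_CARD, "html.parser")
card = soup.find(class_="text-content")
link = extract_link(card)
print(link)
```

Checking the number of anchors first avoids the IndexError that find_all("a")[1] would raise on a card with only one link, which the except AttributeError clause in the question would not catch.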
