I am currently trying to build an article scraper for a website, but I have run into a problem I don't know how to solve. Here is the code:
import newspaper
from newspaper import Article
import pandas as pd
import datetime
from datetime import datetime, timezone
import requests
from bs4 import BeautifulSoup
import re
urls = open("urls_test.txt").readlines()
final_df = pd.DataFrame()
for url in urls:
    article = newspaper.Article(url="%s" % (url), language='en')
    article.download()
    article.parse()
    article.nlp()
    # scrape html part
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")
    texts = results.find_all("div", class_="component article-body-text")
    paragraphs = []
    for snippet in texts:
        paragraphs.append(str(snippet))
    CLEANR = re.compile('<.*?>')
    def remove_html(input):
        cleantext = re.sub(CLEANR, '', input)
        return cleantext
    paragraphs_string = ' '.join(paragraphs)
    paragraphs_clean = remove_html(paragraphs_string)
    #
    temp_df = pd.DataFrame(columns=['Title', 'Authors', 'Text', 'Summary', 'published_date', 'URL'])
    temp_df['Authors'] = article.authors
    temp_df['Title'] = article.title
    temp_df['Text'] = paragraphs_clean
    temp_df['Summary'] = article.meta_description
    publish_date = article.publish_date
    publish_date = publish_date.replace(tzinfo=None)
    temp_df['published_date'] = publish_date
    temp_df['URL'] = article.url
    final_df = pd.concat([final_df, temp_df], ignore_index=True)
final_df.to_excel('Telegraph_test.xlsx')
My problem occurs in the # scrape html part section. Both the main code without the # scrape html part and the # scrape html part on its own run fine (the latter returns the results variable as a bs4.element.Tag containing the scraped material), but as the loop keeps running, the results variable becomes NoneType:
AttributeError: 'NoneType' object has no attribute 'find_all'
1 Answer
Without knowing the URLs or the HTML structure, I would say there is a page that has no element with id="main-content" as an attribute, so you should always check whether the element you are looking for is actually available. There is also no need for remove_html(): just use .get_text() to extract the text from the element.
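
A minimal sketch of both suggestions, assuming the same urls_test.txt input and the id="main-content" / div.component.article-body-text selectors from the question; skipping a URL whose page lacks the container is my own choice of fallback:

import requests
from bs4 import BeautifulSoup

# strip the trailing newlines that readlines() keeps on each URL
urls = [u.strip() for u in open("urls_test.txt")]

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # soup.find() returns None when nothing matches, so guard
    # before calling .find_all() on the result
    results = soup.find(id="main-content")
    if results is None:
        print("No element with id='main-content' on %s, skipping" % url)
        continue

    texts = results.find_all("div", class_="component article-body-text")

    # .get_text() pulls the text out of each tag directly,
    # so no regex-based tag stripping is needed
    paragraphs_clean = ' '.join(t.get_text(strip=True) for t in texts)

This way, URLs whose pages do not contain the container are reported and skipped instead of raising the AttributeError.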