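I am completely new to all of this. I am trying to extract articles from many pages, but I have put only 4 URLs in the code below, and I only need to extract the important paragraphs from <p>text</p> == $0.
Here is my sample code: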
currency = 'BTC'
btc_today = pd.DataFrame({'Currency': [],
                          'Date': [],
                          'Title': [],
                          'Content': [],
                          'URL': []})
links = ["https://www.investing.com/news/cryptocurrency-news/3-reasons-why-bitcoins-drop-to-21k-and-the-marketwide-selloff-could-be-worse-than-you-think-2876810",
         "https://www.investing.com/news/cryptocurrency-news/crypto-flipsider-news--btc-below-22k-no-support-for-pow-eth-ripple-brazil-odl-cardano-testnet-problems-mercado-launches-crypto-2876644",
         "https://www.investing.com/news/cryptocurrency-news/can-exchanges-create-imaginary-bitcoin-to-dump-price-crypto-platform-exec-answers-2876559",
         "https://www.investing.com/news/cryptocurrency-news/bitcoin-drops-7-to-hit-3week-lows-432SI-2876376"]
for link in links:
    driver.get(link)
    driver.maximize_window()
    time.sleep(2)
    data = []
    date = driver.find_element(By.XPATH, f'/html/body/div[5]/section/div[1]/span').text.strip()
    title = driver.find_element(By.XPATH, f'/html/body/div[5]/section/h1').text.strip()
    url = link
    content = driver.find_elements(By.TAG_NAME, 'p')
    for item in content:
        body = item.text
        print(body)
    articles = {'Currency': currency, 'Date': date, 'Title': title, 'Content': body, 'URL': url}
    btc_today = btc_today.append(pd.DataFrame(articles, index=[0]))
btc_today.reset_index(drop=True, inplace=True)
btc_today
I got this result (output). I also tried doing it with the loop below, but it returns many rows rather than one row per article:
for p_number in range(1,10):
    try:
        content = driver.find_element(By.XPATH, f'/html/body/div[5]/section/div[3]/p[{p_number}]').text.strip()
        #print(content)
    except NoSuchElementException:
        pass
Can anyone help? I would really appreciate it. I have honestly tried my best for several days to find a solution, but without any progress.
1 Answer
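I assume you need the main content of each article. To get it, change the locator for 'content' so that it matches only the paragraphs of the article body rather than every <p> element on the page. There are also unnecessary <p> tags, whose content is "Position added successfully to:" and "Continue reading on DailyCoin"; you can skip them with an if statement in the for loop that iterates over the paragraphs.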
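The answer's original code did not survive on this page, so the following is only a minimal sketch of the idea rather than the answerer's exact solution. It assumes the goal is one 'Content' string per article: the paragraph texts are filtered with an if check and then joined. The helper name build_article_body and the constant SKIP_PHRASES are hypothetical, and the XPath in the usage comment is an assumption derived from the asker's second attempt (/html/body/div[5]/section/div[3]/p[...]), which may not match the live page.

```python
# Phrases quoted in the answer; paragraphs containing them are skipped.
SKIP_PHRASES = (
    "Position added successfully to:",
    "Continue reading on DailyCoin",
)


def build_article_body(paragraph_texts, skip_phrases=SKIP_PHRASES):
    """Join paragraph texts into one string, dropping empty and unwanted ones.

    Hypothetical helper: the name and signature are illustrative only.
    """
    kept = []
    for text in paragraph_texts:
        text = text.strip()
        if not text:
            continue  # skip empty <p> elements
        if any(phrase in text for phrase in skip_phrases):
            continue  # skip the widget/footer paragraphs named in the answer
        kept.append(text)
    return "\n".join(kept)


# Possible use inside the question's `for link in links:` loop.
# The XPath below is an assumption based on the asker's second attempt:
#
#     paragraphs = driver.find_elements(
#         By.XPATH, '/html/body/div[5]/section/div[3]/p')
#     body = build_article_body(p.text for p in paragraphs)
#     articles = {'Currency': currency, 'Date': date, 'Title': title,
#                 'Content': body, 'URL': url}

if __name__ == "__main__":
    sample = [
        "Bitcoin fell 7%.",
        "",
        "Position added successfully to:",
        "Analysts disagree.",
        "Continue reading on DailyCoin",
    ]
    print(build_article_body(sample))
```

Because the paragraphs are collected into a single string before the row is built, each article would occupy one row instead of keeping only the last paragraph (as in the first snippet) or producing one row per paragraph.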