Python Selenium WebDriver: extracting only the paragraphs

mv1qrgav  posted on 2022-11-21 in Python

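I am completely new to all of this. I am trying to extract articles from many pages (I have put only 4 URLs in the code below), and I only need the important paragraphs, the ones that show up as <p>text</p> == $0 when inspected. Here is my sample code: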

currency = 'BTC'
btc_today = pd.DataFrame({'Currency':[],
                                'Date':[],
                                'Title': [],
                                'Content': [],
                                'URL':[]})

links = ["https://www.investing.com/news/cryptocurrency-news/3-reasons-why-bitcoins-drop-to-21k-and-the-marketwide-selloff-could-be-worse-than-you-think-2876810",
"https://www.investing.com/news/cryptocurrency-news/crypto-flipsider-news--btc-below-22k-no-support-for-pow-eth-ripple-brazil-odl-cardano-testnet-problems-mercado-launches-crypto-2876644",
    "https://www.investing.com/news/cryptocurrency-news/can-exchanges-create-imaginary-bitcoin-to-dump-price-crypto-platform-exec-answers-2876559",
    "https://www.investing.com/news/cryptocurrency-news/bitcoin-drops-7-to-hit-3week-lows-432SI-2876376"]

for link in links:
  driver.get(link)
  driver.maximize_window()
  time.sleep(2)
  data = []
  date = driver.find_element(By.XPATH, f'/html/body/div[5]/section/div[1]/span').text.strip()
  title = driver.find_element(By.XPATH,f'/html/body/div[5]/section/h1').text.strip()
  url = link
  content = driver.find_elements(By.TAG_NAME, 'p')
  for item in content:
    body = item.text
    print(body)
  articles = {'Currency': currency,'Date': date,'Title': title,'Content': body,'URL': url}
  btc_today = btc_today.append(pd.DataFrame(articles, index=[0]))
  btc_today.reset_index(drop=True, inplace=True)
  btc_today

This is the result I got (output). I also tried doing it with the loop below, but it returns many rows instead of one row per article:

for p_number in range(1,10):
    try:
      content = driver.find_element(By.XPATH, f'/html/body/div[5]/section/div[3]/p[{p_number}]').text.strip()
      #print(content)
    except NoSuchElementException:
      pass

Can anyone help? I would really appreciate it. I have spent days trying my best to find a solution, with no progress.

f5emj3cl  #1

I assume you need to get the main content. To do that, change the locator for 'content':

content = driver.find_elements(By.CSS_SELECTOR, '.WYSIWYG.articlePage p')

In addition, there are unnecessary <p> tags whose content is "Position added successfully to:" and "Continue reading on DailyCoin"; you can skip them with an if statement inside the for loop below:

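for item in content:
    body = item.text
    # Skip the two boilerplate paragraphs mentioned above
    if body.startswith(('Position added successfully to:', 'Continue reading on DailyCoin')):
        continue
    print(body)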
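
To get one row per article rather than only the last paragraph, the kept paragraphs can be joined into a single string before the row is built. The following is a minimal sketch, not a tested solution for this site: the helper names `join_paragraphs` and `make_row` are hypothetical, and matching the unwanted paragraphs by prefix is an assumption. It is written against plain strings so that it runs without a browser.

```python
# Hypothetical helpers; only the two unwanted strings come from the answer above.
UNWANTED_PREFIXES = (
    "Position added successfully to:",
    "Continue reading on DailyCoin",
)


def join_paragraphs(texts, unwanted=UNWANTED_PREFIXES):
    """Join paragraph texts into one string, skipping empty and unwanted ones."""
    kept = []
    for text in texts:
        text = text.strip()
        # Assumption: unwanted paragraphs are identified by how they start.
        # str.startswith accepts a tuple of prefixes.
        if not text or text.startswith(tuple(unwanted)):
            continue
        kept.append(text)
    return "\n".join(kept)


def make_row(currency, date, title, paragraph_texts, url):
    """Build one dict per article, using the column names from the question."""
    return {
        "Currency": currency,
        "Date": date,
        "Title": title,
        "Content": join_paragraphs(paragraph_texts),
        "URL": url,
    }


# Intended use inside the loop over links (requires Selenium and pandas):
#     content = driver.find_elements(By.CSS_SELECTOR, '.WYSIWYG.articlePage p')
#     rows.append(make_row(currency, date, title, [p.text for p in content], link))
# and once, after the loop:
#     btc_today = pd.DataFrame(rows)
```

Collecting the rows in a plain list and building the DataFrame once at the end also avoids `DataFrame.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0.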
