`from selenium import webdriver
import pandas as pd
import re

# Read the Excel file with the links
df = pd.read_excel('file.xlsx')

# Create empty lists to store the extracted data
company_names = []
earnings_dates = []

# Set up the Selenium driver
driver = webdriver.Chrome()

# Iterate over the links in the DataFrame
for index, row in df.iterrows():
    url = row['Link']  # Assuming the links are in column 'Link'
    # Load the URL in the browser
    driver.get(url)
    # Extract the company name using regular expressions
    try:
        html_content = driver.page_source
        match = re.search(r'<h1 class="D\(ib\) Fz\(18px\)">(.*?)</h1>', html_content)
        if match:
            company_name = match.group(1)
        else:
            company_name = 'Company name not found'
    except:
        company_name = 'Company name not found'
    # Extract the earnings date
    try:
        earnings_date_element = driver.find_element_by_xpath('//td[contains(text(), "Earnings Date")]/following-sibling::td')
        earnings_date = earnings_date_element.text.strip()
    except:
        earnings_date = 'Earnings date not found'
    # Append the extracted data to the lists
    company_names.append(company_name)
    earnings_dates.append(earnings_date)

# Close the Selenium driver
driver.quit()

# Create a new DataFrame with the extracted data
df_extracted = pd.DataFrame({'Link': df['Link'], 'Company Name': company_names, 'Earnings Date': earnings_dates})

# Print the extracted data
print(df_extracted)`
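With the code above I can extract the company name, but not the earnings date.
From https://finance.yahoo.com/quote/A?p=A&.tsrc=fin-srch I am trying to extract the following result: **Agilent Technologies, Inc. (A)**, Earnings Date: Aug 14, 2023 - Aug 18, 2023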
1 Answer
On the Yahoo Finance page, the company name sits in the only `<h1>` tag on the page.
Solution
To extract the ***company name*** and the ***earnings date***, you can use the following locator strategies:
Console output:
Note: you need to add the following imports: