Selenium Python: need help web scraping a dynamic website

Posted by a11xaf1n on 2022-12-13 in Python

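I'm very new to web scraping and would appreciate any advice on the following scenario:
I'm trying to generate a list of home loans using the data at https://www.canstar.com.au/home-loans/
Mainly, I want to extract listing values like the ones below: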

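  • Homestar Finance | Star Essentials P&I 80% | Variable
  • Unloan | Home Loan LVR <80% | Variable
  • TicToc Home Loans | Live-in Variable P&I | Variable
  • ubank | Neat Home Loan Owner Occupied P&I 70-80% | Variable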

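and then push them into a nested list: results = [['Homestar Finance', 'Star Essentials P&I 80%', 'Variable'], ...]
For my first attempt I used BeautifulSoup on its own and practised on an offline copy of the site.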

import pandas as pd
from bs4 import BeautifulSoup

with open('/local/path/canstar.html', 'r') as canstar_offline :
    content = canstar_offline.read()

results = [['Affiliate', 'Product Name', 'Product Type']]
    
soup = BeautifulSoup(content, 'lxml')

for listing in soup.find_all('div', class_='table-cards-container') :
    for listing1 in listing.find_all('a') :
        if listing1.text.strip() != 'More details' and listing1.text.strip() != '' :
            results.append(listing1.text.strip().split(' | '))
   
# The first row holds the column names; the remaining rows are the listings
df = pd.DataFrame(results[1:], columns=results[0])

print(df)

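That got me very close to what I want, but unfortunately it doesn't work against the live site, because it looks like I'm being blocked for making repeated requests.
So I tried again with Selenium, but now I'm stuck.
I tried to carry over the filtering logic I used with BeautifulSoup, but I can't get anywhere near the same result with Selenium.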

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.canstar.com.au/home-loans'

results = []

driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)

time.sleep(3)
tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables :
    listing = table.find_element(By.TAG_NAME, 'a')
    print(listing.text)

This version (above) only returns a single listing (I was trying to get the whole table by iterating).
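A likely reason for the single result is that find_element (singular) returns only the first match inside each container, while find_elements (plural) returns all of them. The sketch below ports the BeautifulSoup filter to that call. The helper name listings_from_tables is hypothetical, and it is written against any object that exposes find_elements(by, value) and elements with a .text attribute, so the logic can be checked without a browser. The string "tag name" is the value of Selenium's By.TAG_NAME.

```python
TAG_NAME = "tag name"  # same string as selenium.webdriver.common.by.By.TAG_NAME


def listings_from_tables(tables):
    """Return [[provider, product, type], ...] from every <a> in each table.

    Hypothetical helper: `tables` is what driver.find_elements(
    By.CLASS_NAME, 'table-cards-container') would return.
    """
    rows = []
    for table in tables:
        # find_elements returns every matching link, not just the first one
        for link in table.find_elements(TAG_NAME, "a"):
            text = link.text.strip()
            # Same filter as the BeautifulSoup version: skip blanks and buttons
            if text and text != "More details":
                rows.append(text.split(" | "))
    return rows


# Usage with a live driver (not executed here):
# tables = driver.find_elements(By.CLASS_NAME, "table-cards-container")
# results = listings_from_tables(tables)
```

Whether every remaining link on the live page is a listing title is an assumption carried over from the offline copy.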

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.canstar.com.au/home-loans'

results = []

driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)

time.sleep(3)
tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables :
#     listing = table.find_element(By.TAG_NAME, 'a')
    print(table.text)

This version (above) looks like it gets all the text from the 'table-cards-container' class, but I can't filter it down to just the listings.
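One way to filter that block of text, sketched under the assumption that each listing title sits on its own line in the "Provider | Product | Type" form shown earlier, is to keep only the lines that split into exactly three non-empty parts. The function name listings_from_text and its parameters are hypothetical, and the sample text in the usage comment is made up for illustration.

```python
def listings_from_text(block_text, separator=" | ", fields=3):
    """Keep lines of the form 'A | B | C' and return them as lists of parts."""
    rows = []
    for line in block_text.splitlines():
        parts = [part.strip() for part in line.strip().split(separator)]
        # Keep the line only if it has the expected number of non-empty parts
        if len(parts) == fields and all(parts):
            rows.append(parts)
    return rows


# Usage with a live driver (not executed here):
# for table in tables:
#     results.extend(listings_from_text(table.text))
```

This is more fragile than selecting the links directly, since any unrelated line that happens to contain two separators would also be kept.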


Answer 1, by unftdfkk

I think you could try something like this. I hope the comments in the code explain what it is doing.

# Needed libs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initiate the driver and navigate
driver = webdriver.Chrome()
url = 'https://www.canstar.com.au/home-loans'
driver.get(url)

# Wait until the loan cards are present, then keep them in a list
loans = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//cnslib-table-card")))

# Loop once per loan in the list; XPath positions are 1-based, so include len(loans)
for i in range(1, len(loans) + 1):
    # This XPath selects the title of the loan
    loan_title = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//a)[1]"))).text
    print(loan_title)
    # This XPath selects the first percentage shown for the loan
    loan_first_percentage = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[1]"))).text
    print(loan_first_percentage)
    # This XPath selects the second percentage shown for the loan
    loan_second_percentage = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[3]"))).text
    print(loan_second_percentage)
    # This XPath selects the amount shown for the loan
    loan_amount = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[5]"))).text
    print(loan_amount)
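Since the question asks for a nested list rather than printed values, the same fields could also be collected per card by searching inside each card element instead of rebuilding a document-wide XPath for every index. The sketch below is one way to do that. The helper name rows_from_cards is hypothetical, and the span positions (1st, 3rd and 5th span in each card) are taken from the XPaths in the answer, not verified against the live page. It accepts any objects with find_elements(by, value), such as the loans list returned by the wait above.

```python
TAG_NAME = "tag name"  # same string as selenium.webdriver.common.by.By.TAG_NAME


def rows_from_cards(cards):
    """Return [[title, first_pct, second_pct, amount], ...], one row per card."""
    rows = []
    for card in cards:
        links = card.find_elements(TAG_NAME, "a")
        spans = card.find_elements(TAG_NAME, "span")
        # Skip cards that do not have the layout the answer assumes
        if not links or len(spans) < 5:
            continue
        rows.append([
            links[0].text.strip(),  # ((card)//a)[1] in the answer's XPath
            spans[0].text.strip(),  # ((card)//span)[1]
            spans[2].text.strip(),  # ((card)//span)[3]
            spans[4].text.strip(),  # ((card)//span)[5]
        ])
    return rows


# Usage with a live driver (not executed here):
# results = rows_from_cards(loans)
```

The title in each row can then be split on ' | ' to get the provider, product name and product type columns from the question.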
