Chrome: How do I use Selenium to scrape Google Flights?

nzkunb0c  posted on 2023-04-27  in Go

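I am trying to pull the airline name and price for a specific flight. I am having trouble with the XPath and/or with choosing the right HTML tags, because when I run the code below, all I get is 14 empty lists.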

from selenium import webdriver
from lxml import html
from time import sleep

driver = webdriver.Chrome(r"C:\Users\14074\Python\chromedriver")
URL = 'https://www.google.com/travel/flights/searchtfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'

driver.get(URL)

sleep(1)

tree = html.fromstring(driver.page_source)

for flight_tree in tree.xpath('//div[@class="TQqf0e sSHqwe tPgKwe ogfYpf"]'):
     title = flight_tree.xpath('.//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[2]/div/div[2]/div[6]/div/div[2]/div/div[1]/div/div[1]/div/div[2]/div[2]/div[2]/span/text()')
     price = flight_tree.xpath('.//span[contains(@data-gs, "CjR")]')

     print(title, price)
    
#driver.close()

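This is only the first part of my code, but I can't move on until it works. If anyone can spot what I'm doing wrong, that would be amazing! It has been driving me crazy. Thanks!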

cmssoen2 #1

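I noticed a few problems with your code. First, I believe that when you open this page, Google shows an "I agree to the terms and conditions" pop-up before it shows the page content, so you need to click that button first.
Also, you should call find_elements_by_xpath directly on the driver rather than parsing the page source, because that also lets you work with content rendered by JavaScript. You can find more information here: python tree.xpath return empty list
I used the code below to scrape the titles. (I also changed the XPaths and copied them directly from Google Chrome. You can do this by right-clicking the element -> Inspect, then right-clicking the element in the Elements tab -> Copy -> Copy XPath.)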

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

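# I used these options for the code to work on my Windows Subsystem for Linux
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')  # fixed flag name (was '--disable-dev-sh-usage')

# Note: this answer uses the Selenium 3 API. Selenium 4.3+ removed the
# find_element_by_* helpers in favour of find_element(By.XPATH, ...).
driver = webdriver.Chrome(ChromeDriverManager().install(), options=option)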
URL = 'https://www.google.com/travel/flights/searchtfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'

driver.get(URL)

# This click is necessary to press the "I agree" button on the consent pop-up
driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span').click()

elements = driver.find_elements_by_xpath('//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[3]/div[3]/c-wiz/div/div[2]/div[1]/div/div/ol/li')

for flight_tree in elements:
     title = flight_tree.find_element_by_xpath('.//*[@class="W6bZuc YMlIz"]').text

     print(title)
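
One possible reason the question's loop printed empty lists is that the inner XPath starts from each matched element with `.//` yet re-anchors on the page-level `id="yDmH0d"`. A relative path only searches descendants of the current node, so it cannot reach an ancestor. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` on a made-up, heavily simplified page; the markup, the `card` and `name` class names, and the variable names are assumptions for illustration, not Google's real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the real page structure.
PAGE = """<html>
  <body id="yDmH0d">
    <div class="card"><span class="name">Tokyo</span></div>
    <div class="card"><span class="name">Mumbai</span></div>
  </body>
</html>"""

root = ET.fromstring(PAGE)
cards = root.findall(".//div[@class='card']")

# From the document root, the id-anchored path matches both spans.
from_root = root.findall(".//*[@id='yDmH0d']/div/span")

# From inside a card, the same path matches nothing: the element with
# that id is an ancestor of the card, not one of its descendants.
from_card = [card.findall(".//*[@id='yDmH0d']/div/span") for card in cards]

# A short path that is relative to each card does find the text.
names = [card.find(".//span[@class='name']").text for card in cards]

print(len(from_root), from_card, names)
```

The same reasoning applies to Selenium's element-level lookups: a path passed to an element should describe what lies inside that element, as the `.//*[@class="W6bZuc YMlIz"]` lookup above does.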
k97glaaz #2

I tried the code below with the window maximized and an explicit wait, and it extracted the information successfully:

Sample code:

driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.get("https://www.google.com/travel/flights/searchtfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA")
wait = WebDriverWait(driver, 10)
titles = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div/descendant::h3")))

for name in titles:
  print(name.text)
  price = name.find_element(By.XPATH, "./../following-sibling::div/descendant::span[2]").text
  print(price)

Imports:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Output:

Tokyo
₹38,473
Mumbai
₹3,515
Dubai
₹15,846
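
The explicit wait in this answer differs from the fixed `sleep(1)` in the question: instead of pausing for a set time and hoping the page is ready, it keeps checking a condition until it holds or a timeout expires. The following is a rough pure-Python sketch of that polling idea, not Selenium's actual implementation; `wait_until`, `WaitTimeout`, and `fake_find_titles` are hypothetical names introduced only for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page whose results only appear on the third check.
calls = {"n": 0}


def fake_find_titles():
    calls["n"] += 1
    return ["Tokyo", "Mumbai", "Dubai"] if calls["n"] >= 3 else []


titles = wait_until(fake_find_titles, timeout=2.0, poll=0.01)
print(titles)
```

With a fixed sleep, a slow page yields an empty result and a fast page wastes time; polling returns as soon as the elements exist and fails loudly if they never appear.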
