How to extract data from an SVG viewBox using Python Selenium

gkl3eglg  posted on 2023-08-08  in Python

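I am working with this URL: https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html . The page has an SVG viewBox, and clicking it shows the opening hours. I need to extract those hours using Python Selenium. Can anyone help? I am new to web scraping.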


3xiyfsfu1#

The opening-hours data is stored as JSON inside the HTML page, so you can get it without Selenium, for example:

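import re
import json
import requests

# Send a browser-like User-Agent; the site may not serve the page to the default one.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
}

url = "https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html"

html_text = requests.get(url, headers=headers, timeout=30).text

# The page state is embedded as a JavaScript object assigned to window.__WEB_CONTEXT__.
match = re.search(r"window\.__WEB_CONTEXT__=(\{.*\});\(", html_text)
if match is None:
    raise RuntimeError("window.__WEB_CONTEXT__ not found; the page layout may have changed")

# The pageManifest key is unquoted in the page source, so quote it to get valid JSON.
data = match.group(1).replace("pageManifest", '"pageManifest"')
data = json.loads(data)
data = data["pageManifest"]["redux"]["api"]["responses"]

# Responses are keyed by API path; the hours live under a key containing "/hours".
for k, v in data.items():
    if "/hours" in k:
        print(v)
        break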

Prints:

{
    "data": {
        "openStatus": "CLOSED",
        "openStatusText": "Closed Now",
        "hoursTodayText": "Hours Today: 4:00 pm - 11:59 pm",
        "currentHoursText": "",
        "allOpenHours": [
            {"days": "Tue - Fri", "times": ["4:00 pm - 11:59 pm"]},
            {"days": "Sat - Sun", "times": ["11:00 am - 11:59 pm"]},
        ],
        "addHoursLink": {
            "url": "/UpdateListing-d7222445#Hours-only",
            "text": "+ Add hours",
        },
    },
    "error": None,
}
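
If you want one line per day range rather than the raw dictionary, the printed response can be flattened. This is only a minimal sketch, assuming the response keeps the shape shown above (`data` containing `allOpenHours`, each entry with `days` and `times`). `format_open_hours` is a hypothetical helper name, not part of any library, and the sample is a trimmed copy of the output above.

```python
def format_open_hours(response):
    """Return one 'days: times' string per entry in allOpenHours."""
    # Tolerate a missing or None "data" / "allOpenHours" by falling back to empty values.
    hours = (response.get("data") or {}).get("allOpenHours") or []
    lines = []
    for entry in hours:
        # An entry may list several time ranges for the same days.
        times = ", ".join(entry.get("times", []))
        lines.append(f"{entry.get('days', '?')}: {times}")
    return lines


# Trimmed copy of the response printed above (assumed shape).
sample = {
    "data": {
        "openStatus": "CLOSED",
        "allOpenHours": [
            {"days": "Tue - Fri", "times": ["4:00 pm - 11:59 pm"]},
            {"days": "Sat - Sun", "times": ["11:00 am - 11:59 pm"]},
        ],
    },
    "error": None,
}

for line in format_open_hours(sample):
    print(line)
```

In the scraping script you would pass `v` from the loop instead of `sample`.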

7vhp5slm2#

To click the SVG element, you need to induce WebDriverWait for element_to_be_clickable(). You can use the following locator strategies:

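  • Code block:
# Assumes `driver` is an already-created WebDriver instance, e.g. webdriver.Chrome()
driver.get("https://www.tripadvisor.in/Restaurant_Review-g32655-d7222445-Reviews-The_Anchor-Los_Angeles_California.html")
# Wait until the SVG icon next to the hours is clickable, then click it
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[data-automation='top-info-hours'] > div svg[width='18px'] path:nth-child(2)"))).click()
# Wait until the hours panel is visible, then print its text
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Hours']//following::div[1]"))).text)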

  • Console output:
Sun
11:00 AM - 11:59 PM
Tue
4:00 PM - 11:59 PM
Wed
4:00 PM - 11:59 PM
Thu
4:00 PM - 11:59 PM
Fri
4:00 PM - 11:59 PM
Sat
11:00 AM - 11:59 PM

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
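
The text printed by the Selenium snippet alternates between a day line and a time line, so it can be paired up into a dictionary. This is a rough sketch that assumes the alternating layout holds for the page. `parse_hours_text` is a hypothetical helper name, and the sample string is copied from the console output above.

```python
def parse_hours_text(text):
    """Pair alternating day/time lines into a {day: hours} dict."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines (day, time pairs)")
    # Even positions are days, odd positions are the matching hours.
    return dict(zip(lines[0::2], lines[1::2]))


# Copied from the console output above.
sample_text = """Sun
11:00 AM - 11:59 PM
Tue
4:00 PM - 11:59 PM
Wed
4:00 PM - 11:59 PM
Thu
4:00 PM - 11:59 PM
Fri
4:00 PM - 11:59 PM
Sat
11:00 AM - 11:59 PM
"""

hours_by_day = parse_hours_text(sample_text)
for day, hours in hours_by_day.items():
    print(f"{day}: {hours}")
```

With the snippet above you would pass the `.text` of the located element instead of `sample_text`. Note that a day with more than one time range would break the simple pairing assumed here.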
