Get element text with a partial string match using Selenium (Python)

Posted by vawmfj5a on 2022-11-24 in Python


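I am trying to extract the text from a <strong> tag that is deeply nested in the HTML of this web page: https://www.marinetraffic.com/en/ais/details/ships/imo:9854612

The strong tag is the only tag on the page that contains the string "cubic meters".
My goal is to extract the whole text, i.e. "138124 cubic meters Liquid Gas". When I try the following, I get an error: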

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)
element = driver.find_element_by_link_text("//strong[contains(text(),'cubic meters')]").text
print(element)

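Error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"//strong[contains(text(),'cubic meters')]"}
What am I doing wrong?
The following code also throws an error: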

element = driver.find_element_by_xpath("//strong[contains(text(),'cubic')]").text

selenium

Source: https://stackoverflow.com/questions/68106875/get-element-text-with-a-partial-string-match-using-selenium-python

4 Answers


wwtsj6pe1#

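Your code works with Firefox() but not with Chrome().
The page uses lazy loading, so you have to scroll to the Summary section, and only then does it load the text with the expected strong.
I used a slightly slower method: I search for all elements with class='lazyload-wrapper' and, in a loop, scroll to each item and check whether a strong exists. If there is no strong, I scroll to the next class='lazyload-wrapper' element.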

from selenium import webdriver
import time

#driver = webdriver.Firefox()
driver = webdriver.Chrome()

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)

from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)
elements = driver.find_elements_by_xpath("//span[@class='lazyload-wrapper']")

for number, item in enumerate(elements):
    print('--- item', number, '---')
    #print('--- before ---')
    #print(item.text)

    actions.move_to_element(item).perform()
    time.sleep(0.1)

    #print('--- after ---')
    #print(item.text)

    try:
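        # Note: an XPath starting with '//' searches the whole document,
        # not only inside item (that would need './/strong[...]').
        strong = item.find_element_by_xpath("//strong[contains(text(), 'cubic')]")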
        print(strong.text)
        break
    except Exception as ex:
        #print(ex)
        pass

Result:

--- item 0 ---
--- item 1 ---
--- item 2 ---
173400 cubic meters Liquid Gas

The result shows that I could skip two elements by using elements[2], but I am not sure the text will always be in the third element.
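One way to avoid relying on a fixed index is to wrap the loop in a helper that stops as soon as the text appears. The following is only a sketch under assumptions: find_lazy_text is a made-up name, it scrolls with JavaScript scrollIntoView instead of ActionChains, and it passes the locator strategy as the plain string "xpath" (the value of Selenium's By.XPATH constant), which works with find_elements in both Selenium 3 and Selenium 4, where the find_elements_by_xpath helpers were removed in 4.3.

```python
import time

WRAPPER_XPATH = "//span[@class='lazyload-wrapper']"
STRONG_XPATH = "//strong[contains(text(), 'cubic')]"


def find_lazy_text(driver, wrapper_xpath=WRAPPER_XPATH,
                   target_xpath=STRONG_XPATH, pause=0.1):
    """Scroll each lazy-load wrapper into view until the target shows up.

    Returns the target element's text, or None if it never appears.
    (Hypothetical helper, not part of Selenium.)
    """
    # "xpath" is the string value of selenium's By.XPATH constant.
    for wrapper in driver.find_elements("xpath", wrapper_xpath):
        # Scrolling the wrapper into view triggers the lazy loader.
        driver.execute_script("arguments[0].scrollIntoView(true);", wrapper)
        time.sleep(pause)  # give the page a moment to render the content
        # find_elements returns an empty list instead of raising,
        # so no try/except is needed here.
        found = driver.find_elements("xpath", target_xpath)
        if found:
            return found[0].text
    return None
```

With a real browser this would be called as find_lazy_text(driver) after driver.get(url); a None result means no wrapper produced the text.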
Before I created my version, I tested the other versions; here is the full working code:

from selenium import webdriver
import time

#driver = webdriver.Firefox()
driver = webdriver.Chrome()

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)

def test0():
    elements = driver.find_elements_by_xpath("//strong")
    for item in elements:
        print(item.text)

    print('---')

    item = driver.find_element_by_xpath("//strong[contains(text(), 'cubic')]")
    print(item.text)

def test1a():
    from selenium.webdriver.common.action_chains import ActionChains

    actions = ActionChains(driver)
    element = driver.find_element_by_xpath("//div[contains(@class,'MuiTypography-body1')][last()]//div")
    # Python's ActionChains has no build() method (that is the Java API)
    actions.move_to_element(element).perform()
    text = element.text
    print(text)

def test1b():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    text = driver.find_element_by_xpath("//div[contains(@class,'MuiTypography-body1')][last()]//strong").text
    print(text)

def test2():
    from bs4 import BeautifulSoup
    import re
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.find_all(string=re.compile(r"\d+ cubic meters")))

def test3():
    from selenium.webdriver.common.action_chains import ActionChains

    actions = ActionChains(driver)
    elements = driver.find_elements_by_xpath("//span[@class='lazyload-wrapper']")

    for number, item in enumerate(elements, 1):
        print('--- number', number, '---')
        #print('--- before ---')
        #print(item.text)

        actions.move_to_element(item).perform()
        time.sleep(0.1)

        #print('--- after ---')
        #print(item.text)

        try:
            # Note: '//' searches the whole document, not only inside item.
            strong = item.find_element_by_xpath("//strong[contains(text(), 'cubic')]")
            print(strong.text)
            break
        except Exception as ex:
            #print(ex)
            pass

#test0()
#test1a()
#test1b()
#test2()
test3()


e4eetjau2#

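You can use Beautiful Soup for this, more precisely its string argument; from the documentation: "you can search for strings instead of tags".
You can also pass a regular expression pattern as the argument.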

>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup(driver.page_source, "html.parser")
>>> soup.find_all(string=re.compile(r"\d+ cubic meters"))
['173400 cubic meters Liquid Gas']

If you are sure there is only one result, or you only need the first one, you can use find instead of find_all.
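To illustrate what a string search does, here is a rough standard-library sketch that collects the text nodes matching a regular expression. It is not Beautiful Soup's implementation: the StringFinder class, the find_strings helper and the sample HTML are made up for this example, and html.parser is used only so that the snippet runs without extra packages.

```python
import re
from html.parser import HTMLParser


class StringFinder(HTMLParser):
    """Collect text nodes that match a regex (hypothetical helper)."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.matches = []

    def handle_data(self, data):
        # Called for every piece of text between tags.
        text = data.strip()
        if self.pattern.search(text):
            self.matches.append(text)


def find_strings(html, pattern):
    """Return all matching text nodes, similar in spirit to find_all(string=...)."""
    finder = StringFinder(pattern)
    finder.feed(html)
    finder.close()
    return finder.matches


# Made-up snippet standing in for driver.page_source.
SAMPLE_HTML = (
    "<div><span>Capacity</span>"
    "<strong>173400 cubic meters Liquid Gas</strong></div>"
)

matches = find_strings(SAMPLE_HTML, r"\d+ cubic meters")
first = matches[0] if matches else None  # the "find" case: first hit or None
print(matches)
print(first)
```

Taking the first match or None mirrors the difference between the two calls: one returns every hit as a list, the other only the first hit.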


pqwbnv8z3#

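Your XPath expression is correct and works in Chrome. You get NoSuchElementException because the element has not loaded within the 3 seconds you wait, so it does not exist yet.
To wait for an element, use the WebDriverWait class. It explicitly waits for a specific condition on the element; in your case, presence is enough.
In the code below, Selenium waits up to 10 seconds for the element to appear in the HTML, polling every 500 milliseconds.
Some useful information:
An element that is not visible returns an empty string as its text. In that case you need to wait for the element to become visible, or scroll to it if it requires scrolling (example added).
You can also use JavaScript to get the text from an element that is not visible.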

from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium import webdriver

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
locator = "//strong[contains(text(),'cubic meters')]"

with webdriver.Chrome() as driver:  # Type: webdriver
    wait = WebDriverWait(driver, 10)

    driver.get(url)

    cubic = wait.until(ec.presence_of_element_located((By.XPATH, locator)))  # Type: WebElement
    print(cubic.text)

    # The below examples are just for information
    # and are not needed for the case

    # Example with scroll. Scroll to the element to make it visible
    cubic.location_once_scrolled_into_view
    print(cubic.text)

    # Example using JavaScript. Works for not visible elements.
    text = driver.execute_script("return arguments[0].textContent", cubic)
    print(text)
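To make the polling behaviour of an explicit wait more concrete, here is a minimal sketch of the idea. It is not Selenium's implementation: wait_until, WaitTimeout and the use of LookupError as a stand-in for NoSuchElementException are assumptions made for illustration only.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Follows the idea behind WebDriverWait(driver, 10).until(...):
    try, sleep poll_frequency seconds, try again, give up after timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            # Stand-in for "element not found yet"; keep polling.
            pass
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)


# Usage with a fake lookup that only succeeds on the third attempt.
attempts = []


def fake_lookup():
    attempts.append(1)
    if len(attempts) < 3:
        raise LookupError("not loaded yet")
    return "173400 cubic meters Liquid Gas"


text = wait_until(fake_lookup, timeout=2.0, poll_frequency=0.01)
print(text, "after", len(attempts), "attempts")
```

The real WebDriverWait constructor accepts poll_frequency and ignored_exceptions arguments, so the 500 millisecond interval can be changed if the page loads slowly.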

That said, the proper way would be to use the MarineTraffic API.


envsm3lx4#

I think you should scroll to that element first, and only then try to access it, including getting its text.

from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)
element = driver.find_element_by_xpath("//div[contains(@class,'MuiTypography-body1')][last()]//div")
# Python's ActionChains has no build() method (that is the Java API)
actions.move_to_element(element).perform()
text = element.text

If the above is still not enough, you can scroll the full page height once, like this:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(0.5)
the_text = driver.find_element_by_xpath("//div[contains(@class,'MuiTypography-body1')][last()]//strong").text
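On lazy-loading pages a single scroll may not be enough, because the page grows as new content loads. A possible variation, offered as a sketch and not as part of the original answer, is to keep scrolling until the document height stops changing. The helper name scroll_to_bottom and the max_rounds safety limit are assumptions for illustration.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=10):
    """Scroll down until the document height stops growing.

    Returns the number of scrolls performed. (Hypothetical helper.)
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # let lazily loaded content render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so the bottom has been reached.
            return round_number
        last_height = new_height
    return max_rounds
```

After scroll_to_bottom(driver) returns, the strong element can be looked up as in the snippet above.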

