Selenium (or Scrapy) does not return the rendered page source in Python, but the browser displays the page correctly

Posted by ycl3bljg on 2023-01-08 in Python

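I want to automate some of my daily work tasks by scraping an internal website that manages a large amount of data. The site is rendered with JavaScript, so I tried scraping it with Python and Selenium, but it does not work. The page displays correctly in the new tab that opens, but when I print the page source it looks as if JavaScript were not enabled.
My code is below. http://intranet.page:port/path is just a placeholder.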

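import os
import undetected_chromedriver as uc
from selenium.webdriver.support.wait import WebDriverWait

def document_initialised(driver):
    # Placeholder condition: it is always truthy, so the wait below returns immediately.
    return True

# Append the driver folder to PATH; os.pathsep keeps it separate from the previous entry.
os.environ['PATH'] += os.pathsep + r"D:/SeleniumDrivers"
driver = uc.Chrome()

driver.get("http://intranet.page:port/path")
WebDriverWait(driver, timeout=10).until(document_initialised)
print(driver.page_source)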

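I also tried undetected_chromedriver, but nothing changed. I tried a different browser (Edge) with the same result.
I also tried Scrapy, but it returned many errors. Here is the most common one: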

>>> fetch('http://intranet.page:port/path')  
2022-12-28 23:08:42 [scrapy.core.engine] INFO: Spider opened
2022-12-28 23:08:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://intranet.page:port/path/robots.txt> (referer: None)
2022-12-28 23:08:42 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2022-12-28 23:08:42 [filelock] DEBUG: Attempting to acquire lock 2334601525136 on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Lock 2334601525136 acquired on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Attempting to release lock 2334601525136 on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Lock 2334601525136 released on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://intranet.page:port/path> (referer: None)
2022-12-28 23:08:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://intranet.page:port/path> (referer: None)
Traceback (most recent call last):

Answer #1 (qgzx9mmu)

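This is caused by iframes, which need special treatment. driver.page_source only returns the document of the browsing context the driver is currently in, so content rendered inside an iframe does not appear until you switch into that frame with driver.switch_to.frame.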
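
A minimal sketch of that approach, assuming the missing content sits in one or more (possibly nested) iframes. The helper name collect_frame_sources and its max_depth parameter are hypothetical and not part of Selenium. Only find_elements, switch_to.frame, switch_to.parent_frame and page_source are real WebDriver calls. The helper takes an already-created driver, such as the one from the question.

```python
def collect_frame_sources(driver, max_depth=3):
    """Return the page source of the current document and of every iframe
    nested inside it, visiting frames depth-first.

    `driver` is any Selenium WebDriver instance; `max_depth` (an assumption
    of this sketch) limits how many levels of nested iframes are entered.
    """
    # Source of whichever document the driver is currently focused on.
    sources = [driver.page_source]
    if max_depth <= 0:
        return sources

    # "tag name" is the string value of Selenium's By.TAG_NAME strategy.
    frame_count = len(driver.find_elements("tag name", "iframe"))

    for index in range(frame_count):
        # switch_to.frame accepts a zero-based index, a name/id, or a WebElement.
        driver.switch_to.frame(index)
        try:
            sources.extend(collect_frame_sources(driver, max_depth - 1))
        finally:
            # Always step back up so the next index refers to the same parent.
            driver.switch_to.parent_frame()

    return sources
```

With the driver from the question, calling collect_frame_sources(driver) after driver.get(...) returns a list whose first item is the top-level source and whose remaining items are the iframe documents. If an iframe loads late, waiting on expected_conditions.frame_to_be_available_and_switch_to_it from selenium.webdriver.support before reading page_source may be more reliable than switching immediately.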
