Chrome 502 Bad Gateway Cloudflare error when web scraping with Selenium

bgtovc5b  posted on 2023-06-19  in Go

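I am currently using Selenium in Python to scrape an online database. The database is laid out so that I have to navigate between pages to reach the data I am interested in, and every time I run my code I run into a 502 Bad Gateway error.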

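The error seems to go away sometimes, but that appears to depend on where in the loop the 502 pops up. Any advice on how to avoid it would be greatly appreciated. For reference, here is the part of my code that interacts with Chrome: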

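# ! Final !
#### Imports ####
import math
import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

#### Define Driver & Starting URL ####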
# Location of chromedriver
driver_path = "/Users/shrey/Desktop/Python Projects/Selenium/chromedriver"

# Beginning url & initialize driver
url = "https://tamu.libguides.com/az.php"
driver = webdriver.Chrome()

# Make driver wait for elements to load when find_element() is run for the rest of our code
driver.implicitly_wait(10)

# Launch driver
driver.get(url)

# Press "Ancestry Database" link
driver.find_element(By.LINK_TEXT,
                    "Ancestry Library").click()

# Give time for user to login to database
time.sleep(30)

# Go to link where we can search from
home = "https://www.ancestrylibrary.com/search/collections/1742/"
driver.get(home)

# Switch to first tab (Search tab we just opened)
driver.switch_to.window(driver.window_handles[0])

#### Loop through each year present in the data ####
for yr in range(1886, 1952):
    # Go to search home
    driver.get(home)
    
    # Find textbox & Input Year --------
    year_input = driver.find_element(By.CSS_SELECTOR, "#sfs_SelfCivilYear")
    year_input.send_keys(str(yr))

    # Press "search" button
    driver.find_element(By.CSS_SELECTOR, "#searchButton").click()

    # Determine number of times we need to loop --------
    # Find text which includes total number of results (formatted as "Results 1–20 of 1,351")
    n_raw = driver.find_element(By.XPATH,
                                '//*[@id="results-header"]/h3').text

    # Isolate the important number (1,351)
    n_num = n_raw.split()[-1]  # pulls the last word from the string - our desired number

    # Remove comma and convert to number ("1,351" >>> 1351)
    n_total = int(re.sub(",", "", n_num))

    # Determine number of pages (20 results each) we need to scrape;
    # ceil avoids an extra iteration when the total is an exact multiple of 20
    loop_count = math.ceil(n_total / 20)

    # Loop thru pages and collect links --------
    # Init empty list
    links = []
    
    # Loop n times (calc'd earlier)
    for i in range(loop_count):
        
        # If we are on our last iter, do the same but do not click "next page" button
        if i == loop_count - 1:
            # Find & Store all "View Result" links
            current_pg_links = driver.find_elements(By.CSS_SELECTOR, 
                                                    ".srchFoundDB a")

            # Loop through all links pulled & append
            for link in current_pg_links:
                # Get actual url from 'href' attribute
                url = link.get_attribute('href')

                # Append URL to final list
                links.append(url)

        else:
            # Find & Store all "View Result" links
            current_pg_links = driver.find_elements(By.CSS_SELECTOR, 
                                                    ".srchFoundDB a")

            for link in current_pg_links:
                # Get actual url from 'href' attribute
                url = link.get_attribute('href')

                # Append URL to final list
                links.append(url)

            # Press "next page" button
            driver.find_element(By.CSS_SELECTOR,
                                "a.ancBtn.sml.green.icon.iconArrowRight").click()
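
The page-count step in the loop above can be checked in isolation from the browser. The following is a minimal sketch, assuming the results header always ends with the total (as in "Results 1–20 of 1,351") and that each page shows 20 results; the helper name `page_count` is hypothetical and not part of the original script.

```python
import math
import re


def page_count(header_text, per_page=20):
    """Return how many result pages are needed for the total in the header."""
    # The total is the last whitespace-separated word, e.g. "1,351"
    last_word = header_text.split()[-1]
    # Drop thousands separators before converting ("1,351" -> 1351)
    total = int(re.sub(",", "", last_word))
    # Round up so a partially filled last page is still counted
    return math.ceil(total / per_page)
```

Rounding up gives 68 pages for 1,351 results and exactly 2 pages for 40 results, whereas `floor(n / 20) + 1` would ask for a third page in the second case.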

iaqfqrcu1#

502 Bad Gateway Cloudflare error

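A 502 Bad Gateway Cloudflare error occurs when Cloudflare cannot establish a valid connection to your website's origin web server. Although this error relates to the server side (that is, your web host), it can also occur if the Cloudflare service is down or not configured correctly.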

Details

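When you visit a website, the client sends a request to a web server. The web server receives and processes the request, then sends back the requested resource along with HTTP headers and an HTTP status code. Normally you do not see the status code unless something goes wrong. When a website uses Cloudflare, however, the request goes to Cloudflare first and is forwarded from there to the origin server. If Cloudflare cannot get a valid response from the origin, or Cloudflare itself is down or misconfigured, the client receives a 502. The status code is the server's way of telling you that something went wrong, together with a code that helps diagnose it.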
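
As a small illustration of what the status code carries, Python's standard library already knows the reason phrase for each code. This is only a sketch for reading a code such as 502; the helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code, its standard reason phrase, and whether it is a 5xx error."""
    status = HTTPStatus(code)
    # Codes 500-599 signal that the failure is on the server side
    kind = "server error" if 500 <= code <= 599 else "not a server error"
    return f"{code} {status.phrase} ({kind})"


print(describe_status(502))  # prints "502 Bad Gateway (server error)"
```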

Depending on your web server and browser, you may see different 502 error messages, but they all mean the same thing:

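  • 502 Bad Gateway
  • Error 502
  • 502 Proxy Server
  • HTTP 502
  • 502 Proxy Error
  • Error (502)
  • HTTP Error 502 - Bad Gateway
  • 502 Bad Gateway Nginx
  • Server Error: The web server encountered a temporary error and could not complete your request
  • 502 Error
  • 502 Service Temporarily Overloaded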

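Some websites also customize how the 502 gateway error looks. All variants mean the same thing, however: the server acting as a proxy did not receive a valid response from the origin server.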

Causes

The two possible causes of this 502 Bad Gateway Cloudflare error are:

  • A 502 status code from the origin web server
  • A 502 error from Cloudflare

Solution

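A 502 Bad Gateway Cloudflare error is a network/server problem, but it can sometimes be a client-side problem too. Some common client-side steps to fix the error and get back up and running are: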

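  • Clear the browser cache and reload the page.
  • Check for DNS server issues.
  • Check with your host.
  • Temporarily disable the Cloudflare proxy.
  • Temporarily turn off the CDN or firewall.
  • Check for plugin/theme conflicts.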
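
Because the question reports that the error sometimes goes away on its own, the "reload the page" step can also be automated in the scraping script. The following is a minimal sketch under stated assumptions, not a guaranteed fix. The WebDriver API does not directly report the HTTP status code, so the check relies on the page title containing both "502" and "Bad Gateway", which is a heuristic. The names `looks_like_502` and `get_with_retry`, the attempt count, and the delays are all illustrative and not part of Selenium.

```python
import time


def looks_like_502(driver):
    """Heuristic (assumption): the error page's title mentions 502 Bad Gateway."""
    title = (driver.title or "").lower()
    return "502" in title and "bad gateway" in title


def get_with_retry(driver, url, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Load url, reloading with exponential backoff while a 502 page is shown.

    Returns the number of attempts used; raises RuntimeError if every
    attempt still ends on a page that looks like a 502.
    """
    for attempt in range(max_attempts):
        driver.get(url)
        if not looks_like_502(driver):
            return attempt + 1
        # Wait 2s, 4s, 8s, ... before retrying, but not after the final attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still seeing 502 after {max_attempts} attempts: {url}")
```

In the question's loop, `get_with_retry(driver, home)` could stand in for `driver.get(home)`. A 502 that appears after clicking the "next page" button would need a similar check followed by `driver.refresh()`. If the origin server stays down, retrying will not help and the script fails with a clear error instead of a missing-element exception.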

