To iterate over all pages and collect the data you need, you can use a while loop with non-token-based pagination. It will go through all pages no matter how many there are, which means the pagination is not hardcoded (for i in range(<value>)). Also keep in mind that the request and soup must be inside the while loop so that the HTML response is updated for each new page. Pagination continues as long as the next-page button exists (determined by the presence of the button's selector on the page, in our case the CSS selector .d6cvqb a[id=pnnext]): you need to increase the value of ["start"] by 10 to access the next page if it exists; otherwise, we need to exit the while loop:
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break
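The idea above can be sketched on its own, independent of the scraper: the start offset grows by the page size until the next-page marker disappears. Here has_next is a hypothetical stand-in for checking whether the .d6cvqb a[id=pnnext] button exists on the current page.

```python
# Minimal sketch of non-token pagination: yield each offset, then advance
# by the page size only while a "next page" marker is still present.
def page_offsets(start=0, step=10, has_next=lambda offset: offset < 30):
    offset = start
    while True:
        yield offset          # fetch and parse the page at this offset
        if has_next(offset):  # next button present -> advance one page
            offset += step
        else:                 # no next button -> stop paginating
            break

print(list(page_offsets()))  # [0, 10, 20, 30]
```

The generator does not know the page count in advance, which is exactly why this works for an arbitrary number of result pages.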
Check the full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "cars",    # query example
    "hl": "en",     # language
    "gl": "uk",     # country of the search, UK -> United Kingdom
    "start": 0,     # results offset, 0 is the first page
    # "num": 100    # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10  # page limit for example
page_num = 0
data = []

# pagination
while True:
    page_num += 1
    print(f"page: {page_num}")

    # request and soup stay inside the loop so each iteration gets fresh HTML
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            snippet = None  # some results have no snippet
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # condition for exiting the loop when the page limit is reached
    if page_num == page_limit:
        break

    # condition for exiting the loop in the absence of the next-page button
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Cars (2006) - IMDb",
    "snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
    "links": "https://www.imdb.com/title/tt0317219/"
  },
  {
    "title": "Cars (film) - Wikipedia",
    "snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
    "links": "https://en.wikipedia.org/wiki/Cars_(film)"
  },
  {
    "title": "Cars - Rotten Tomatoes",
    "snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
    "links": "https://www.rottentomatoes.com/m/cars"
  },
  other results ...
]
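Once collected, the list of dicts can be persisted to disk. A minimal sketch, using hypothetical records shaped like the output above (values shortened) and a hypothetical results.json filename:

```python
import json

# Hypothetical records shaped like the scraper's output above
data = [
    {"title": "Cars (2006) - IMDb", "snippet": None,
     "links": "https://www.imdb.com/title/tt0317219/"},
    {"title": "Cars (film) - Wikipedia", "snippet": "Cars is a 2006 ...",
     "links": "https://en.wikipedia.org/wiki/Cars_(film)"},
]

# ensure_ascii=False keeps any non-ASCII characters in titles readable on disk
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Round trip: read the file back and confirm nothing was lost
with open("results.json", encoding="utf-8") as f:
    loaded = json.load(f)

print(len(loaded))  # 2
```

Note that json round-trips None as null, so results without a snippet survive the save/load unchanged.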
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json

params = {
    "api_key": "...",    # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google",  # serpapi parser engine
    "q": "cars",         # search query
    "gl": "uk",          # country of the search, UK -> United Kingdom
    "num": "100"         # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

page_limit = 10
organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary
    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    if page_num == page_limit:
        break

    # advance to the next page by merging next_link's query string into params
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
  {
    "title": "Rally Cars - Page 30 - Google Books result",
    "snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
    "link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
  },
  {
    "title": "Independent Sports Cars - Page 5 - Google Books result",
    "snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
    "link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
  },
  other results...
]
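The SerpApi loop advances by merging the query string of next_link back into the request parameters. A standalone sketch of that round trip, using a hypothetical next_link value shaped like the ones the API returns:

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next_link, purely to illustrate the query-string round trip
next_link = "https://serpapi.com/search.json?engine=google&q=cars&gl=uk&num=100&start=100"

# urlsplit(...).query -> "engine=google&q=cars&...", parse_qsl -> key/value
# pairs, dict -> a mapping that can be merged into the next request's params
next_params = dict(parse_qsl(urlsplit(next_link).query))

print(next_params["start"])  # 100
```

parse_qsl returns all values as strings, so "start" comes back as the string "100"; that is fine here because the parameters are sent back as query-string values anyway.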
2 answers
q35jwt9p1#
It is better to use the Selenium library for this job; read this link:
https://realpython.com/modern-web-automation-with-python-and-selenium/
j5fpnvbx2#
You can also use the Google Search Engine Results API from SerpApi. It is a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create a parser and maintain it. The code example and its output are shown above.