To iterate over all pages and collect the data you need, you can use a while loop with non-token-based pagination. It will go through all pages no matter how many there are, which means the pagination is not hardcoded (for i in range(<value>)). Also keep in mind that the request and soup must be inside the while loop so that the HTML response is updated for each new page. Pagination continues as long as the next-page button exists (determined by the presence of the button's selector on the page, in our case the CSS selector .d6cvqb a[id=pnnext]): you need to increase the value of ["start"] by 10 to access the next page if it exists; otherwise, we need to exit the while loop:
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break
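The idea above can be sketched on its own, independent of the scraper: the start offset grows by the page size until the next-page marker disappears. Here has_next is a hypothetical stand-in for checking whether the .d6cvqb a[id=pnnext] button exists on the current page.

```python
# Minimal sketch of non-token pagination: yield each offset, then advance
# by the page size only while a "next page" marker is still present.
def page_offsets(start=0, step=10, has_next=lambda offset: offset < 30):
    offset = start
    while True:
        yield offset          # fetch and parse the page at this offset
        if has_next(offset):  # next button present -> advance one page
            offset += step
        else:                 # no next button -> stop paginating
            break

print(list(page_offsets()))  # [0, 10, 20, 30]
```

The generator does not know the page count in advance, which is exactly why this works for an arbitrary number of result pages.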
Check the full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "cars",    # query example
    "hl": "en",     # language
    "gl": "uk",     # country of the search, UK -> United Kingdom
    "start": 0,     # results offset, 0 is the first page
    # "num": 100    # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10  # page limit for example
page_num = 0
data = []

# pagination
while True:
    page_num += 1
    print(f"page: {page_num}")

    # request and soup stay inside the loop so each iteration gets fresh HTML
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            snippet = None  # some results have no snippet
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # condition for exiting the loop when the page limit is reached
    if page_num == page_limit:
        break

    # condition for exiting the loop in the absence of the next-page button
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Cars (2006) - IMDb",
    "snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
    "links": "https://www.imdb.com/title/tt0317219/"
  },
  {
    "title": "Cars (film) - Wikipedia",
    "snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
    "links": "https://en.wikipedia.org/wiki/Cars_(film)"
  },
  {
    "title": "Cars - Rotten Tomatoes",
    "snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
    "links": "https://www.rottentomatoes.com/m/cars"
  },
  other results ...
]
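Once collected, the list of dicts can be persisted to disk. A minimal sketch, using hypothetical records shaped like the output above (values shortened) and a hypothetical results.json filename:

```python
import json

# Hypothetical records shaped like the scraper's output above
data = [
    {"title": "Cars (2006) - IMDb", "snippet": None,
     "links": "https://www.imdb.com/title/tt0317219/"},
    {"title": "Cars (film) - Wikipedia", "snippet": "Cars is a 2006 ...",
     "links": "https://en.wikipedia.org/wiki/Cars_(film)"},
]

# ensure_ascii=False keeps any non-ASCII characters in titles readable on disk
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Round trip: read the file back and confirm nothing was lost
with open("results.json", encoding="utf-8") as f:
    loaded = json.load(f)

print(len(loaded))  # 2
```

Note that json round-trips None as null, so results without a snippet survive the save/load unchanged.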
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json

params = {
    "api_key": "...",    # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google",  # serpapi parser engine
    "q": "cars",         # search query
    "gl": "uk",          # country of the search, UK -> United Kingdom
    "num": "100"         # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

page_limit = 10
organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary
    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    if page_num == page_limit:
        break

    # advance to the next page by merging next_link's query string into params
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
  {
    "title": "Rally Cars - Page 30 - Google Books result",
    "snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
    "link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
  },
  {
    "title": "Independent Sports Cars - Page 5 - Google Books result",
    "snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
    "link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
  },
  other results...
]
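The SerpApi loop advances by merging the query string of next_link back into the request parameters. A standalone sketch of that round trip, using a hypothetical next_link value shaped like the ones the API returns:

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next_link, purely to illustrate the query-string round trip
next_link = "https://serpapi.com/search.json?engine=google&q=cars&gl=uk&num=100&start=100"

# urlsplit(...).query -> "engine=google&q=cars&...", parse_qsl -> key/value
# pairs, dict -> a mapping that can be merged into the next request's params
next_params = dict(parse_qsl(urlsplit(next_link).query))

print(next_params["start"])  # 100
```

parse_qsl returns all values as strings, so "start" comes back as the string "100"; that is fine here because the parameters are sent back as query-string values anyway.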
2 answers
q35jwt9p1#
It is better to use the Selenium library for this job; read this link:
https://realpython.com/modern-web-automation-with-python-and-selenium/
j5fpnvbx2#
You can also use the Google Search Engine Results API from SerpApi. It is a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create a parser and maintain it. The code example and its output are shown above.