scrapy: How to scrape Google Maps with Python

zbsbpyhn  posted on 2022-11-09  in Python

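I am trying to scrape the number of reviews for a place on Google Maps using Python. For example, the restaurant Pike's Landing (see the Google Maps URL below) has 162 reviews. I want to pull this number in Python.
URL: https://www.google.com/maps?cid=15423079754231040967
I am not very well versed in HTML, but based on some basic examples on the internet I wrote the following code. After running it, however, all I get is an empty variable. I would appreciate it if you could let me know what I am doing wrong.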

from urllib.request import urlopen
from bs4 import BeautifulSoup

quote_page ='https://www.google.com/maps?cid=15423079754231040967'
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
price_box = soup.find_all('button',attrs={'class':'widget-pane-link'})
print(price_box.text)

vxqlmq5t  #1

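Without an API this is hard to do in pure Python. Here is what I ended up with (note that I added &hl=en to the end of the URL to get results in English rather than in my own language):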

import re
import requests
from ast import literal_eval

urls = [
    'https://www.google.com/maps?cid=15423079754231040967&hl=en',
    'https://www.google.com/maps?cid=16168151796978303235&hl=en',
]

for url in urls:
    html = requests.get(url).text
    # Find the escaped JS array that holds the reviews link and the "N reviews" label
    for g in re.findall(r'\[\\"http.*?\d+ reviews?.*?]', html):
        # Turn the JS literal into a Python literal, then parse it
        data = literal_eval(g.replace('null', 'None').replace('\\"', '"'))
        # Decode the \uXXXX escapes left in the URL
        print(bytes(data[0], 'utf-8').decode('unicode_escape'))
        print(data[1])

Prints:

http://www.google.com/search?q=Pike's+Landing,+4438+Airport+Way,+Fairbanks,+AK+99709,+USA&ludocid=15423079754231040967#lrd=0x51325b1733fa71bf:0xd609c9524d75cbc7,1
469 reviews
http://www.google.com/search?q=Sequoia+TreeScape,+Newmarket,+ON+L3Y+8R5,+Canada&ludocid=16168151796978303235#lrd=0x882ad2157062b6c3:0xe060d065957c4103,1
42 reviews
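
To see what the regular expression and literal_eval steps do without sending a request, here is a minimal sketch run against a made-up fragment. The sample string and the helper name parse_review_entries are my own assumptions for illustration; the real page source may look different and can change at any time.

```python
import re
from ast import literal_eval

# Made-up fragment imitating the escaped array in the page source (an assumption).
sample = r'x=[\"http://www.google.com/search?q\\u003dExample+Cafe\",\"469 reviews\",null];'

def parse_review_entries(text):
    """Return (url, review_count) pairs found in text."""
    entries = []
    for g in re.findall(r'\[\\"http.*?\d+ reviews?.*?]', text):
        # Convert the escaped JS array into a Python list
        data = literal_eval(g.replace('null', 'None').replace('\\"', '"'))
        # Decode the \uXXXX escapes that remain in the URL
        url = bytes(data[0], 'utf-8').decode('unicode_escape')
        # Keep only the digits of the "N reviews" label
        count = int(re.sub(r'\D', '', data[1]))
        entries.append((url, count))
    return entries

print(parse_review_entries(sample))
```

Returning the count as an int rather than the "469 reviews" label makes it easier to store or compare, at the cost of assuming the label always contains exactly one number.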

kg7wmglp  #2

You need to look at the page source and parse the window.APP_INITIALIZATION_STATE variable block with a regular expression; all the data you need is in there.
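
A rough sketch of pulling that block out of the HTML might look like the following. It assumes the assignment is immediately followed by window.APP_FLAGS and that the value parses as JSON; both are assumptions about the page layout, and the sample HTML and the helper name extract_init_state are invented for illustration.

```python
import json
import re

# Assumption: the state is assigned as `window.APP_INITIALIZATION_STATE=...;`
# and is immediately followed by `window.APP_FLAGS`. Google can change this.
STATE_RE = re.compile(
    r"window\.APP_INITIALIZATION_STATE\s*=\s*(.*?);\s*window\.APP_FLAGS",
    re.DOTALL,
)

def extract_init_state(html):
    """Return the raw text assigned to APP_INITIALIZATION_STATE, or None."""
    match = STATE_RE.search(html)
    return match.group(1) if match else None

# Synthetic page fragment, invented for illustration only.
sample_html = (
    "<script>window.APP_INITIALIZATION_STATE="
    '[[1,2],null,["469 reviews"]];window.APP_FLAGS=[];</script>'
)

raw_state = extract_init_state(sample_html)
state = json.loads(raw_state)
print(state[2][0])
```

On a real page the nested array is far larger and the position of the review count is not documented, so you would have to search the parsed structure for it rather than index it directly.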
Alternatively, you can use the Google Maps Reviews API from SerpApi.
Sample JSON output:

"place_results": {
  "title": "Pike's Landing",
  "data_id": "0x51325b1733fa71bf:0xd609c9524d75cbc7",
  "reviews_link": "https://serpapi.com/search.json?engine=google_maps_reviews&hl=en&place_id=0x51325b1733fa71bf%3A0xd609c9524d75cbc7",
  "gps_coordinates": {
    "latitude": 64.8299557,
    "longitude": -147.8488774
  },
  "place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x51325b1733fa71bf%3A0xd609c9524d75cbc7%218m2%213d64.8299557%214d-147.8488774&engine=google_maps&google_domain=google.com&hl=en&type=place",
  "thumbnail": "https://lh5.googleusercontent.com/p/AF1QipNtwheOCQ97QFrUNIwKYUoAPiV81rpiW5cIiQco=w152-h86-k-no",
  "rating": 3.9,
  "reviews": 839,
  "price": "$$",
  "type": [
    "American restaurant"
  ],
  "description": "Burgers, seafood, steak & river views. Pub fare alongside steak & seafood, served in a dining room with river views & a waterfront patio.",
  "service_options": {
    "dine_in": true,
    "curbside_pickup": true,
    "delivery": false
  }
}

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google_maps",
    "type": "search",
    "q": "pike's landing",
    "ll": "@40.7455096,-74.0083012,14z",
    "google_domain": "google.com",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

reviews = results["place_results"]["reviews"]

print(reviews)

Output:

839
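
If the search does not resolve to a single place, the "place_results" key may be missing, so a more defensive lookup can help. The sketch below works on a plain dict shaped like the JSON sample above; the helper name review_count is hypothetical and it does not call the API.

```python
def review_count(results):
    """Return the review count from a results dict, or None if it is absent."""
    place = results.get("place_results") or {}
    return place.get("reviews")

# Trimmed copy of the sample output shown earlier in this answer.
sample_results = {
    "place_results": {"title": "Pike's Landing", "rating": 3.9, "reviews": 839}
}

print(review_count(sample_results))
print(review_count({}))
```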

Disclaimer: I work for SerpApi.


qojgxg4l  #3

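Scraping Google Maps without a browser or proxies gets you blocked after a few successful requests, so the main problem in scraping Google Maps is handling cookies and reCAPTCHA.
This is a good post with an example of using Selenium in Python for the same purpose. The general idea is to launch a browser and simulate what a user does on the site.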
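
As a rough sketch of that idea (not the code from the linked post), the snippet below loads the page in headless Chrome and then searches the rendered HTML for "N reviews" labels. The helper names are hypothetical, and it assumes the count appears as plain text such as "469 reviews"; consent pages, slow loading, or CAPTCHAs may get in the way on a real run.

```python
import re

def fetch_rendered_html(url):
    """Load url in headless Chrome and return the rendered page source."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

def find_review_counts(html):
    """Return every 'N reviews' number found in html as an int."""
    matches = re.findall(r"(\d[\d,]*) reviews?\b", html)
    return [int(m.replace(",", "")) for m in matches]

# Usage (needs Chrome and the selenium package; not run here):
# html = fetch_rendered_html("https://www.google.com/maps?cid=15423079754231040967&hl=en")
# print(find_review_counts(html))
```

Keeping the parsing in its own function means it can be checked against saved HTML without starting a browser each time.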
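Another approach is to use a reliable third-party service that does all the work for you and returns the results. For example, you can try Outscraper's Reviews service, which has a free tier.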

from outscraper import ApiClient

api_client = ApiClient(api_key='SECRET_API_KEY')

# Get reviews of the specific place by id

result = api_client.google_maps_reviews('ChIJrc9T9fpYwokRdvjYRHT8nI4', reviewsLimit=20, language='en')

# Get reviews for places found by search query

result = api_client.google_maps_reviews('Memphis Seoul brooklyn usa', reviewsLimit=20, limit=500, language='en')

# Get only new reviews during last 24 hours

from datetime import datetime, timedelta
yesterday_timestamp = int((datetime.now() - timedelta(1)).timestamp())

result = api_client.google_maps_reviews(
    'ChIJrc9T9fpYwokRdvjYRHT8nI4', sort='newest', cutoff=yesterday_timestamp, reviewsLimit=100, language='en')

Disclaimer: I work for Outscraper.
