selenium - Scraping all reviews of a Google Play app with Selenium and Python

tjvv9vkg · asked 2022-11-10 · Python
Follow (0) | Answers (3) | Views (211)

I want to scrape all reviews of a specific app from the Google Play Store. I have prepared the following script:


# App Reviews Scraper

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from bs4 import BeautifulSoup

url = "https://play.google.com/store/apps/details?id=com.android.chrome&hl=en&showAllReviews=true"

# make request

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
SCROLL_PAUSE_TIME = 5

# Get scroll height

last_height = driver.execute_script("return document.body.scrollHeight")
time.sleep(SCROLL_PAUSE_TIME)

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height

# Get everything inside the <html> tag, including JavaScript

html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, 'html.parser')

reviewer = []
date = []

# review text

for span in soup.find_all("span", class_="X43Kjb"):
    reviewer.append(span.text)

# review date

for span in soup.find_all("span", class_="p2TkOb"):
    date.append(span.text)

print(len(reviewer))
print(len(date))

However, it always shows only 203. There are 35,474,218 reviews, so how can I download all of them?

d5vmydt9 1#

wait = WebDriverWait(driver, 1)

try:
    # click "Show More" if it becomes clickable within 1 second
    wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='Show More']"))).click()
except:
    # no button appeared; this snippet is meant to run inside the scroll loop,
    # so `continue` moves on to the next scroll iteration
    continue

Just add this to your infinite scroll to check whether a 'Show More' element has appeared.
Imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
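
For reference, a minimal sketch of how this check could sit inside the question's scroll loop (the `continue` above only makes sense inside a loop). It reuses `driver`, `time`, and `SCROLL_PAUSE_TIME` from the original script; the "Show More" XPath is the one from this answer and may need adjusting if the Play Store UI changes:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

wait = WebDriverWait(driver, 1)
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # scroll to the bottom and give new reviews time to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)

    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        try:
            # a "Show More" button appeared instead of new content -> click it
            wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='Show More']"))).click()
        except TimeoutException:
            # no button and no new content: reached the end
            break
    last_height = new_height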

rdrgkggo 2#

An easier way to scrape app data from the Play Store:

!pip install google_play_scraper 

from google_play_scraper import app

# US Market Google play store reviews

from google_play_scraper import Sort, reviews_all
us_reviews = reviews_all(
    'add the app id here - the string mentioned after the id value in your code',  # use the id from the Play Store hyperlink that you have used above
    sleep_milliseconds=0,  # defaults to 0
    lang='en',             # defaults to 'en'; can be changed to another language
    country='us',          # defaults to 'us'
    sort=Sort.NEWEST,      # defaults to Sort.MOST_RELEVANT
)

Convert it into a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array(us_reviews), columns=['review'])
df = df.join(pd.DataFrame(df.pop('review').tolist()))
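
If reviews_all() is too slow or gets interrupted for an app with a very large number of reviews, the same library also exposes a batched reviews() call with a continuation token, so fetching can resume where the previous batch ended. A minimal sketch, assuming the reviews()/ContinuationToken interface described in the google_play_scraper README (the app id below is only an example):

from google_play_scraper import Sort, reviews

all_reviews = []
token = None

while True:
    # fetch one batch plus the token needed to resume from where it stopped
    batch, token = reviews(
        'com.android.chrome',        # example app id (placeholder)
        lang='en',
        country='us',
        sort=Sort.NEWEST,
        count=200,                   # reviews per request
        continuation_token=token,
    )
    all_reviews.extend(batch)
    # stop when the token is exhausted (assumes the token object exposes .token)
    if token.token is None or not batch:
        break

print(len(all_reviews))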

v2g6jxz6 3#

I don't think there is a way to extract all reviews, because of Google's limits. For example, the com.collectorz.javamobile.android.books app has 2,470 reviews, but scrolling to the very end of the reviews actually shows 879, a 64.41% decrease.
Example calculation:

(879 - 2470)/2470 = -64.41% (64.41% decrease)

In Chrome DevTools, after scrolling to the end of the reviews:

$$(".X5PpBb")
[0 … 99]
[100 … 199]
[200 … 299]
[300 … 399]
[400 … 499]
[500 … 599]
[600 … 699]
[700 … 799]
[800 … 878]
length: 879 👈👈👈
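
The same count can be checked from Selenium instead of the DevTools console (a small sketch; it assumes the driver already has the reviews dialog open and that .X5PpBb is still the per-review class):

from selenium.webdriver.common.by import By

# number of review elements currently loaded in the DOM,
# the Python equivalent of $$(".X5PpBb").length in DevTools
loaded_reviews = len(driver.find_elements(By.CSS_SELECTOR, ".X5PpBb"))
print(loaded_reviews)   # e.g. 879 after scrolling to the end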

In the new UI a "Show More" button appears; execution may stop, stall, or throw an error at that point, which reduces the number of reviews collected.
To extract all available data, you need to check whether the "See all reviews" button is present. The button may be missing if the app has few or no reviews. If the button is present, you need to click it and wait for the data to load:


# if "See all reviews" button present

if driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button"):
    # clicking on the button
    button = driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button")
    driver.execute_script("arguments[0].click();", button)

    # waiting a few sec to load comments
    time.sleep(4)

After the data has loaded, you need to scroll the page. You can make a small change to the page-scrolling algorithm: if the variables new_height and old_height are equal, the program looks for the "Show More" button selector. If that button is present, the program clicks it and continues to the next iteration:

if new_height == old_height:
    try:
        show_more = driver.find_element(By.XPATH, "//span[text()='Show More']")
        driver.execute_script("arguments[0].click();", show_more)
        time.sleep(1)
    except:
        break

Code and a full example in the online IDE:

import time, lxml, re, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

URL = "https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en"

service = Service(ChromeDriverManager().install())

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)

# if "See all reviews" button present

if driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button"):
    # clicking on the button
    button = driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button")
    driver.execute_script("arguments[0].click();", button)

    # waiting a few sec to load comments
    time.sleep(4)

    old_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.fysCi').scrollHeight;
        }
        return getHeight();
    """)

    # scrolling
    while True:
        driver.execute_script("document.querySelector('.fysCi').scrollTo(0, document.querySelector('.fysCi').scrollHeight)")
        time.sleep(1)

        new_height = driver.execute_script("""
            function getHeight() {
                return document.querySelector('.fysCi').scrollHeight;
            }
            return getHeight();
        """)

        if new_height == old_height:
            try:
                # if "Show More" button present
                show_more = driver.find_element(By.XPATH, "//span[text()='Show More']")
                driver.execute_script("arguments[0].click();", show_more)
                time.sleep(1)
            except:
                break

        old_height = new_height

    # done scrolling
    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()

    user_comments = []

    # extracting comments
    for index, comment in enumerate(soup.select(".RHo1pe"), start=1):
        comment_likes = comment.select_one(".AJTPZc")

        user_comments.append({
            "position": index,
            "user_name": comment.select_one(".X5PpBb").text,
            "user_avatar": comment.select_one(".gSGphe img").get("srcset").replace(" 2x", ""),
            "user_comment": comment.select_one(".h3YV2d").text,
            "comment_likes": comment_likes.text.split("people")[0].strip() if comment_likes else None,
            "app_rating": re.search(r"\d+", comment.select_one(".iXRFPc").get("aria-label")).group(),
            "comment_date": comment.select_one(".bp9Aid").text
        })

    print(json.dumps(user_comments, indent=2, ensure_ascii=False))
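
If you would rather keep the result than just print it, a small follow-up sketch (it assumes pandas is installed and reuses the user_comments list built above; the file name is arbitrary):

import pandas as pd

# persist the scraped comments for later analysis
df = pd.DataFrame(user_comments)
df.to_csv("play_store_reviews.csv", index=False)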

If you want to extract reviews faster, you can use the Google Play Product Reviews API from SerpApi. It bypasses blocks from the search engine, and you don't have to create and maintain a parser from scratch.
A code example that paginates through all pages and extracts the reviews:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import os, json

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),                            # your SerpApi API key
    "engine": "google_play_product",                            # serpapi parsing engine
    "store": "apps",                                            # app results
    "gl": "us",                                                 # country of the search
    "hl": "en",                                                 # language of the search
    "product_id": "com.collectorz.javamobile.android.books"     # app id
}

search = GoogleSearch(params)       # where data extraction happens on the backend

reviews = []

while True:
    results = search.get_dict()     # JSON -> Python dict

    for review in results["reviews"]:
        reviews.append({
            "title": review.get("title"),
            "avatar": review.get("avatar"),
            "rating": review.get("rating"),
            "likes": review.get("likes"),
            "date": review.get("date"),
            "snippet": review.get("snippet"),
            "response": review.get("response")
        })

    # pagination
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination", {}).get("next")).query)))
    else:
        break

print(json.dumps(reviews, indent=2, ensure_ascii=False))

There is a Scrape All Google Play App Reviews in Python blog post that covers extracting all the reviews in more detail.
Disclaimer: I work for SerpApi.
