Python: SERP scraping with Beautiful Soup

Posted by qnzebej0 on 2023-11-15 in Python


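I am trying to build a simple script that scrapes the first page of Google search results and exports the results to a .csv file. I managed to get the URLs and titles, but I cannot retrieve the descriptions. I have been using the following code: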

import urllib
import requests
from bs4 import BeautifulSoup

# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"

query = "pizza recipe"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"

headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            desc = g.select('span')
            description = g.find('span',{'class':'st'}).text
            item = {
                "title": title,
                "link": link,
                "description": description
            }
            results.append(item)

import pandas as pd
df = pd.DataFrame(results)
df.to_excel("Export.xlsx")

When I run the code, I get the following message:

description = g.find('span',{'class':'st'}).text
AttributeError: 'NoneType' object has no attribute 'text'

Basically, the field is empty.
Can anyone help me with this line so that I can get all the information from the snippet?

python

Source: https://stackoverflow.com/questions/62047405/serp-scraping-with-beautiful-soup

3 Answers


8xiog9wr #1

The description is not inside div class="r". It is under div class="s".
So change the description line to:

description = g.find_next_sibling("div", class_='s').find('span',{'class':'st'}).text

Starting from the current element, this finds the next sibling div with class="s".
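The one-liner still raises the same AttributeError whenever a result has no sibling div or no span inside it. A minimal sketch of a guarded version follows; the SAMPLE_HTML string and the extract_results helper are made up for illustration and only mimic the old div.r / div.s layout described in this answer, not Google's current markup.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption, not real Google output): a div.r holding
# the title and link, followed by a sibling div.s holding span.st.
SAMPLE_HTML = """
<div class="g">
  <div class="r"><a href="https://example.com/a"><h3>First</h3></a></div>
  <div class="s"><span class="st">First description</span></div>
</div>
<div class="g">
  <div class="r"><a href="https://example.com/b"><h3>Second</h3></a></div>
</div>
"""


def extract_results(html):
    """Collect title, link and description, tolerating a missing description."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for g in soup.find_all("div", class_="r"):
        anchor = g.find("a")
        if anchor is None:
            continue
        # find_next_sibling returns None when no matching sibling exists,
        # so check each step before touching .text.
        sibling = g.find_next_sibling("div", class_="s")
        span = sibling.find("span", class_="st") if sibling is not None else None
        results.append({
            "title": g.find("h3").text,
            "link": anchor["href"],
            "description": span.text if span is not None else "",
        })
    return results


results = extract_results(SAMPLE_HTML)
```

Here the second sample result has no description block, so its description ends up as an empty string instead of stopping the loop.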


x4shl7ld #2

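Try the select_one() or select() bs4 methods. They are more flexible and easier to read. See the CSS selectors reference.
Also, you can pass the query as URL params, because requests takes care of the encoding for you, like this: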

# instead of this:
query = "pizza recipe"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"

# try to use this:
params = {
  'q': 'fus ro dah', # query
  'hl': 'en'
}

requests.get('https://google.com/search', params=params)
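
To make the difference concrete, here is a small sketch that builds the query string with the standard library's urlencode. It is only meant to show the kind of encoding that passing params takes care of; the queries are arbitrary examples.

```python
from urllib.parse import urlencode

# Spaces become "+" and the pairs are joined with "&".
params = {"q": "pizza recipe", "hl": "en"}
url = f"https://google.com/search?{urlencode(params)}"
print(url)

# A query containing "&" shows why replacing spaces by hand is fragile:
# the ampersand has to be percent-encoded or it would start a new parameter.
tricky = urlencode({"q": "fish & chips"})
print(tricky)
```

With query.replace(' ', '+') alone, the second query would be sent as q=fish+&+chips and the part after the ampersand would no longer belong to q.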

If you want to write to .csv, use .to_csv() rather than .to_excel().
To drop the pandas index column, pass index=False, e.g. df.to_csv('FILE_NAME', index=False).
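A quick way to see what index=False changes is to call to_csv() without a file name, in which case pandas returns the CSV text instead of writing a file. The one-row frame below is made-up data for illustration.

```python
import pandas as pd

# Made-up single row, just to compare the two header lines.
df = pd.DataFrame([{"title": "A", "link": "https://example.com/a"}])

with_index = df.to_csv()                # no path given, so the CSV text is returned
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])       # header starts with an empty index column
print(without_index.splitlines()[0])    # header contains only the data columns
```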
Code and example in the online IDE:

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
  "User-agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  'q': 'fus ro dah', # query
  'hl': 'en'
}

resp = requests.get("https://google.com/search", headers=headers, params=params)

# Initialised up front so the export below still runs on a non-200 response.
results = []

if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, "html.parser")

    for result in soup.select('.tF2Cxc'):
      title = result.select_one('.DKV0Md').text
      link = result.select_one('.yuRUbf a')['href']
      snippet = result.select_one('#rso .lyLwlc').text

      item = {
        "title": title,
        "link": link,
        "description": snippet
      }

      results.append(item)

df = pd.DataFrame(results)
df.to_csv("BS4_Export.csv", index=False)

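Alternatively, you can use the Google Organic Results API from SerpApi to do the same thing.
The difference in your case is that you do not need to figure out which selectors to use, or why they do not work when they should, because that has already been done for the end user.
Code to integrate: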

from serpapi import GoogleSearch
import os
import pandas as pd

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "fus ro dah",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

data = []

for result in results['organic_results']:
  title = result['title']
  link = result['link']
  snippet = result['snippet']

  data.append({
    "title": title,
    "link": link,
    "snippet": snippet
  })

df = pd.DataFrame(data)  # export the collected rows, not the raw API response
df.to_csv("SerpApi_Export.csv", index=False)
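
The loop above indexes result['snippet'] directly, which would raise a KeyError if an entry came back without that key. Whether that happens for a given query is an assumption here, not something this answer states. A minimal defensive sketch, using a hand-made dictionary in place of the real search.get_dict() response and a hypothetical to_rows helper:

```python
# Hand-made stand-in for the dictionary returned by search.get_dict();
# the second entry deliberately has no "snippet" key.
sample_results = {
    "organic_results": [
        {"title": "First", "link": "https://example.com/a", "snippet": "Some text"},
        {"title": "Second", "link": "https://example.com/b"},
    ]
}


def to_rows(results):
    """Flatten organic results into rows, using "" for any missing field."""
    rows = []
    for result in results.get("organic_results", []):
        rows.append({
            "title": result.get("title", ""),
            "link": result.get("link", ""),
            "snippet": result.get("snippet", ""),
        })
    return rows


rows = to_rows(sample_results)
```

The resulting rows list can be passed to pd.DataFrame in the same way as data above.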

I wrote a more detailed blog post on how to scrape Google Organic Results.
Disclaimer: I work for SerpApi.


ckx4rj1h #3

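Parsing Google's pages may not be the best approach, because Google's HTML markup changes frequently.
The Google SERP API (Custom Search) is not recommended either: its results are not fresh, and it sometimes returns results that are irrelevant for long-tail keywords (I guess the API is not smart enough, haha).
If you still want to access fresh Google SERPs from Python through an API, I think you should consider a third-party API (such as serpapi or thruuu).
If you only want the final CSV, you can try the free app "scohalo serp analyzer". It can scrape Google search results and export the data to CSV.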

