html Scraping Google Images with Python 3 (requests + BeautifulSoup)

798qvoo8 posted on 2022-11-27 in Python

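I want to bulk-download images using Google Image Search.
My first approach, saving the page source to a file and then reading it with open(), works fine, but I would like to get the image URLs just by running the script and changing the keyword.
First approach: go to the image search (https://www.google.no/search?q=tower&client=opera&hs=UNl&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM5fnf4_zKAhWIJJoKHYUdBg4Q_AUIBygB&biw=1920&bih=982). View the page source in the browser and save it as an HTML file. When the script reads that HTML file with open(), it works as expected and I get a neat list of the URLs of all the images on the search page. This is what the commented-out open() line in the script does (uncomment it to test).
However, if I fetch the page with requests.get(), as the line right below it does, I get a *different* HTML document that does not contain the full image URLs, so I cannot extract them.
Please help me extract the correct image URLs.
Edit: link to the tower.html I am using: https://www.dropbox.com/s/yy39w1oc8sjkp3u/tower.html?dl=0
Here is the code I have written so far: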

import requests
from bs4 import BeautifulSoup

# define the url to be scraped
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# top line is using the attached "tower.html" as source, bottom line is using the url. The html file contains the source of the above url.
#page = open('tower.html', 'r').read()
page = requests.get(url).text

# parse the text as html
soup = BeautifulSoup(page, 'html.parser')

# iterate on all "a" elements.
for raw_link in soup.find_all('a'):
    link = raw_link.get('href')
    # keep only links that are strings and contain "imgurl" (the other links on the page are not interesting)
    if isinstance(link, str) and 'imgurl' in link:
        # print the part of the link between the first "=" and the next "&", which is the actual url of the image
        print(link.split('=')[1].split('&')[0])

jum4pzuy1#

You should be aware of this:

# http://www.google.com/robots.txt

User-agent: *
Disallow: /search
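
To see what these two rules mean for a scraper, the excerpt can be fed to the standard library's urllib.robotparser. This is only a sketch: it parses the two lines quoted above instead of fetching the live robots.txt, and the example URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, parsed locally (no network access).
rules = ["User-agent: *", "Disallow: /search"]

parser = RobotFileParser()
parser.parse(rules)

# Illustrative URLs; only the /search rule comes from the excerpt.
search_allowed = parser.can_fetch("*", "https://www.google.com/search?q=tower&tbm=isch")
other_allowed = parser.can_fetch("*", "https://www.google.com/example-page")

print(search_allowed)
print(other_allowed)
```

Under these two rules alone, any path starting with /search is off limits for every user agent, which covers the image search URL used in the question.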

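Before answering, I should point out that Google relies heavily on scripts. You are most likely getting a different result because the page you fetch via requests does nothing with the scripts the page supplies, whereas a web browser executes them when it loads the page.
Here's what I get when I request the URL you supplied:
Nowhere in the text returned by requests.get(url).text is there an 'imgurl'. Your script looks for it as part of its condition, but it is not there.
I do, however, see a bunch of <img> tags whose src attribute is set to an image URL. If that is what you are after, try this script: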

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# page = open('tower.html', 'r').read()
page = requests.get(url).text

soup = BeautifulSoup(page, 'html.parser')

for raw_img in soup.find_all('img'):
  link = raw_img.get('src')
  if link:
    print(link)

It returns the following results:

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyxRHrFw0NM-ZcygiHoVhY6B6dWwhwT4va727380n_IekkU9sC1XSddAg
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRfuhcCcOnC8DmOfweuWMKj3cTKXHS74XFh9GYAPhpD0OhGiCB7Z-gidkVk
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSOBZ9iFTXR8sGYkjWwPG41EO5Wlcv2rix0S9Ue1HFcts4VcWMrHkD5y10
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTEAZM3UoqqDCgcn48n8RlhBotSqvDLcE1z11y9n0yFYw4MrUFucPTbQ0Ma
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSJvthsICJuYCKfS1PaKGkhfjETL22gfaPxqUm0C2-LIH9HP58tNap7bwc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQGNtqD1NOwCaEWXZgcY1pPxQsdB8Z2uLGmiIcLLou6F_1c55zylpMWvSo
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSdRxvQjm4KWaxhAnJx2GNwTybrtUYCcb_sPoQLyAde2KMBUhR-65cm55I
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQLVqQ7HLzD7C-mZYQyrwBIUjBRl8okRDcDoeQE-AZ2FR0zCPUfZwQ8Q20
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQHNByVCZzjSuMXMd-OV7RZI0Pj7fk93jVKSVs7YYgc_MsQqKu2v0EP1M0
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcS_RUkfpGZ1xJ2_7DCGPommRiIZOcXRi-63KIE70BHOb6uRk232TZJdGzc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSxv4ckWM6eg_BtQlSkFP9hjRB6yPNn1pRyThz3D8MMaLVoPbryrqiMBvlZ
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQWv_dHMr5ZQzOj8Ort1gItvLgVKLvgm9qaSOi4Uomy13-gWZNcfk8UNO8
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRRwzRc9BJpBQyqLNwR6HZ_oPfU1xKDh63mdfZZKV2lo1JWcztBluOrkt_o
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQdGCT2h_O16OptH7OofZHNvtUhDdGxOHz2n8mRp78Xk-Oy3rndZ88r7ZA
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRnmn9diX3Q08e_wpwOwn0N7L1QpnBep1DbUFXq0PbnkYXfO0wBy6fkpZY
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSaP9Ok5n6dL5K1yKXw0TtPd14taoQ0r3HDEwU5F9mOEGdvcIB0ajyqXGE
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTcyaCvbXLYRtFspKBe18Yy5WZ_1tzzeYD8Obb-r4x9Yi6YZw83SfdOF5fm
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnS1qCjeYrbUtDSUNcRhkdO3fc3LTtN8KaQm-rFnbj_JagQEPJRGM-DnY0
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSiX_elwJQXGlToaEhFD5j2dBkP70PYDmA5stig29DC5maNhbfG76aDOyGh
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQb3ughdUcPUgWAF6SkPFnyiJhe9Eb-NLbEZl_r7Pvt4B3mZN1SVGv0J-s
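
The URLs above are thumbnails. When the HTML does contain imgurl links, as in the saved tower.html from the question, the parameter can be read with urllib.parse instead of splitting on "=" and "&", which breaks as soon as the image URL itself contains one of those characters. A minimal sketch, assuming the href carries imgurl as an ordinary query parameter; the sample href and the helper name extract_imgurl are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def extract_imgurl(href):
    """Return the decoded imgurl query parameter of a link, or None."""
    query = parse_qs(urlparse(href).query)
    values = query.get("imgurl")
    return values[0] if values else None


# Made-up href in the shape the question describes.
sample_href = (
    "/imgres?imgurl=https%3A%2F%2Fexample.com%2Ftower.jpg"
    "&imgrefurl=https%3A%2F%2Fexample.com%2F&h=600&w=400"
)
print(extract_imgurl(sample_href))
```

parse_qs also undoes the percent-encoding, so the result can be passed straight to a download call.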

zpgglvta2#

You can find the image URL through the 'data-src' or 'src' attribute.

import logging
import os
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

_logger = logging.getLogger(__name__)

REQUEST_HEADER = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

def get_images_new(prod_id, name, header, **kw):
    # name, header and kw are not used in this excerpt
    i = 1
    man_code = "apple"  # anything you want to search for
    url = "https://www.google.com.au/search?q=%s&source=lnms&tbm=isch" % man_code
    _logger.info("Search URL: %s", url)
    response = urlopen(Request(url, headers=REQUEST_HEADER))
    html = response.read().decode('utf-8')
    soup = BeautifulSoup(html, "html.parser")
    image_elements = soup.find_all("img", {"class": "rg_i Q4LuWd"})
    path = "/your/directory/" + str(prod_id)  # target directory for this product
    for img in image_elements:
        # read the URL from 'data-src'; img.get('src') is the alternative
        temp = img.get('data-src')
        if temp and i < 7:  # download at most six images
            os.makedirs(path, exist_ok=True)
            _logger.info("Downloading image: %s", temp)
            # the file name is the running index, e.g. 1.png
            with open(os.path.join(path, str(i) + ".png"), 'wb') as imagefile:
                imagefile.write(urlopen(Request(temp, headers=REQUEST_HEADER)).read())
            i += 1
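
The 'data-src' or 'src' lookup can also be tried offline on a small piece of HTML. The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the sample markup, the example.com URLs, and the class name ImgCollector are made up for illustration.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect image URLs, preferring data-src over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        url = attributes.get("data-src") or attributes.get("src")
        # Skip inline placeholders such as data:image/gif;base64,...
        if url and url.startswith("http"):
            self.urls.append(url)


# Made-up markup: one lazy-loaded image, one plain image, one placeholder.
sample_html = """
<img class="rg_i Q4LuWd" data-src="https://example.com/thumb1.jpg" src="data:image/gif;base64,R0lGOD">
<img class="rg_i Q4LuWd" src="https://example.com/thumb2.jpg">
<img src="data:image/gif;base64,R0lGOD">
"""

collector = ImgCollector()
collector.feed(sample_html)
print(collector.urls)
```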

8e2ybdfx3#

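You can use regular expressions to extract Google Images data: the data you need is rendered dynamically, but it can be found in the inline JSON.
To locate it, search the page source (Ctrl+U) for the title of the first image. If a match sits inside a <script> element, it is most likely inline JSON.
To find the original images, we first need to find the thumbnails and then remove them from the partially parsed inline JSON, which gives an easier way to parse out the original-resolution images: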

# https://regex101.com/r/SxwJsW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                       str(matched_google_image_data))).split(", ")
    
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
    
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
    
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)
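
The double unicode-escape decoding in the thumbnails line is easier to follow on a single value. The sketch below assumes, as the code above suggests, that the URLs arrive with JSON escapes such as \u003d and that str() on the list of matches has doubled every backslash; the sample string and the helper name decode_twice are made up for illustration.

```python
# Made-up thumbnail match, shaped like a value taken from str(list_of_matches):
# each backslash of the original JSON escape appears twice.
raw_match = r"https://encrypted-tbn0.gstatic.com/images?q\\u003dtbn:EXAMPLE\\u0026usqp\\u003dCAU"


def decode_twice(text):
    """Apply unicode-escape decoding twice, as the list comprehension above does."""
    # First pass: "\\u003d" (two backslashes) becomes "\u003d" (one backslash).
    once = bytes(text, "ascii").decode("unicode-escape")
    # Second pass: "\u003d" becomes "=" and "\u0026" becomes "&".
    return bytes(once, "ascii").decode("unicode-escape")


decoded = decode_twice(raw_match)
print(decoded)
```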

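Unfortunately, this approach cannot find all the images, because more of them are added to the page as you scroll. If you need to collect all of them, you have to use browser automation such as selenium or playwright, unless you want to reverse-engineer the page.
There is an "ijn" URL parameter that defines the page number to fetch (greater than or equal to 0). It is used together with a pagination token that is also located in the inline JSON.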
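
As a rough sketch of how "ijn" fits into the request, the URLs for the first few pages could be built as shown below. It only constructs the query strings with the same parameters used in this answer; the pagination token is not handled, and whether Google returns further results for "ijn" alone is an untested assumption. The helper name build_page_url is made up.

```python
from urllib.parse import urlencode


def build_page_url(query, page):
    """Build an image-search URL for a zero-based page number."""
    if page < 0:
        raise ValueError("ijn must be greater than or equal to 0")
    params = {"q": query, "tbm": "isch", "hl": "en", "gl": "us", "ijn": page}
    return "https://google.com/search?" + urlencode(params)


page_urls = [build_page_url("tower", page) for page in range(3)]
for page_url in page_urls:
    print(page_url)
```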
Check the code in the online IDE.

import requests, re, json, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}

google_images = []

params = {    
    "q": "tower",              # search query
    "tbm": "isch",             # image results
    "hl": "en",                # language of the search
    "gl": "us"                 # country where the search comes from
}
    
html = requests.get("https://google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
    
all_script_tags = soup.select("script")
    
# https://regex101.com/r/RPIbXK/1
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
    
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
      
# https://regex101.com/r/NRKEmV/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)
    
# https://regex101.com/r/SxwJsW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                       str(matched_google_image_data))).split(", ")
    
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
    
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))
    
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)
    
full_res_images = [
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]
        
for index, (metadata, thumbnail, original) in enumerate(zip(soup.select('.isv-r.PNCib.MSM1fd.BUooTd'), thumbnails, full_res_images), start=1):
    google_images.append({
        "title": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["title"],
        "link": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["href"],
        "source": metadata.select_one(".fxgdke").text,
        "thumbnail": thumbnail,
        "original": original
    })

print(json.dumps(google_images, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Eiffel Tower - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Eiffel_Tower",
    "source": "Wikipedia",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTsuYzf9os1Qb1ssPO6fWn-5Jm6ASDXAxUFYG6eJfvmehywH-tJEXDW0t7XLR3-i8cNd-0&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/85/Tour_Eiffel_Wikimedia_Commons_%28cropped%29.jpg/640px-Tour_Eiffel_Wikimedia_Commons_%28cropped%29.jpg"
  },
  {
    "title": "tower | architecture | Britannica",
    "link": "https://www.britannica.com/technology/tower",
    "source": "Encyclopedia Britannica",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8EsWofNiFTe6alwRlwXVR64RdWTG2fuBQ0z1FX4tg3HbL7Mxxvz6GnG1rGZQA8glVNA4&usqp=CAU",
    "original": "https://cdn.britannica.com/51/94351-050-86B70FE1/Leaning-Tower-of-Pisa-Italy.jpg"
  },
  {
    "title": "Tower - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Tower",
    "source": "Wikipedia",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT3L9LA0VamqmevhCtkrHZvM9MlBf9EjtTT7KhyzRP3zi3BmuCOmn0QFQG42xFfWljcsho&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/3/3e/Tokyo_Sky_Tree_2012.JPG"
  },
  # ...
]
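
Since the question is about bulk downloading, the "original" URLs collected above still have to be saved to disk. A minimal sketch using only the standard library; the function name download_images, the numbered file names, and the ".jpg" fallback for URLs without an extension are assumptions made for illustration, not part of the code above.

```python
import os
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def download_images(urls, out_dir, user_agent="Mozilla/5.0"):
    """Save each URL as <index><extension> in out_dir and return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        # Reuse the extension from the URL path; fall back to .jpg (assumption).
        extension = os.path.splitext(urlparse(url).path)[1] or ".jpg"
        target = os.path.join(out_dir, f"{index}{extension}")
        request = Request(url, headers={"User-Agent": user_agent})
        with urlopen(request, timeout=30) as response, open(target, "wb") as image_file:
            image_file.write(response.read())
        saved.append(target)
    return saved
```

With the list built above, it could be called as download_images([item["original"] for item in google_images], "images"). Some hosts may reject the request or respond slowly, so a real run would probably want error handling around urlopen.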

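You can also use the Google Images API from SerpApi. It is a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create a parser and maintain it.
Simple code example: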

from serpapi import GoogleSearch
import os, json

image_results = []
   
# search query parameters
params = {
    "engine": "google",               # search engine. Google, Bing, Yahoo, Naver, Baidu...
    "q": "tower",                     # search query
    "tbm": "isch",                    # image results
    "num": "100",                     # number of images per page
    "ijn": 0,                         # page number: 0 -> first page, 1 -> second...
    "api_key": os.getenv("API_KEY")   # your serpapi api key
                                      # other query parameters: hl (lang), gl (country), etc  
}
    
search = GoogleSearch(params)         # where data extraction happens
    
images_is_present = True
while images_is_present:
    results = search.get_dict()       # JSON -> Python dictionary
    
    # checks for "Google hasn't returned any results for this query."
    if "error" not in results:
        for image in results["images_results"]:
            if image["original"] not in image_results:
                image_results.append(image["original"])

        # update to the next page
        params["ijn"] += 1
    else:
        images_is_present = False
        print(results["error"])

print(json.dumps(image_results, indent=2))

Output:

[
  "https://cdn.rt.emap.com/wp-content/uploads/sites/4/2022/08/10084135/shutterstock-woods-bagot-rough-site-for-leadenhall-tower.jpg",
  "https://dynamic-media-cdn.tripadvisor.com/media/photo-o/1c/60/ff/c5/ambuluwawa-tower-is-the.jpg?w=1200&h=-1&s=1",
  "https://cdn11.bigcommerce.com/s-bf3bb/product_images/uploaded_images/find-your-nearest-cell-tower-in-five-minutes-or-less.jpeg",
  "https://s3.amazonaws.com/reuniontower/Reunion-Tower-Exterior-Skyline.jpg",
  "https://assets2.rockpapershotgun.com/minecraft-avengers-tower.jpg/BROK/resize/1920x1920%3E/format/jpg/quality/80/minecraft-avengers-tower.jpg",
  "https://images.adsttc.com/media/images/52ab/5834/e8e4/4e0f/3700/002e/large_jpg/PERTAMINA_1_Tower_from_Roundabout.jpg?1386960835",
  "https://awoiaf.westeros.org/images/7/78/The_tower_of_joy_by_henning.jpg",
  "https://eu-assets.simpleview-europe.com/plymouth2016/imageresizer/?image=%2Fdmsimgs%2Fsmeatontower3_606363908.PNG&action=ProductDetailNew",
  # ...
]

If you need a more detailed explanation of the code, see the blog post Scrape and download Google Images with Python.
