Scraping images from Amazon with Python 3 and BeautifulSoup

fcg9iug3 · posted 2023-02-17 in Python

I need to scrape the main image from Amazon product pages. I store the ASINs in a list and build each product page URL in a loop. I tried to scrape the image, but couldn't get it to work. This is the code I tried:

import re
import sys
import urllib.request
import warnings

import requests
from bs4 import BeautifulSoup as bsoup
from requests_html import HTMLSession

# declare a session object
session = HTMLSession()

# ignore warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

urls = ['https://www.amazon.it/gp/bestsellers/apparel/', 'https://www.amazon.it/gp/bestsellers/electronics/', 'https://www.amazon.it/gp/bestsellers/books/']
asins = []
for url in urls:
    content = requests.get(url).content
    decoded_content = content.decode()
    # extend the list instead of overwriting it, so ASINs from every URL are kept
    asins += re.findall(r'/[^/]+/dp/([^\"?]+)', decoded_content)

The ASIN number will be between the dp/ and another /.
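For illustration, a quick sketch of that pattern on a made-up link (real hrefs may carry extra /ref=... segments after the ASIN):

import re

link = '/Some-Product/dp/B08N5WRWNW?th=1'  # hypothetical link, not a real product
print(re.findall(r'/[^/]+/dp/([^\"?]+)', link))  # ['B08N5WRWNW']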

for asin in asins:
    site = 'https://www.amazon.it/'
    start = 'dp/'
    end = '/'
    url = site + start + asin + end
    resp1 = requests.get(url).content

    soup = bsoup(resp1, "html.parser")
    body = soup.find("body")
    imgtag = soup.find("img", {"id": "landingImage"})
    imageurl = dict(imgtag.attrs)["src"]
    resp2 = urllib.request.urlopen(imageurl)
pu82cl6c 1#

The problem is that the image is loaded dynamically. By inspecting the page, and thanks to the BeautifulSoup documentation, I was able to scrape all the needed images for a given product.

Fetch the page for a given link

I have a class for storing the data, so I save the page content on the instance...

import urllib.request
from bs4 import BeautifulSoup

def take_page(self, url_page):
    # download the product page and keep its HTML on the instance
    req = urllib.request.Request(
        url_page,
        data=None
    )
    f = urllib.request.urlopen(req)
    page = f.read().decode('utf-8')
    self.page = page

Scrape the images

The simple method below returns the first (and smallest) image:

import json

def take_image(self):
    soup = BeautifulSoup(self.page, 'html.parser')
    img_div = soup.find(id="imgTagWrapperId")

    imgs_str = img_div.img.get('data-a-dynamic-image')  # a string in JSON format

    # convert it to a dictionary
    imgs_dict = json.loads(imgs_str)
    # each key in the dictionary is an image link, and the value gives its size
    # (print the whole dictionary to inspect it)
    num_element = 0
    first_link = list(imgs_dict.keys())[num_element]
    return first_link

You can then adapt these methods as needed; I think that is all you need to fix your code.
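For instance, a minimal sketch of such a holder class (the name AmazonProduct and the placeholder ASIN are assumptions; the answer does not show its own class):

class AmazonProduct:
    # attach the two functions defined above as methods;
    # they communicate through the self.page attribute
    take_page = take_page
    take_image = take_image

product = AmazonProduct()
product.take_page('https://www.amazon.it/dp/B00ABC1234/')  # placeholder ASIN
print(product.take_image())  # URL of the first (smallest) image

To pick the largest image instead, you could sort the keys of imgs_dict by their size values rather than taking the first one.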

klh5stk1 2#

A code example to view "all" the img elements on the page:

import requests
from bs4 import BeautifulSoup

for asin in asins:
    site = 'https://www.amazon.it/'
    start = 'dp/'
    end = '/'
    url = site + start + asin + end
    print(url)
    resp1 = requests.get(url).content

    soup = BeautifulSoup(resp1, "html.parser")
    for i in soup.find_all("img"):
        print(i)
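
Building on that loop, a hedged sketch: instead of printing every tag, you could keep only the img elements carrying the data-a-dynamic-image attribute, which (as the first answer shows) is where the main product image lives:

# assuming 'soup' from the loop above
for i in soup.find_all("img"):
    if i.get("data-a-dynamic-image"):
        # this attribute holds a JSON mapping of image URL -> size,
        # as parsed in the first answer
        print(i.get("src"))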

2j4z5cfb 3#

The proper way to do this is through an Amazon Affiliate API account, but here is an approach if you don't have one. Below is up-to-date code using ScraperAPI, lxml's cssselect and PIL.
The key parts are dom.cssselect to grab the image from its element on the page, a request proxy, and using PIL to save the image correctly. Tested on books; other page types will need a higher-level element.

from io import BytesIO

import requests
from lxml.html import fromstring
from PIL import Image, UnidentifiedImageError

PROXY = '...'  # your ScraperAPI proxy prefix; see the linked gist

def save_img(url, name):
    response = requests.get(PROXY + url, stream=True)
    out_path = f'static/bookimg/{name}.jpg'
    try:
        i = Image.open(BytesIO(response.content))
        i.save(out_path)
    except (UnidentifiedImageError, OSError) as e:
        print(e)

def get_img_by_asin(asin, save_name):
    url = PROXY + f'https://www.amazon.co.uk/dp/{asin}/'
    print(url)
    html = requests.get(url).content
    dom = fromstring(html)
    try:
        # the last <img> inside the canvas holds the full-size image
        img = dom.cssselect("#ebooks-img-canvas img")[-1]
        save_img(img.get('src'), save_name)
    except IndexError:
        print('No image or bad response')
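
Hypothetical usage (the ASIN is a placeholder; assumes PROXY is set and a static/bookimg/ directory exists):

get_img_by_asin('B00ABC1234', 'my_book_cover')  # writes static/bookimg/my_book_cover.jpg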

https://gist.github.com/fmalina/03c84100e84ecc2ae2cd23d60e11959e
