Scraping a website with Python using its JSON (application/ld+json) data

6kkfgxo0  posted on 2023-01-22 in Python

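I am trying to get the price of an item from the website at the URL below. However, I ran into some problems when looking at the page source.
The URL is: https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love
The part of the page source that I am interested in is the following (I guess):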

<script type="application/ld+json">
    [{

"@context":"http://schema.org",
"@type":"Product",
"productID":"25372685655708131",
"name":"LOVE bracelet, small model",
"description":"#LOVE# bracelet, small model, yellow gold 750/1000. Supplied with a screwdriver. Width: 3.65 mm (for size 17). Now available in a slimmer version, Cartier continues to write the story of the #LOVE# bracelet. Same design, same oval shape, same story: a timeless – yet slightly slimmer – creation which is fastened using a screwdriver. The closure is designed with a functional screw on one side of the bracelet and a hinge on the other. To determine the size of your #LOVE# bracelet, measure your wrist, adding one centimetre to your size for a tighter fit, or two centimetres for a looser fit.",
"image":["https://www.cartier.com/variants/images/25372685655708131/img1/w960.jpg"],
"offers": 
[{"@type":"Offer","availability":"http://schema.org/InStock","priceCurrency":"GBP","price":"4100","sku":"0400574782829","url":"https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html"}]}]
    </script>

I tried the following:

import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
import pandas as pd

data = {'url':[],'offers_price':[]}

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data['offers']['price'])

if __name__ == '__main__':

    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']

    with Pool(processes=4) as pool:
            for url, price in pool.imap_unordered(get_price, urls):
                    data['offers_price'].append(price)
                    data['url'].append(url)
    print(data)

But it did not work. How would you approach this?

aor9mmx1  #1

I was able to get the price, but I took it from the product-price tag instead:

import json
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

data = {'url': [], 'offers_price': []}

def get_price(url):
    # Fetch the page with a browser-like User-Agent and parse the HTML.
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    # The last <product-price> element stores its data as JSON in the data-model attribute.
    model = json.loads(soup.find_all('product-price')[-1]['data-model'])
    return url, int(model['fullPrice'])

if __name__ == '__main__':

    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']

    # Fetch the URLs in parallel; results arrive in completion order.
    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            data['offers_price'].append(price)
            data['url'].append(url)
    print(data)

Output:

{'url': ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love'], 'offers_price': [4100]}
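
As an aside on the original attempt: the JSON-LD block quoted in the question is a list containing one Product object, and its "offers" value is itself a list. If the request really does return that block, data['offers']['price'] would raise a TypeError, because a list cannot be indexed with a string. That is one likely reason the attempt failed, though I have not confirmed it against the live page. Below is a minimal sketch that parses an abridged copy of the quoted block. The helper name extract_offer_price is made up for illustration, and the sketch assumes the price is a whole number, as in the sample.

```python
import json

# Abridged copy of the JSON-LD block quoted in the question
# (description, image and offer url omitted for brevity).
LD_JSON = '''
[{
"@context":"http://schema.org",
"@type":"Product",
"productID":"25372685655708131",
"name":"LOVE bracelet, small model",
"offers":
[{"@type":"Offer","availability":"http://schema.org/InStock","priceCurrency":"GBP","price":"4100","sku":"0400574782829"}]}]
'''


def extract_offer_price(ld_json_text):
    """Return (currency, price) for the first Product offer, or None.

    Hypothetical helper: accepts a top-level list or a single object,
    and an "offers" value that is either a list or a single object.
    """
    parsed = json.loads(ld_json_text)
    items = parsed if isinstance(parsed, list) else [parsed]
    for item in items:
        if not isinstance(item, dict) or item.get("@type") != "Product":
            continue
        offers = item.get("offers", [])
        if isinstance(offers, dict):
            offers = [offers]
        for offer in offers:
            if "price" in offer:
                # Assumes a whole-number price string such as "4100".
                return offer.get("priceCurrency"), int(offer["price"])
    return None


print(extract_offer_price(LD_JSON))  # ('GBP', 4100)
```

In get_price, the text of the relevant script tag would be passed to this helper in place of the direct data['offers']['price'] lookup. Searching for the object with "@type": "Product" also avoids depending on it being the last script tag on the page.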

By the way, are you sure you want to append the URL and the price? I think you should do this instead:

data['offers_price'] = price
data['url'] = url
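
To make the difference between the two forms concrete, here is a small sketch with made-up URLs and prices (they are placeholders, not real data). Appending keeps one entry per URL, while plain assignment keeps only the value from the last iteration, so which one is right depends on whether more than one URL will be scraped.

```python
# Made-up results standing in for what get_price would return.
results = [("https://example.com/a", 100), ("https://example.com/b", 200)]

appended = {'url': [], 'offers_price': []}
assigned = {'url': [], 'offers_price': []}

for url, price in results:
    # Appending grows the lists: one entry per scraped URL.
    appended['url'].append(url)
    appended['offers_price'].append(price)
    # Assignment replaces the previous value on every iteration.
    assigned['url'] = url
    assigned['offers_price'] = price

print(appended)
print(assigned)
```

With a single URL, as in the question, both forms hold the same information. The appended form is the one that passes straight into pandas.DataFrame(data) as two columns.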
