Why can't I scrape a certain section of a website with Python, using either Scrapy or bs4?

deikduxw, posted on 2022-11-09 in Python

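I am trying to scrape the following website: https://oxolabs.eu/#portfolio
The information I want to scrape is the companies' URLs from the portfolio section. I first tried Scrapy, but it returns this (the site is crawled, but nothing is scraped):
2022-07-28 11:46:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://oxolabs.eu/?status=funded#portfolio> (referer: None)
2022-07-28 11:46:03 [scrapy.core.engine] INFO: Closing spider (finished)
BeautifulSoup returned every URL except the ones in the portfolio section.
Can anyone explain why that section is not being scraped, and how I can scrape it?
My BeautifulSoup script: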

from bs4 import BeautifulSoup
import requests

url = "https://oxolabs.eu/?status=funded#portfolio"
ua={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
r = requests.get(url, headers=ua, verify=False)
soup = BeautifulSoup(r.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

I am also attaching the script I used with Scrapy:

import scrapy

class StupsbSpider(scrapy.Spider):
    name = 'stupsb'
    allowed_domains = ['oxolabs.eu']
    start_urls = ['https://oxolabs.eu/?status=funded#portfolio']

    def parse(self, response):
        startups = response.xpath("//section[@class='oxo-section oxo-portfolio']")
        for startup in startups:
            # name = startup.xpath(".//a[@class='portfolio-entry-media-link']/@title").getall(),
            # industry = startup.xpath(".//div[@class='text-block-6']//text()").get(),
            url = startup.xpath("//section[@class='oxo-section oxo-portfolio']//@href").getall()
            yield{
                'url' : url,
            }

eagi6jfj #1

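The data you need is loaded dynamically from an API using JavaScript, so you are trying to grab links that have not been loaded into the DOM yet. If you want to scrape the rendered page, I would consider using Selenium as a headless scraper.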
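To illustrate why a plain HTTP fetch misses those links, here is a minimal sketch using only the standard library. The markup in RAW_HTML is hypothetical (it is not copied from the real page) and simply stands in for a server response whose portfolio container is still empty because a script is expected to fill it in later. LinkCollector and static_links are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical markup, NOT the real page: the portfolio section is empty
# in the raw response because JavaScript would populate it in a browser.
RAW_HTML = """
<html><body>
  <a href="/about">About</a>
  <section class="oxo-section oxo-portfolio"></section>
  <script src="/app.js"></script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag present in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def static_links(html):
    """Return the links a non-JavaScript parser can see in the given HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Only the link that exists in the static markup is found.
print(static_links(RAW_HTML))
```

A parser such as BeautifulSoup or Scrapy's selectors works on the same raw text, so under this assumption it would likewise find nothing inside the empty section.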
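That said, you don't always need to scrape a page to get its data. If it were me, I would simply use requests on this link:
https://api.oxoservices.eu/api/v1/startups?site=labs&startup_status=funded
You can then set the startup_status query-string parameter to funded, accelerating, or exited. The data you are looking for comes back already formatted and without any restrictions, and you can pull the images or any other data you need from the JSON payload.
As a starter example: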

import json
import requests

resp = requests.get('https://api.oxoservices.eu/api/v1/startups?site=labs&startup_status=funded')

json_resp = json.loads(resp.text)

for company in json_resp['data']:
    print(json.dumps(company, indent=4))
    print()

This will give you a list of startups, each of which looks like this:

{
    "id": 1047,
    "name": "Betme",
    "photo": {
        "id": "d800cf0b-7772-4f9a-a7fc-3563976aa292",
        "filename": "6f85f02d55c7db098a2cd141bf2b4c60.png",
        "mime": "image/png",
        "type": "photo",
        "size": 47951,
        "url": "/attachments/d800cf0b-7772-4f9a-a7fc-3563976aa292",
        "created_at": "2021-04-01T03:46:01.000000Z"
    },
    "photo_id": "d800cf0b-7772-4f9a-a7fc-3563976aa292",
    "cover": null,
    "cover_id": null,
    "focus_id": 25,
    "focus": {
        "id": 25,
        "name": "E-Sport/E-Gaming",
        "color": "rgb(138, 102, 73)",
        "is_active": true,
        "created_at": "2019-09-23T16:50:43.000000Z",
        "updated_at": null
    },
    "startup_stage_id": 1,
    "website": "https://www.betmegaming.com",
    "video_id": null,
    "summary": "A Betme egy applik\u00e1ci\u00f3 form\u00e1j\u00e1ban \u00faj\u00edtja meg az e-gaming vil\u00e1g\u00e1t. K\u00f6z\u00f6ss\u00e9gi megold\u00e1sainak k\u00f6sz\u00f6nhet\u0151en a j\u00e1t\u00e9kosok p\u00e9nzkereseti lehet\u0151s\u00e9ghez jutnak.",
    "video_type_id": "1",
    "startup_status": {
        "id": 5,
        "key": "funded",
        "name": "Funded"
    },
    "startup_investment_type": {
        "id": 3,
        "key": "seed",
        "name": "Seed"
    },
    "startup_valuation_basis": null,
    "raised_type": {
        "id": 1,
        "key": "none",
        "name": "Not seeking"
    },
    "is_active": false,
    "irr": 0,
    "created_at": "2020-07-23T16:40:13.000000Z"
}

In general, working with data like this is more efficient and simpler, because it is already in a structured format.
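Since the question asks for the companies' URLs specifically, here is a rough sketch of pulling just the "website" field out of a payload shaped like the record above. The helper names build_url and extract_websites are made up for illustration, the second sample record is hypothetical, and the three status values are the ones mentioned in this answer; none of this has been checked against the live API.

```python
from urllib.parse import urlencode

API_BASE = "https://api.oxoservices.eu/api/v1/startups"
# Status values as listed in the answer; assumed to be accepted by the API.
STATUSES = ("funded", "accelerating", "exited")


def build_url(status):
    """Build the API URL for one startup_status value."""
    query = urlencode({"site": "labs", "startup_status": status})
    return f"{API_BASE}?{query}"


def extract_websites(payload):
    """Return the non-empty 'website' values from a decoded JSON payload."""
    return [c["website"] for c in payload.get("data", []) if c.get("website")]


# Trimmed copy of the record shown above, plus one hypothetical record
# without a website to show that empty values are skipped.
sample = {
    "data": [
        {"id": 1047, "name": "Betme", "website": "https://www.betmegaming.com"},
        {"id": 0, "name": "hypothetical-entry", "website": None},
    ]
}

print(build_url("funded"))
print(extract_websites(sample))
```

With requests installed, the same helper could be applied to each status in turn, for example extract_websites(requests.get(build_url(status)).json()) for every status in STATUSES.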


ie3xauqp #2

JavaScript only loads the images, while the rest of the required data is static.

Example:

import scrapy
class StupsbSpider(scrapy.Spider):
    name = 'stupsb'
    start_urls = ['https://oxolabs.eu/?status=funded#portfolio']

    def parse(self, response):
        startups = response.xpath('//*[@class="oxo-grid grid-25"]/a/@href')
        for startup in startups:
            yield{
                'url' : startup.get()
            }

Output:

{'url': 'https://www.linkedin.com/pub/peter-oszk%c3%b3/25/705/3b3'}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': 'https://www.linkedin.com/in/rita-j%C3%A1noska/'}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': 'https://www.linkedin.com/in/gergely-balogh-1bbb3573/'}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': 'https://www.linkedin.com/in/orsolya-csetri-940b5721/'}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': ''}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': 'https://www.linkedin.com/in/marai-m%C3%B3nika-klaudia-048973193/'}
2022-07-28 16:37:02 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-28 16:37:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
 'downloader/response_status_count/200': 1,
 'item_scraped_count': 6,
