Scrapy: error when scraping the e-commerce website daraz.pk

Posted by nnt7mjpx on 2022-11-09 in Other

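I am trying to scrape daraz.pk but I am running into the error below. The spider scrapes every value on the page up to the last one; that one comes back as None, and the spider then raises "'NoneType' object is not iterable". I tried exception handling, but it made no difference. I would appreciate any help; my code is below. I am using Selenium together with Scrapy to get each item's description from its item page.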

import scrapy
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from ..items import EcomItem
class DarazSpider(scrapy.Spider):
    name = 'daraz'
    def start_requests(self):
        path = 'C:\Program Files (x86)\chromedriver.exe'
        driver = Chrome(executable_path=path)
        driver.get('https://www.daraz.pk/')
        electronics = driver.find_element(By.NAME, 'q')
        electronics.send_keys('Books')
        electronics.send_keys(Keys.RETURN)
        link_elements = driver.find_elements(By.XPATH,'/html/body/div[3]/div/div[2]/div/div/div/div[2]/div/div/div/div[2]/div[2]/a[text()]')
        for link_el in link_elements:
            href = link_el.text
            print(href)
    def parse(self, response):
        pass

Here is the error:

Traceback (most recent call last):
    d = crawler.crawl(*args,**kwargs)
  File "C:\Users\Intag\New folder (2)\lib\site-packages\twisted\internet\defer.py", line 1905, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "C:\Users\Intag\New folder (2)\lib\site-packages\twisted\internet\defer.py", line 1815, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status)
--- <exception caught here> ---
  File "C:\Users\Intag\New folder (2)\lib\site-packages\twisted\internet\defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\Intag\New folder (2)\lib\site-packages\scrapy\crawler.py", line 103, in crawl
    start_requests = iter(self.spider.start_requests())
builtins.TypeError: 'NoneType' object is not iterable
2022-08-06 10:29:20 [twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Users\Intag\New folder (2)\lib\site-packages\twisted\internet\defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "C:\Users\Intag\New folder (2)\lib\site-packages\scrapy\crawler.py", line 103, in crawl
    start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable

w46czmvw #1

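You can get the data you need directly from the API. The page's JavaScript loads the data dynamically from an API using a GET request, and the response is JSON. Requesting that endpoint yourself is a simple and robust way to get the data.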
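To make the shape of that JSON concrete, here is a minimal sketch that pulls product URLs out of an already-decoded payload. Only the `mods`, `listItems` and `productUrl` keys come from the spider below; the helper name `extract_product_urls` and the sample payload are invented for illustration, and the real response may contain other fields or a different layout.

```python
import json


def extract_product_urls(payload):
    """Return absolute product URLs from a decoded search response.

    Assumes the shape payload['mods']['listItems'][i]['productUrl'];
    missing keys give an empty list instead of raising.
    """
    items = (payload.get("mods") or {}).get("listItems") or []
    urls = []
    for item in items:
        url = item.get("productUrl")
        if not url:
            continue  # skip entries that carry no link
        # Protocol-relative links ('//host/path') need a scheme prepended.
        urls.append("https:" + url if url.startswith("//") else url)
    return urls


if __name__ == "__main__":
    # Invented sample, not real API output.
    sample = json.loads(
        '{"mods": {"listItems": ['
        '{"productUrl": "//www.daraz.pk/products/sample-i1-s2.html"},'
        '{"name": "no link here"}]}}'
    )
    print(extract_product_urls(sample))
    print(extract_product_urls({"mods": None}))
```

Using `.get` with fallbacks means a missing or null section produces an empty result rather than another `NoneType` error. The complete spider follows.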

Example:

import json

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'

    # Be polite: one request at a time, one second apart.
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
    }

    def start_requests(self):
        headers = {
            'content-type': 'application/json',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        }
        # Search URL requested by the page's JavaScript (note ajax=true).
        api_url = 'https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1'
        # start_requests must yield (or return an iterable of) requests.
        yield scrapy.Request(
            url=api_url,
            method='GET',
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        # The body is JSON; the product entries are under mods -> listItems.
        resp = json.loads(response.text)
        for item in resp['mods']['listItems']:
            # productUrl has no scheme, so prepend one.
            yield {
                'productUrl': 'https:' + item['productUrl'],
            }


if __name__ == "__main__":
    # CrawlerProcess takes settings, not a spider; pass the spider class to crawl().
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Output:

Crawled (200) <GET https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1> (referer: None)   
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/5-i144834997-s1306536157.html?search=1'}        
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/4-i146864039-s1309826616.html?search=1'}        
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i229320627-s1449691508.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i229571902-s1449944276.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i219883778-s1432847877.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/pmc-nmdcat-nums-agha-khan-2022-i209146784-s1415196801.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/nmdcat-bookmbbscommbbscompkpmc-mdcat-practice-books-2022entry-test-preparation-booksentry-test-booksentry-test-preparation-books-2022guide-for-solved-past-paper-papers-exam-exams-test-tests-book-n-books-bnb-multan-ghar-kitab-mkg-new-fareed-fbc-i276082277-s1491310765.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/tenses-made-easy-by-efzal-anware-mufti-i209992860-s1416720338.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/sk-original-golden-13medical-books-in-urdu-i198834812-s1395012400.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i242170073-s1461239796.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/-i270001029-s1483708982.html?search=1'}
2022-08-06 12:08:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.pk/catalog/?_keyori=ss&ajax=true&clickTrackInfo=textId--2543448522407782846__abId--296224__pvid--721834c6-aa06-4851-a758-c1dceed517aa__matchType--1__srcQuery--None__spellQuery--books&from=suggest_normal&page=1&q=books&spm=a2a0e.home.search.1.35e34937dlzwzf&sugg=books_0_1>
{'productUrl': 'https://www.daraz.pk/products/css-pms-iqra-ud-din-css-o-css-2022-css-2023-i220043944-s1433189818.html?search=1'}

... and so on
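As for the traceback in the question: its last frame is `start_requests = iter(self.spider.start_requests())`. The question's `start_requests` only prints the links and never yields or returns anything, so it returns `None`, and `iter(None)` raises the `TypeError`. The spider above avoids this because its `start_requests` yields a request. The sketch below reproduces the difference in plain Python, without Scrapy or Selenium; the function names are made up for illustration.

```python
def start_requests_without_yield():
    # Mirrors the question's method: does some work, prints, returns nothing.
    print("collected links")  # falls off the end, so the call returns None


def start_requests_with_yield():
    # Mirrors the answer's method: a generator, so calling it returns an iterable.
    yield "https://www.daraz.pk/"


def try_iter(func):
    """Call func and wrap the result in iter(), as the crawler frame in the traceback does."""
    try:
        return list(iter(func()))
    except TypeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(try_iter(start_requests_without_yield))
    print(try_iter(start_requests_with_yield))
```

In a Selenium-based spider, the equivalent fix would be to yield a `scrapy.Request` for each collected link (reading the element's `href` attribute rather than its text) instead of only printing it.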
