Web scraping with Scrapy: where is the output?

Posted by 6rvt4ljy on 2022-11-29 in Other

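I am trying to build a spider that collects information about startups. To do that, I wrote a Python script with Scrapy that should visit the website and store the information in a dictionary. I think the code should work from a logical point of view, but somehow I don't get any output. My code: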

import scrapy

class StartupsSpider(scrapy.Spider):
    name = 'startups'
    #name of the spider

    allowed_domains = ['www.bmwk.de/Navigation/DE/InvestDB/INVEST-DB_Liste/investdb.html']
    #list of allowed domains

    start_urls = ['https://bmwk.de/Navigation/DE/InvestDB/INVEST-DB_Liste/investdb.html']
    #starting url

    def parse(self, response):
        
        startups = response.xpath('//*[contains(@class,"card-link-overlay")]/@href').getall()
        #parse initial start URL for the specific startup URL

        for startup in startups:
            
            absolute_url =  response.urljoin(startup)

            yield scrapy.Request(absolute_url, callback=self.parse_startup)
            #parse the actual startup information

        next_page_url = response.xpath('//*[@class ="pagination-link"]/@href').get()
        #link to next page
        
        absolute_next_page_url = response.urljoin(next_page_url)
        #go through all pages on start URL
        yield scrapy.Request(absolute_next_page_url)
    
    def parse_startup(self, response):
    #get information regarding startup
        startup_name = response.css('h1::text').get()
        startup_hompage = response.xpath('//*[@class="document-info-item"]/a/@href').get()
        startup_description = response.css('div.document-info-item::text')[16].get()
        branche = response.css('div.document-info-item::text')[4].get()
        founded = response.xpath('//*[@class="date"]/text()')[0].getall()
        employees = response.css('div.document-info-item::text')[9].get()
        capital = response.css('div.document-info-item::text')[11].get()
        applied_for_invest = response.xpath('//*[@class="date"]/text()')[1].getall()

        contact_name = response.css('p.card-title-subtitle::text').get()
        contact_phone = response.css('p.tel > span::text').get()
        contact_mail = response.xpath('//*[@class ="person-contact"]/p/a/span/text()').get()
        contact_address_street = response.xpath('//*[@class ="adr"]/text()').get()
        contact_address_plz = response.xpath('//*[@class ="locality"]/text()').getall()
        contact_state = response.xpath('//*[@class ="country-name"]/text()').get()

        yield{'Startup':startup_name,
              'Homepage': startup_hompage,
              'Description': startup_description,
              'Branche': branche,
              'Gründungsdatum': founded,
              'Anzahl Mitarbeiter':employees,
              'Kapital Bedarf':capital,
              'Datum des Förderbescheids':applied_for_invest,
              'Contact': contact_name,
              'Telefon':contact_phone,
              'E-Mail':contact_mail,
              'Adresse': contact_address_street + contact_address_plz + contact_state}

Answer #1, by k10s72fa

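1. You get no output because allowed_domains is wrong: it must contain domain names only (for example bmwk.de), not a URL with a path. Otherwise the requests for the detail pages are filtered out as off-site.
2. In the last line (Adresse) you are trying to concatenate list and str values, which raises a TypeError.
3. Your pagination link is wrong: on the first page you get the next page, but on the second page you get the previous page.
4. You don't do any error checking. On some pages some of the values are missing, and indexing the result by position then raises an error.

I fixed numbers 1, 2 and 3, but you will have to fix number 4 yourself.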
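
To illustrate point 1, here is a minimal sketch of why an allowed_domains entry that contains a path can never match. The function is_allowed is a hypothetical helper written for this example; it is a simplified approximation of Scrapy's off-site filtering, not Scrapy's actual code. The idea is that only the hostname of each request is compared against the allowed entries.

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Simplified approximation of an off-site check (hypothetical helper).

    A URL passes when its hostname equals an allowed entry or is a
    subdomain of one. Only the hostname is compared, never the path.
    """
    host = urlparse(url).hostname or ""
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


# URL taken from start_urls in the question.
url = "https://bmwk.de/Navigation/DE/InvestDB/INVEST-DB_Liste/investdb.html"

# Entry from the question: it contains a path, so it never equals a hostname.
broken = ["www.bmwk.de/Navigation/DE/InvestDB/INVEST-DB_Liste/investdb.html"]
# Entry from the corrected spider: a bare domain name.
fixed = ["bmwk.de"]

print(is_allowed(url, broken))  # False
print(is_allowed(url, fixed))   # True
```

With a bare domain such as bmwk.de, subdomains like www.bmwk.de pass this check as well.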

import scrapy

class StartupsSpider(scrapy.Spider):
    # name of the spider
    name = 'startups'

    # list of allowed domains
    allowed_domains = ['bmwk.de']

    # starting url
    start_urls = ['https://bmwk.de/Navigation/DE/InvestDB/INVEST-DB_Liste/investdb.html']
    
    def parse(self, response):
        # parse initial start URL for the specific startup URL
        startups = response.xpath('//*[contains(@class,"card-link-overlay")]/@href').getall()

        for startup in startups:
            absolute_url = response.urljoin(startup)

            # parse the actual startup information
            yield scrapy.Request(absolute_url, callback=self.parse_startup)

        # link to next page
        next_page_url = response.xpath('(//*[@class ="pagination-link"])[last()]/@href').get()
        if next_page_url:
            # go through all pages on start URL
            absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(absolute_next_page_url)

    def parse_startup(self, response):
        # get information regarding startup
        startup_name = response.css('h1::text').get()
        startup_hompage = response.xpath('//*[@class="document-info-item"]/a/@href').get()
        # for example for some of the pages you'll get an error here:
        startup_description = response.css('div.document-info-item::text')[16].get()
        branche = response.css('div.document-info-item::text')[4].get()
        founded = response.xpath('//*[@class="date"]/text()')[0].getall()
        employees = response.css('div.document-info-item::text')[9].get()
        capital = response.css('div.document-info-item::text')[11].get()
        applied_for_invest = response.xpath('//*[@class="date"]/text()')[1].getall()

        contact_name = response.css('p.card-title-subtitle::text').get()
        contact_phone = response.css('p.tel > span::text').get()
        contact_mail = response.xpath('//*[@class ="person-contact"]/p/a/span/text()').get()
        Adresse = ' '.join(response.xpath('//*[@class ="address"]//text()').getall())

        yield {'Startup': startup_name,
               'Homepage': startup_hompage,
               'Description': startup_description,
               'Branche': branche,
               'Gründungsdatum': founded,
               'Anzahl Mitarbeiter': employees,
               'Kapital Bedarf': capital,
               'Datum des Förderbescheids': applied_for_invest,
               'Contact': contact_name,
               'Telefon': contact_phone,
               'E-Mail': contact_mail,
               'Adresse': Adresse}
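
For point 4, one possible approach (a sketch, not part of the spider above) is to extract the texts once with .getall() and read positions through a small helper that returns a default instead of raising. The helper name pick and the sample list are made up for illustration.

```python
def pick(values, index, default=None):
    """Return values[index], or default when that position does not exist."""
    try:
        return values[index]
    except IndexError:
        return default


# Stand-in for: response.css('div.document-info-item::text').getall()
# (sample data invented for this example)
texts = ["Software", "12"]

print(pick(texts, 0))         # Software
print(pick(texts, 16))        # None
print(pick(texts, 16, "n/a")) # n/a
```

Inside parse_startup this could be used as texts = response.css('div.document-info-item::text').getall(), followed by startup_description = pick(texts, 16), so that pages with fewer entries yield None for that field instead of stopping the callback with an error.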

Answer #2, by xlpyo6sf

You need to run this at the command prompt: scrapy crawl startups -o filename.json (or filename.csv)
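
As an alternative to passing -o on every run, the output feed can, as far as I know, also be configured through the FEEDS setting (available in Scrapy 2.1 and later). The fragment below is a sketch for the project's settings.py; the file name startups.json is only an example.

```python
# settings.py (fragment): write scraped items to a JSON file on every crawl.
# The file name "startups.json" is an arbitrary example.
FEEDS = {
    "startups.json": {
        "format": "json",    # or "csv"
        "encoding": "utf8",  # keeps umlauts such as in "Gründungsdatum" readable
    },
}
```

With this in place, running scrapy crawl startups should write the items to startups.json without the -o option.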
