I am trying to build a spider that collects information about startups. To that end I wrote a Python script with Scrapy that is supposed to visit the site and store the information in a dictionary. From a logical point of view I think the code should work, but somehow I get no output at all. My code:
import scrapy

class StartupsSpider(scrapy.Spider):
    # name of the spider
    name = 'startups'
    # list of allowed domains
    allowed_domains = ['www.bmwk.de/Navigation/DE/InvestDB/INVEST-DB_Liste/investdb.html']
    # starting url
    start_urls = ['https://bmwk.de/Navigation/DE/InvestDB/INVEST-DB_Liste/investdb.html']

    def parse(self, response):
        # parse initial start URL for the specific startup URLs
        startups = response.xpath('//*[contains(@class,"card-link-overlay")]/@href').getall()
        for startup in startups:
            absolute_url = response.urljoin(startup)
            # parse the actual startup information
            yield scrapy.Request(absolute_url, callback=self.parse_startup)

        # link to next page
        next_page_url = response.xpath('//*[@class ="pagination-link"]/@href').get()
        absolute_next_page_url = response.urljoin(next_page_url)
        # go through all pages on start URL
        yield scrapy.Request(absolute_next_page_url)

    def parse_startup(self, response):
        # get information regarding the startup
        startup_name = response.css('h1::text').get()
        startup_hompage = response.xpath('//*[@class="document-info-item"]/a/@href').get()
        startup_description = response.css('div.document-info-item::text')[16].get()
        branche = response.css('div.document-info-item::text')[4].get()
        founded = response.xpath('//*[@class="date"]/text()')[0].getall()
        employees = response.css('div.document-info-item::text')[9].get()
        capital = response.css('div.document-info-item::text')[11].get()
        applied_for_invest = response.xpath('//*[@class="date"]/text()')[1].getall()
        contact_name = response.css('p.card-title-subtitle::text').get()
        contact_phone = response.css('p.tel > span::text').get()
        contact_mail = response.xpath('//*[@class ="person-contact"]/p/a/span/text()').get()
        contact_address_street = response.xpath('//*[@class ="adr"]/text()').get()
        contact_address_plz = response.xpath('//*[@class ="locality"]/text()').getall()
        contact_state = response.xpath('//*[@class ="country-name"]/text()').get()

        yield {'Startup': startup_name,
               'Homepage': startup_hompage,
               'Description': startup_description,
               'Branche': branche,
               'Gründungsdatum': founded,
               'Anzahl Mitarbeiter': employees,
               'Kapital Bedarf': capital,
               'Datum des Förderbescheids': applied_for_invest,
               'Contact': contact_name,
               'Telefon': contact_phone,
               'E-Mail': contact_mail,
               'Adresse': contact_address_street + contact_address_plz + contact_state}
2 Answers
Answer 1 (k10s72fa):
1. You get no output because allowed_domains is wrong.
2. In the last line (Adresse) you try to concatenate list and str types, which raises an error.
3. Your pagination link is wrong: on the first page you pick up the link to the next page, but on the second page you pick up the link to the previous page.
4. You do no error checking. On some pages some of the values you extract are None, and indexing into them causes an error.

I fixed numbers 1, 2 and 3 (see the sketch below), but you will have to fix number 4 yourself.
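A minimal sketch of what fixes 1–3 (plus a basic guard in the spirit of point 4) might look like; the next-page selector and the way the address parts are joined are assumptions for illustration and have not been verified against the live page:

import scrapy

class StartupsSpider(scrapy.Spider):
    name = 'startups'
    # fix 1: allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['bmwk.de']
    start_urls = ['https://bmwk.de/Navigation/DE/InvestDB/INVEST-DB_Liste/investdb.html']

    def parse(self, response):
        for startup in response.xpath('//*[contains(@class,"card-link-overlay")]/@href').getall():
            yield scrapy.Request(response.urljoin(startup), callback=self.parse_startup)

        # fix 3: follow only the "next" pagination link instead of whichever
        # pagination-link happens to come first (this selector is an assumption)
        next_page_url = response.xpath(
            '//*[contains(@class,"pagination-link") and contains(@class,"next")]/@href').get()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))

    def parse_startup(self, response):
        # point 4 (left to you): guard against missing values before indexing
        info = response.css('div.document-info-item::text').getall()
        startup_description = info[16] if len(info) > 16 else None

        street = response.xpath('//*[@class ="adr"]/text()').get(default='')
        plz = response.xpath('//*[@class ="locality"]/text()').getall()  # this is a list
        state = response.xpath('//*[@class ="country-name"]/text()').get(default='')

        yield {
            'Startup': response.css('h1::text').get(),
            'Description': startup_description,
            # fix 2: join the list of locality strings with the str parts,
            # so only strings are concatenated
            'Adresse': ' '.join([street] + plz + [state]).strip(),
        }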
Answer 2 (xlpyo6sf):
You need to run it from the command prompt: scrapy crawl <spider name> -o <filename>.json (or .csv).
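For example, with the spider name 'startups' from the code above (the output filename is just a placeholder):

scrapy crawl startups -o startups.json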