通过Scrapy中的Meta从连续的并行解析函数中产生值

在我的scrapy代码中，我试图从议会的网站上获得以下数字，其中列出了所有议会成员（MP）。打开每个MP的链接，我正在进行并行请求以获得我试图计数的数字。我打算在MP的名称和政党的公司中获得以下三个数字
这是我想知道的数字
1.有多少法案提案是每位议员都签署的
1.有多少个问题提案是每个议员都签名的
1.每个议员在议会上发言多少次
为了计算和产量有多少法案有每个议员都有他们的签名，我试图写一个刮刀上的议员，其中有3层：

从列出所有议员的链接开始
从（1）访问每个MP的单独页面，其中显示上面定义的三个信息
3a）请求议案页面并使用len函数统计议案页面的数量3b）请求问题页面并使用len函数统计问题页面的数量3c）请求发言页面并使用len函数统计发言页面的数量
我想要的：我想在同一原始中产生3a，3b，3c的询问，其中包括MP的姓名和政党
*问题1）输出到csv时，只创建了speech count、name、part字段，没有显示bill proposal和question proposal字段
*问题2）每个MP有两个空值，我猜这对应于我上面在问题1中描述的值
*问题3）什么是更好的方法来重新构造我的代码，以便在同一行中输出三个值，而不是为我正在抓取的每个值打印三次MP

的数据

from scrapy import Spider
from scrapy.http import Request

import logging

class MvSpider(Spider):
    name = 'mv2'
    allowed_domains = ['tbmm.gov.tr']
    start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']

    def parse(self, response):
        mv_list =  mv_list = response.xpath("//ul[@class='list-group list-group-flush']") #taking all MPs listed

        for mv in mv_list:
            name = mv.xpath("./li/div/div/a/text()").get() # MP's name taken
            party = mv.xpath("./li/div/div[@class='col-md-4 text-right']/text()").get().strip() #MP's party name taken
            partial_link = mv.xpath('.//div[@class="col-md-8"]/a/@href').get()
            full_link = response.urljoin(partial_link)

            yield Request(full_link, callback = self.mv_analysis, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })

    def mv_analysis(self, response):
        name = response.meta.get('name')
        party = response.meta.get('party')

        billprop_link_path = response.xpath(".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        billprop_link = response.urljoin(billprop_link_path)

        questionprop_link_path = response.xpath(".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()
        questionprop_link = response.urljoin(questionprop_link_path)

        speech_link_path = response.xpath(".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()
        speech_link = response.urljoin(speech_link_path)

        yield Request(billprop_link, callback = self.bill_prop_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })  #number of bill proposals to be requested

        yield Request(questionprop_link, callback = self.quest_prop_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        }) #number of question propoesals to be requested

        yield Request(speech_link, callback = self.speech_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })  #number of speeches to be requested



# COUNTING FUNCTIONS

    def bill_prop_counter(self,response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        billproposals = response.xpath("//tr[@valign='TOP']")

        yield  { 'bill_prop_count': len(billproposals),
                'name': name,
                'party': party}

    def quest_prop_counter(self, response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        researchproposals = response.xpath("//tr[@valign='TOP']")

        yield {'res_prop_count': len(researchproposals),
               'name': name,
               'party': party}

    def speech_counter(self, response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        speeches = response.xpath("//tr[@valign='TOP']")

        yield { 'speech_count' : len(speeches),
               'name': name,
               'party': party}

字符串

这是因为你产生的是dict而不是item对象，所以蜘蛛引擎不会有一个你想默认的字段指南。
为了使csv输出字段bill_prop_count和res_prop_count，您应该在代码中进行以下更改：
1 -创建一个包含所有所需字段的基本项对象-您可以在scrapy项目的items.py文件中创建此对象：

from scrapy import Item, Field

class MvItem(Item):
    name = Field()
    party = Field()
    bill_prop_count = Field()
    res_prop_count = Field()
    speech_count = Field()

字符串
2 -将创建的item对象导入到spider代码中，并生成用dict填充的项目，而不是单个dict：

from your_project.items import MvItem

...

# COUNTING FUNCTIONS
def bill_prop_counter(self,response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    billproposals = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{ 'bill_prop_count': len(billproposals),
            'name': name,
            'party': party})

def quest_prop_counter(self, response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    researchproposals = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{'res_prop_count': len(researchproposals),
           'name': name,
           'party': party})

def speech_counter(self, response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    speeches = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{ 'speech_count' : len(speeches),
           'name': name,
           'party': party})

型
输出csv将包含项目的所有可能列：

bill_prop_count,name,party,res_prop_count,speech_count
,Abdullah DOĞRU,AK Parti,,11
,Mehmet Şükrü ERDİNÇ,AK Parti,,3
,Muharrem VARLI,MHP,,13
,Muharrem VARLI,MHP,0,
,Jülide SARIEROĞLU,AK Parti,,3
,İbrahim Halil FIRAT,AK Parti,,7
20,Burhanettin BULUT,CHP,,
,Ünal DEMİRTAŞ,CHP,,22
...

型
现在如果你想让所有三个计数在同一行，你必须改变你的蜘蛛的设计。可能一个计数函数在meta属性中传递项目。

通过Scrapy中的Meta从连续的并行解析函数中产生值

1条答案

相关问题

热门标签

最新问答