Scrapy spider: building a JSON file with the right selectors

Asked by v2g6jxz6 on 2022-11-09

I am using CSS class selectors to help me with a spider. In the Scrapy shell, if I run the following command, I get output with all the elements I need:

scrapy shell "https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b"
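For example, in the shell, CSS selectors like the ones used in the spider below return the data I am after:

response.css(".print-acta-temp::text").get()
response.css("table.acta-table tbody tr").getall()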

I modified the spider based on the advice I received:

import scrapy

class ActaSpider(scrapy.Spider):
    name = 'acta_spider'
    start_urls = [
        'https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']

    def parse(self, response):

        print("[ PARSE START ]")

        # season header: text like "TEMPORADA ..." with the prefix stripped
        temporada = response.css(".print-acta-temp::text").get()
        temporada = temporada.replace('TEMPORADA ', '')
        print(temporada)

        # competition header, split into words so that category, division and group can be indexed later
        acta_comp = response.css(".print-acta-comp::text").get()
        acta_comp_llista = acta_comp.split(' ')
        print(acta_comp_llista)

        # one dict per row of every acta table: name and profile link
        for actaelements in response.css('table.acta-table tbody tr'):
            yield {
                'name': actaelements.css('a::text').get(),
                'link': actaelements.css('a::attr(href)').get(default='Link Error'),
            }

Now I need to build the JSON file below from the information in the 12 tables that the web page is built from:

{
  "DadesPartit":
    {
      "Temporada": temporada,
      "Categoria": acta_comp_llista[1],
      "Divisio": acta_comp_llista[2],
      "Grup": acta_comp_llista[6],
      "Jornada": 28
    },
  "TitularsCasa":
    [
      {
        "Nom": "IGNACIO",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Link": "https://.."
      },
      {
        "Nom": "JAIME",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Link": "https://.."
      },
      {
        "Nom": "BRUNO",
        "Cognom":"FERRÉ CORREA",
        "Link": "https://.."
      }

    ],
  "SuplentsCasa":
    [
      {
        "Nom": " MARC",
        "Cognom":"GIMÉNEZ ABELLA",
        "Link": "https://.."
      }
    ],
  "CosTecnicCasa":
    [
      {
        "Nom": " JORDI",
        "Cognom":"LORENTE VILLENA",
        "Llicencia": "E"
      }
    ],
  "TargetesCasa": 
    [
      {
        "Nom": "IGNACIO",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Tipus": "Groga",
        "Minut": 65
      }
    ],
  "Arbitres":
    [
      {
        "Nom": "ALEJANDRO",
        "Cognom":"ALVAREZ MOLINA",
        "Delegacio": "Barcelona1"

      }
    ],
  "Gols":
    [
      {
        "Nom": "NATXO",
        "Cognom":"MONTERO RAYA",
        "Minut": 5,
        "Tipus": "Gol de penal"
      }
    ],
  "Estadi":
    {
      "Nom": "CAMP DE FUTBOL COL·LEGI LA SALLE BONANOVA",
      "Direccio":"C/ DE SANT JOAN DE LA SALLE, 33, BARCELONA"
    },
    "TitularsFora":
    [
      {
        "Nom": "MARTI",
        "Cognom":"MOLINA MARTIMPE",
        "Link": "https://.."
      },
      {
        "Nom": " XAVIER",
        "Cognom":"MORA AMOR",
        "Link": "https://.."
      },
      {
        "Nom": " IVAN",
        "Cognom":"ARRANZ MORALES",
        "Link": "https://.."
      }

    ],
  "SuplentsFora":
    [
      {
        "Nom": "OLIVER",
        "Cognom":"ALCAZAR SANCHEZ",
        "Link": "https://.."
      }
    ],
  "CosTecnicFora":
    [
      {
        "Nom": "RAFAEL",
        "Cognom":"ESPIGARES MARTINEZ",
        "Llicencia": "D"
      }
    ],
  "TargetesFora": 
    [
      {
        "Nom": "ORIOL",
        "Cognom":"ALCOBA LAGE",
        "Tipus": "Groga",
        "Minut": 34
      }
    ]
}

I would like to know how to build it.
Thanks, Joan

vmpqdwk3

It is much simpler to use requests and pandas. You can do the following:

import requests as r
import pandas as pd

# download the page and let pandas parse every <table> into a DataFrame
a = r.get("https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b")
table_fb = pd.read_html(a.content)

You just need to index table_fb to get the table you want.
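For example, a minimal sketch of pulling one table out of table_fb and turning its rows into JSON-friendly dicts (the index 1 is only an assumption, inspect the list to find the table you need):

# table_fb is a list of DataFrames, one per <table> element on the page
print(len(table_fb))                     # how many tables pandas found

titulars_casa = table_fb[1]              # index assumed, check it against the live page
print(titulars_casa.head())

# rows as a list of dicts, ready to slot into the JSON structure from the question
titulars_casa_records = titulars_casa.to_dict(orient="records")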
Here is another option:

import scrapy
import pandas as pd

class stack(scrapy.Spider):

    name = 'test'
    start_urls = ["https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b"]    
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url, 
                callback=self.parse
            )
    def parse(self, response):
        # pandas parses every <table> on the page into a DataFrame (14 tables here)
        tables = pd.read_html(response.text)
        yield {f'table{i}': table for i, table in enumerate(tables, start=1)}
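One caveat: the values yielded above are pandas DataFrames, which the JSON feed exporter will not serialize out of the box, so an export such as scrapy crawl test -O acta.json would fail on them. A minimal sketch of one way around that is to convert each table to a list of row dicts before yielding:

        # inside parse(): DataFrame -> list of plain dicts, one per row
        yield {f'table{i}': table.to_dict(orient='records')
               for i, table in enumerate(tables, start=1)}

From those per-table records you can then pick out the relevant tables and columns and assemble the nested structure (DadesPartit, TitularsCasa, and so on) shown in the question.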
