Scrapy: why is only the first item of each table parsed?

n8ghc7c1  posted on 2022-11-09 in Other

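I'm new to Python and web scraping and would like some advice. I've written a spider, but the JSON output contains only the first element of each table. Can anyone tell me why?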

import scrapy

class ActaSpider(scrapy.Spider):
    name = 'acta_spider'
    start_urls = ['https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']

    def parse(self, response):
        for actaelements in response.css('table.acta-table'):
            try:
                yield {
                    'name': actaelements.css('a::text').get(),
                    'link': actaelements.css('a').attrib['href'],
                }
            except:
                yield {
                    'name': actaelements.css('a::text').get(),
                    'link': 'Link Error',
                }

My end goal is to produce a JSON file containing the necessary information for each table:

{
  "DadesPartit":
    {
      "Temporada": "2021-2022",
      "Categoria": "Cadet",
      "Divisio": "Primera",
      "Grup": 2,
      "Jornada": 28
    },
  "TitularsCasa":
    [
      {
        "Nom": "IGNACIO",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Link": "https://.."
      },
      {
        "Nom": "JAIME",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Link": "https://.."
      },
      {
        "Nom": "BRUNO",
        "Cognom":"FERRÉ CORREA",
        "Link": "https://.."
      }

    ],
  "SuplentsCasa":
    [
      {
        "Nom": " MARC",
        "Cognom":"GIMÉNEZ ABELLA",
        "Link": "https://.."
      }
    ],
  "CosTecnicCasa":
    [
      {
        "Nom": " JORDI",
        "Cognom":"LORENTE VILLENA",
        "Llicencia": "E"
      }
    ],
  "TargetesCasa": 
    [
      {
        "Nom": "IGNACIO",
        "Cognom":"FERNÁNDEZ ARTOLA",
        "Tipus": "Groga",
        "Minut": 65
      }
    ],
  "Arbitres":
    [
      {
        "Nom": " ALEJANDRO",
        "Cognom":"ALVAREZ MOLINA",
        "Delegacio": "Barcelona1"

      }
    ],
  "Gols":
    [
      {
        "Nom": "NATXO",
        "Cognom":"MONTERO RAYA",
        "Minut": 5,
        "Tipus": "Gol de penal"
      }
    ],
  "Estadi":
    {
      "Nom": "CAMP DE FUTBOL COL·LEGI LA SALLE BONANOVA",
      "Direccio":"C/ DE SANT JOAN DE LA SALLE, 33, BARCELONA"
    },
  "TitularsFora":
    [
      {
        "Nom": "MARTI",
        "Cognom":"MOLINA MARTIMPE",
        "Link": "https://.."
      },
      {
        "Nom": " XAVIER",
        "Cognom":"MORA AMOR",
        "Link": "https://.."
      },
      {
        "Nom": " IVAN",
        "Cognom":"ARRANZ MORALES",
        "Link": "https://.."
      }

    ],
  "SuplentsFora":
    [
      {
        "Nom": "OLIVER",
        "Cognom":"ALCAZAR SANCHEZ",
        "Link": "https://.."
      }
    ],
  "CosTecnicFora":
    [
      {
        "Nom": " RAFAEL",
        "Cognom":"ESPIGARES MARTINEZ",
        "Llicencia": "D"
      }
    ],
  "TargetesFora": 
    [
      {
        "Nom": " ORIOL",
        "Cognom":"ALCOBA LAGE",
        "Tipus": "Groga",
        "Minut": 34
      }
    ]
}

Thanks, Joan

zpqajqem #1

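A CSS selector returns a list of matching elements. Your selector matches the table elements themselves, so the for loop runs once per table, and .get() returns only the first link inside that table. One small adjustment is to use XPath to select the child elements of each table instead, and your code should work as expected.
Just change the for loop to: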

for actaelements in response.xpath('//table[@class="acta-table"]/*'):

The rest of the code can stay as it is.
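
To see why the choice of loop target matters, here is a minimal sketch that reproduces the behaviour without Scrapy. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the markup is a made-up simplification (two tables, two rows each, rows nested in tbody), not the real page. Under that assumption, looping over tables gives one link per table, while looping over rows gives every link. It also shows that the direct children of a table are tbody elements rather than rows, so if the real page is structured the same way, the XPath may need to be //table[@class="acta-table"]//tr instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page: two tables, two rows each.
HTML = """
<root>
  <table class="acta-table">
    <tbody>
      <tr><td><a href="/p/1">PLAYER ONE</a></td></tr>
      <tr><td><a href="/p/2">PLAYER TWO</a></td></tr>
    </tbody>
  </table>
  <table class="acta-table">
    <tbody>
      <tr><td><a href="/p/3">PLAYER THREE</a></td></tr>
      <tr><td><a href="/p/4">PLAYER FOUR</a></td></tr>
    </tbody>
  </table>
</root>
"""

root = ET.fromstring(HTML)
tables = root.findall(".//table[@class='acta-table']")

# Looping over tables and taking the first link: one result per table.
first_per_table = [t.find(".//a").text for t in tables]

# Looping over rows instead: one result per row.
all_rows = [tr.find(".//a").text for t in tables for tr in t.iter("tr")]

# Direct children of each table (what '/*' selects) are tbody here, not rows.
direct_children = [child.tag for t in tables for child in t]

print(first_per_table)
print(all_rows)
print(direct_children)
```

In Scrapy terms, find() plays the role of .get() (first match only), which is why each table contributed a single item.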

w8f9ii69 #2

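This happens because your CSS selector is wrong: it targets the tables rather than the rows inside them. You can also drop the try/except and give the link a default value for the case where it is None.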

import scrapy

class ActaSpider(scrapy.Spider):
    name = 'acta_spider'
    start_urls = ['https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']

    def parse(self, response):
        for actaelements in response.css('table.acta-table tbody tr'):
            yield {
                'name': actaelements.css('a::text').get(),
                'link': actaelements.css('a::attr(href)').get(default='Link Error'),
            }
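
The spider above yields one flat item per row, whereas the question asks for one list per table. A possible next step, sketched here under assumptions, is to collect the rows of each table first and group them under one key. The helper build_acta and the placeholder section keys are hypothetical. How to derive names such as TitularsCasa from the page, and how to split a name into Nom and Cognom, depends on markup that is not shown here, so both are left out.

```python
import json


def build_acta(sections):
    """Group scraped rows under one key per table.

    sections: iterable of (section_name, rows) pairs, where rows is a list
    of (name, link) tuples as returned by .get() (either value may be None).
    """
    acta = {}
    for section_name, rows in sections:
        entries = []
        for name, link in rows:
            if name is None:
                # Rows without a link text (e.g. header rows) are skipped.
                continue
            entries.append({
                "name": name.strip(),          # drops stray spaces such as " MARC"
                "link": link or "Link Error",  # same default as in the spider above
            })
        acta[section_name] = entries
    return acta


# Inside parse(), the input could be built roughly like this
# (assumption: untested against the real page, keys are placeholders):
#   sections = [
#       (f"table_{i}",
#        [(tr.css('a::text').get(), tr.css('a::attr(href)').get())
#         for tr in table.css('tr')])
#       for i, table in enumerate(response.css('table.acta-table'))
#   ]
#   yield build_acta(sections)

example = build_acta([
    ("SuplentsCasa", [(None, None), (" MARC GIMÉNEZ ABELLA", "/p/1")]),
    ("CosTecnicCasa", [(" JORDI LORENTE VILLENA", None)]),
])
print(json.dumps(example, ensure_ascii=False, indent=2))
```

Yielding a single dictionary for the whole page keeps the output as one JSON object per match report, which is closer to the structure shown in the question.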
