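I'm new to Python and web scraping and would appreciate some advice. I've created a spider, but the JSON output only contains the first element of each table. Can anyone tell me why?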
import scrapy

class ActaSpider(scrapy.Spider):
    name = 'acta_spider'
    start_urls = ['https://www.fcf.cat/acta/2022/futbol-11/cadet-primera-divisio/grup-2/1c/la-salle-bonanova-ce-a/1c/lhospitalet-centre-esports-b']

    def parse(self, response):
        for actaelements in response.css('table.acta-table'):
            try:
                yield {
                    'name': actaelements.css('a::text').get(),
                    'link': actaelements.css('a').attrib['href'],
                }
            except:
                yield {
                    'name': actaelements.css('a::text').get(),
                    'link': 'Link Error',
                }
My end goal is to create a JSON file containing the necessary information for each table:
{
"DadesPartit":
{
"Temporada": "2021-2022",
"Categoria": "Cadet",
"Divisio": "Primera",
"Grup": 2,
"Jornada": 28
},
"TitularsCasa":
[
{
"Nom": "IGNACIO",
"Cognom":"FERNÁNDEZ ARTOLA",
"Link": "https://.."
},
{
"Nom": "JAIME",
"Cognom":"FERNÁNDEZ ARTOLA",
"Link": "https://.."
},
{
"Nom": "BRUNO",
"Cognom":"FERRÉ CORREA",
"Link": "https://.."
}
],
"SuplentsCasa":
[
{
"Nom": " MARC",
"Cognom":"GIMÉNEZ ABELLA",
"Link": "https://.."
}
],
"CosTecnicCasa":
[
{
"Nom": " JORDI",
"Cognom":"LORENTE VILLENA",
"Llicencia": "E"
}
],
"TargetesCasa":
[
{
"Nom": "IGNACIO",
"Cognom":"FERNÁNDEZ ARTOLA",
"Tipus": "Groga",
"Minut": 65
}
],
"Arbitres":
[
{
"Nom": " ALEJANDRO",
"Cognom":"ALVAREZ MOLINA",
"Delegacio": "Barcelona1"
}
],
"Gols":
[
{
"Nom": "NATXO",
"Cognom":"MONTERO RAYA",
"Minut": 5,
"Tipus": "Gol de penal"
}
],
"Estadi":
{
"Nom": "CAMP DE FUTBOL COL·LEGI LA SALLE BONANOVA",
"Direccio":"C/ DE SANT JOAN DE LA SALLE, 33, BARCELONA"
},
"TitularsFora":
[
{
"Nom": "MARTI",
"Cognom":"MOLINA MARTIMPE",
"Link": "https://.."
},
{
"Nom": " XAVIER",
"Cognom":"MORA AMOR",
"Link": "https://.."
},
{
"Nom": " IVAN",
"Cognom":"ARRANZ MORALES",
"Link": "https://.."
}
],
"SuplentsFora":
[
{
"Nom": "OLIVER",
"Cognom":"ALCAZAR SANCHEZ",
"Link": "https://.."
}
],
"CosTecnicFora":
[
{
"Nom": " RAFAEL",
"Cognom":"ESPIGARES MARTINEZ",
"Llicencia": "D"
}
],
"TargetesFora":
[
{
"Nom": " ORIOL",
"Cognom":"ALCOBA LAGE",
"Tipus": "Groga",
"Minut": 34
}
]
}
Thanks, Joan
2 Answers
zpqajqem1#
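A CSS selector returns a list of matching elements. Your query matches only the table elements themselves, so the for loop runs once per table and get() retrieves just the first link in each. A small adjustment is to use XPath to select all of the table's child elements (its rows); your code should then work as expected.
Simply change your for loop to:
The rest of your code should work as intended.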
w8f9ii692#
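That's because your CSS selector is wrong: it targets the tables rather than the items inside them. You can also remove the try/except and give the link a default value when it is None.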