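def parse(self, response):
    collect = False  # True while we are under a wanted heading
    contents = []
    # Walk the direct children of the content div in document order.
    for selector in response.xpath("//div[@class='entry-content']/*"):
        val = selector.xpath("./text()").get()
        # Under a wanted heading, keep every <p> sibling.
        if collect and selector.re('<p'):
            contents.append(val)
            continue
        # An h2/h3 heading switches collection on or off.
        if val and selector.re(r'<h[23]'):
            if "Characteristics" in val or "Diet" in val:
                collect = True
            else:
                collect = False
    yield {"contents": contents}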
1 Answer
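You can try extracting all children of the div and running a regex test on each one to see whether it is an h2 or an h3, then checking whether its text contains "Diet" or "Characteristics". If it does, collect all of the following siblings that are <p> elements.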
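To try the same flag-based walk outside a running spider, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree instead of Scrapy selectors. The SAMPLE markup, the WANTED tuple, and the collect_paragraphs helper are all made up for illustration, and the sketch assumes the snippet is well-formed XML, which real pages often are not. Unlike the spider above, it checks tag names directly rather than using a regex, and it switches collection off at any h2/h3 that does not match.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; the real page layout is an assumption.
SAMPLE = """
<div class="entry-content">
  <p>Intro paragraph.</p>
  <h2>Characteristics</h2>
  <p>Long neck.</p>
  <p>Spotted coat.</p>
  <h3>Habitat</h3>
  <p>Savannah.</p>
  <h2>Diet</h2>
  <p>Acacia leaves.</p>
</div>
"""

WANTED = ("Characteristics", "Diet")  # heading keywords from the answer


def collect_paragraphs(markup, wanted=WANTED):
    """Return the text of <p> children that follow a wanted h2/h3 heading."""
    root = ET.fromstring(markup.strip())
    collect = False
    contents = []
    for child in root:  # direct children only, in document order
        text = (child.text or "").strip()
        if child.tag in ("h2", "h3"):
            # A heading switches collection on or off for the siblings after it.
            collect = any(word in text for word in wanted)
        elif child.tag == "p" and collect and text:
            contents.append(text)
    return contents


print(collect_paragraphs(SAMPLE))
```

With the sample above, the paragraphs under "Characteristics" and "Diet" are kept, while the intro and the "Habitat" paragraph are skipped.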