Extracting text from a site using a Scrapy spider

guykilcj, posted on 2023-02-08 in Other

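I am trying to extract a book's description from the Amazon site. Note: I am using a Scrapy spider. This is the link to the Amazon book: https://www.amazon.com/Local-Woman-Missing-Mary-Kubica/dp/1665068671
This is the div that contains the description text: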

<div aria-expanded="true" class="a-expander-content a-expander-partial-collapse-content a-expander-content-expanded" style="padding-bottom: 20px;">
<p><span class="a-text-bold">MP3 CD Format</span></p>
<p><span class="a-text-bold">People don’t just disappear without a trace…</span></p>
<p class="a-text-bold"><span class="a-text-bold">Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.</span></p>
<p class="a-text-bold"><span class="a-text-bold">Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…</span></p>
<p class="a-text-bold"><span class="a-text-bold">In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried.</span></p>
<p></p>
</div>

This is what I have tried so far:

div = response.css(".a-expander-content.a-expander-partial-collapse-content.a-expander-content-expanded")
description = " ".join([re.sub('<.*?>', '', span) for span in response.css('.a-expander-content span').extract()])

It did not work as expected. If you have any ideas, please share them here. Thanks in advance.

Answer 1, by zour9fqk

Here is the code:

import scrapy


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = ['https://www.amazon.com/dp/1665068671']

    def start_requests(self):
        # Route the start URL to parse_book instead of the default parse().
        yield scrapy.Request(self.start_urls[0], callback=self.parse_book)

    def parse_book(self, response):
        # Select every text node inside the book description expander.
        fragments = response.css('[data-a-expander-name="book_description_expander"] .a-expander-content ::text').getall()
        # Concatenate the fragments with no separator between them.
        description = "".join(fragments)
        yield {"description": description}
Output:
{'description': ' MP3 CD FormatPeople don’t just disappear without a trace…Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried.  '}
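In the output above, the paragraphs run together ("MP3 CD FormatPeople ...") because the text fragments are joined with an empty string. If separated text is preferred, one possible approach is to trim each fragment and join with a single space. The following is only a sketch: `join_text_fragments` is a hypothetical helper that is not part of the original answer, and the sample list merely imitates the shape of what `.getall()` returns.

```python
import re


def join_text_fragments(fragments):
    """Join text nodes into one string, separated by single spaces."""
    # Collapse runs of whitespace inside each fragment and trim the ends.
    cleaned = (re.sub(r"\s+", " ", fragment).strip() for fragment in fragments)
    # Drop fragments that were whitespace only, then join with one space.
    return " ".join(piece for piece in cleaned if piece)


# Hypothetical fragments, shaped like the list returned by `.getall()`.
fragments = [" ", "MP3 CD Format", "People don’t just disappear without a trace…", "  "]
print(join_text_fragments(fragments))
```

In the spider, the same helper could be applied to the list returned by `response.css(...).getall()` in place of `"".join(...)`, assuming single-space separation between paragraphs is acceptable for the use case.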
