Scrapy: How to scrape the text of all elements with an ID containing class_id?

Posted by 3lxsmp7m on 2022-11-09 in Other

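I have the code below, and I think it is close to working. I can get a list of selectors, one for each anchor element whose id contains the string class_id. What I want is the text-node children of all those anchor elements. Can someone tell me how to do that? Thanks.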

import scrapy

# with open('../econ.html', 'r') as f:
#     html_string = f.read()

# Headers sent with the POST request to the class search form
econ_headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://pisa.ucsc.edu',
    'Accept-Language': 'en-us',
    'Host': 'pisa.ucsc.edu',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
    'Referer': 'https://pisa.ucsc.edu/class_search/',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'}

class EconSpider(scrapy.Spider):
    name = "econ"

    def start_requests(self):

        urls = [
            'https://pisa.ucsc.edu/class_search/index.php'
            ]
        for url in urls:
            yield scrapy.Request(
                url=url,
                method="POST",
                headers=econ_headers,
                body='action=results&binds[:term]=2210&binds[:subject]=ECON&binds[:reg_status]=O&rec_start=0&rec_dur=1000',
                callback=self.parse,
            )

    def parse(self, response):
        page = response.url.split("/")[-2]
        # Attempted query: "*::text" is CSS-style syntax, not valid XPath
        print(response.xpath('//a[contains(@id, "class_id")] *::text'))
        filename = f'class-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

Source: https://stackoverflow.com/questions/72521807/how-to-scrape-the-text-of-all-elements-with-an-id-containing-class-id