Scrapy: problem following links

camsedfj · posted 2022-11-09 in Other
def parse(self, response):

    countries = response.xpath('//div[@class="state-names-list-us"]/ul/a')

    for country in countries:
        link = country.xpath(".//@href").get()
        yield response.follow(url=link, callback=self.parse_frame)

def parse_frame(self, response):
    holder = response.xpath('//div[@class ="hs_cos_wrapper hs_cos_wrapper_widget '
                            'hs_cos_wrapper_type_rich_text"]/iframe')
    for hold in holder:
        test = hold.xpath('.//@src').get()
        yield response.follow(url=test)

The parse method gets a link to a page, and parse_frame then uses that link to get another link that contains the information to be scraped.
parse_frame gets the link on the first iteration, but not on the remaining iterations. How should I fix this so that I get the links from every iteration? If you look at the output, it only picked up the link on the first iteration.
2022-07-21 14:18:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.insulators.org/union-directory/mississippi> (referer: ...)
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.hfiunionhall.org': <GET https://www.hfiunionhall.org/pages/localDetails.asp?where=DE&searchType=State>
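The second log line is the telling one: Scrapy's OffsiteMiddleware filtered the request to www.hfiunionhall.org, which usually means that domain is not covered by the spider's allowed_domains. A minimal sketch of that setting is below; the spider class and name are hypothetical, only the allowed_domains line is the point:

import scrapy

class UnionDirectorySpider(scrapy.Spider):  # hypothetical class name
    name = "union_directory"                # hypothetical spider name
    # Listing the iframe's domain as well keeps OffsiteMiddleware from
    # filtering the follow-up requests yielded in parse_frame.
    allowed_domains = ["insulators.org", "hfiunionhall.org"]

With the iframe's domain allowed, those follow-up requests should no longer be dropped.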

6ioyuze2 · Answer 1

The code below isn't pretty, but it gets the job done with a smaller complexity budget:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

# Build the list of US state codes from the 'Alpha code' column of the first table on the StatCan page
states_iso_list = pd.read_html('https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=53971')[0]['Alpha code'].tolist()

with requests.Session() as s:
    for x in states_iso_list:
        # read_html returns every <table> on the per-state page; the first one is printed here
        dfs = pd.read_html(f'https://www.hfiunionhall.org/pages/localDetails.asp?where={x}&searchType=State')
        print(dfs[0])
        # Fetch the same page again to get the full text for anything not inside a table
        r = s.get(f'https://www.hfiunionhall.org/pages/localDetails.asp?where={x}&searchType=State', headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        print(soup.text.strip())  # you can slice & dice the info needed from here
        print('___________________________')
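
If the goal is a single dataset rather than printed output, a short follow-up sketch could collect the per-state tables and write them to one file. It reuses pd and states_iso_list from the snippet above, assumes the first table on each page is the one you want, and the output file name is hypothetical:

all_frames = []
for x in states_iso_list:
    # Same per-state URL as above; the first table is assumed to hold the locals
    df = pd.read_html(f'https://www.hfiunionhall.org/pages/localDetails.asp?where={x}&searchType=State')[0]
    df['state'] = x  # remember which state each row came from
    all_frames.append(df)

result = pd.concat(all_frames, ignore_index=True)
result.to_csv('union_locals.csv', index=False)  # hypothetical output file name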
