Scrapy: problem following links

camsedfj · posted 2022-11-09 in Other
def parse(self, response):

    countries = response.xpath('//div[@class="state-names-list-us"]/ul/a')

    for country in countries:
        link = country.xpath(".//@href").get()
        yield response.follow(url=link, callback=self.parse_frame)

def parse_frame(self, response):
    holder = response.xpath('//div[@class ="hs_cos_wrapper hs_cos_wrapper_widget '
                            'hs_cos_wrapper_type_rich_text"]/iframe')
    for hold in holder:
        test = hold.xpath('.//@src').get()
        yield response.follow(url=test)

The parse method gets a link to a page, and parse_frame then uses that link to get another link that contains the information to be scraped.
parse_frame gets the link on the first iteration, but not on the remaining iterations. How should I fix this so that I get the links from every iteration? If you look at the output, it only picked up the link on the first iteration.
2022-07-21 14:18:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.insulators.org/union-directory/mississippi> (referer: ...)
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.hfiunionhall.org': <GET https://www.hfiunionhall.org/pages/localDetails.asp?where=DE&searchType=State>
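The second log line is the telling one: Scrapy's OffsiteMiddleware filtered the request to www.hfiunionhall.org, which usually means that domain is not covered by the spider's allowed_domains. A minimal sketch of that setting is below; the spider class and name are hypothetical, only the allowed_domains line is the point:

import scrapy

class UnionDirectorySpider(scrapy.Spider):  # hypothetical class name
    name = "union_directory"                # hypothetical spider name
    # Listing the iframe's domain as well keeps OffsiteMiddleware from
    # filtering the follow-up requests yielded in parse_frame.
    allowed_domains = ["insulators.org", "hfiunionhall.org"]

With the iframe's domain allowed, those follow-up requests should no longer be dropped.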

6ioyuze2 · Answer 1

The code below isn't pretty, but it gets the job done with a smaller complexity budget:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

# Build the list of US state codes from the 'Alpha code' column of the first table on the StatCan page
states_iso_list = pd.read_html('https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=53971')[0]['Alpha code'].tolist()

with requests.Session() as s:
    for x in states_iso_list:
        # read_html returns every <table> on the per-state page; the first one is printed here
        dfs = pd.read_html(f'https://www.hfiunionhall.org/pages/localDetails.asp?where={x}&searchType=State')
        print(dfs[0])
        # Fetch the same page again to get the full text for anything not inside a table
        r = s.get(f'https://www.hfiunionhall.org/pages/localDetails.asp?where={x}&searchType=State', headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        print(soup.text.strip())  # you can slice & dice the info needed from here
        print('___________________________')
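
If the goal is a single dataset rather than printed output, a short follow-up sketch could collect the per-state tables and write them to one file. It reuses pd and states_iso_list from the snippet above, assumes the first table on each page is the one you want, and the output file name is hypothetical:

all_frames = []
for x in states_iso_list:
    # Same per-state URL as above; the first table is assumed to hold the locals
    df = pd.read_html(f'https://www.hfiunionhall.org/pages/localDetails.asp?where={x}&searchType=State')[0]
    df['state'] = x  # remember which state each row came from
    all_frames.append(df)

result = pd.concat(all_frames, ignore_index=True)
result.to_csv('union_locals.csv', index=False)  # hypothetical output file name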
