Why do I get an empty list in Scrapy when using response.css?

1bqhqjot  posted on 2022-11-09  in Other

The code I am using is:

import scrapy


class JobSpider(scrapy.Spider):
    name = 'job'

    start_urls = [
        'https://jobs.goodlifefitness.com/listjobs/'
    ]

In the Scrapy shell, I ran the following against that link:

response.css('div.jobTitle a::attr(href)')

and I got an empty list: []

Answer 1 (by vwkv1x7d)

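This happens because the whole page is rendered by JavaScript. Scrapy only downloads the raw HTML and does not execute scripts, so the div.jobTitle elements do not exist yet in the response it parses. If you save the response body to a local file and open it, you will see that about 99% of the HTML is <script> tags. Fortunately, pages like this are easy to scrape with the requests-html library (not to be confused with the requests library), which can render the JavaScript before you query the page.
For example: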
pip install requests-html

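from requests_html import HTMLSession
import json

session = HTMLSession()
full = []
# Walk through pages 1 to 5 of the job listing
for i in range(1, 6):
    r = session.get(f"https://jobs.goodlifefitness.com/listjobs/?pg={i}")
    r.html.render()  # run the page's JavaScript (downloads Chromium on first use)
    lst = r.html.xpath("//div[@class='jobTitle']/a/@href")
    full += lst
# Save the collected links as JSON; the with-block closes the file
with open("links.json", "w") as f:
    json.dump(full, f)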

Output:

['/job/16881922/customer-service-representative-motivator-prince-george-river-oint-landing-prince-george-ca/', '/job/16881921/club-attendant-winnipeg-grant-ark-shopping-centre-winnipeg-ca/', '/job/16881919/sales-fitness-advisor-north-york-dufferin-and-finch-north-york-ca/', '/job/16881920/club-attendant-north-york-dufferin-and-finch-north-york-ca/', '/job/16881918/personal-trainer-regina-victoria-square-regina-ca/', '/job/16878045/customer-service-representative-motivator-mississauga-heartland-town-centre-mississauga-ca/', '/job/16878044/club-attendant-brampton-kingspoint-plaza-brampton-ca/', '/job/16878043/sales-fitness-advisor-vaughan-milani-and-highway-27-vaughan-ca/', '/job/16878042/sales-fitness-advisor-calgary-richmond-square-calgary-ca/', '/job/16878041/sales-fitness-advisor-toronto-yonge-and-st-clair-toronto-ca/', '/job/16878040/customer-service-representative-motivator-burlington-appleby-crossing-burlington-ca/', '/job/16878039/personal-trainer-north-york-yonge-and-finch-north-york-ca/', '/job/16873434/sales-and-service-representative-fitness-coach-whitby-taunton-and-brock-for-women-whitby-ca/', '/job/16873435/senior-fitness-coach-whitby-taunton-and-brock-for-women-whitby-ca/', '/job/16873433/club-attendant-brampton-mclaughlin-corners-west-brampton-ca/', '/job/16870781/personal-trainer-windsor-tecumseh-mall-windsor-ca/', '/job/16870780/fit4less-host-saskatoon-circle-west-plaza-saskatoon-ca/', '/job/16866062/service-technician-facility-kitchener-kitchener-ca/', '/job/16866061/service-technician-facility-mississauga-mississauga-ca/', '/job/16866060/sales-fitness-advisor-edmonton-rabbit-hill-road-edmonton-ca/', '/job/16866059/customer-service-representative-motivator-hamilton-queenston-place-hamilton-ca/', '/job/16866058/fit4less-host-markham-cochrane-markham-ca/', '/job/16866057/director-of-digital-marketing-remote-in-canada-london-ca/', '/job/16863233/group-fitness-instructor-bodycombat-edmonton-edmonton-ca/', 
'/job/16863232/group-fitness-instructor-bodypump-edmonton-edmonton-ca/', '/job/16863231/group-fitness-instructor-bodyattack-edmonton-edmonton-ca/', '/job/16863230/group-fitness-instructor-bodystep-edmonton-edmonton-ca/', '/job/16863228/fit4less-host-north-york-centerpoint-mall-north-york-ca/', '/job/16863227/fit4less-host-oakville-hyde-park-gate-oakville-ca/', '/job/16863226/fitness-manager-kitchener-fairway-plaza-kitchener-ca/', ...
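
Before reaching for a renderer, the diagnosis above can be checked by inspecting the raw HTML that a non-rendering client receives. The following is a minimal sketch using only the standard library. The class name TagAudit, the function audit, and both HTML samples are made up for illustration and are not the site's actual markup.

```python
from html.parser import HTMLParser


class TagAudit(HTMLParser):
    """Counts start tags and records hrefs of <a> inside <div class="jobTitle">."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.job_links = []
        self._div_stack = []  # one bool per open <div>: is it a jobTitle div?

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            self._div_stack.append("jobTitle" in classes)
        elif tag == "a" and any(self._div_stack) and attrs.get("href"):
            # Same idea as the selector div.jobTitle a::attr(href)
            self.job_links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self._div_stack:
            self._div_stack.pop()


def audit(html):
    """Return (tag counts, job links) for an HTML string."""
    parser = TagAudit()
    parser.feed(html)
    parser.close()
    return parser.tag_counts, parser.job_links


# Hypothetical sample of an unrendered, script-driven page.
RAW_HTML = """
<html><head><script src="/app.js"></script><script>var cfg = {};</script></head>
<body><div id="root"></div><script>boot();</script></body></html>
"""

# Hypothetical sample of the same page after JavaScript has run.
RENDERED_HTML = """
<html><body><div id="root">
<div class="jobTitle"><a href="/job/1/example/">Example</a></div>
</div></body></html>
"""

raw_counts, raw_links = audit(RAW_HTML)
rendered_counts, rendered_links = audit(RENDERED_HTML)
print(raw_counts.get("script", 0), raw_links)
print(rendered_links)
```

If the raw body of a real page shows many script tags and no matches for the target elements, as in the first sample, an empty result from response.css is expected rather than a selector mistake.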
Answer 2 (by tp5buhyn)

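I strongly recommend looking at the site's backend API. You can find it with the Chrome developer tools (Network tab) or with a proxy.
This lets you fetch far more data in a single request. Backend APIs usually return JSON objects, which are much easier to work with than digging the data out of an HTML file.
I found the backend API for your specific case and wrote a small script that hopefully does what you want.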

import requests
import json

url = "https://jobsapi-internal.m-cloud.io/api/job?callback=jobsCallback&sortfield=open_date&sortorder=descending&Limit=408&Organization=2239&offset=1"

r = requests.get(url).text

# The response is JSONP, i.e. jobsCallback({...}); keep only the JSON between
# the first "(" and the last ")".
r = r[r.index("(") + 1:r.rindex(")")]
json_obj = json.loads(r)
output = list()

for job in json_obj["queryResult"]:
    output.append(job["title"])

# Amount of jobs scraped

print(len(output))

# The available data of each job

print(json_obj["queryResult"][0].keys())

# All the job titles, as a list

print(output)

Output:

408
dict_keys(['company_name', 'clientid', 'id', 'xc_id', 'sf_id', 'entity_status', 'scout_orgid', 'scout_userid', 'scout_teamid', 'language', 'ats_portalid', 'industry', 'function', 'title', 'ref', 'primary_city', 'primary_state', 'primary_zip', 'primary_country', 'primary_address', 'primary_location', 'addtnl_locations', 'description', 'primary_category', 'addtnl_categories', 'salary', 'job_type', 'travel', 'level', 'relocation', 'education', 'years_experience', 'open_positions', 'brand', 'department', 'shift', 'recruiter', 'parent_category', 'sub_category', 'business_unit', 'is_internal', 'employment_type', 'schedule', 'compliment', 'store_id', 'close_date', 'open_date', 'fndly_url', 'url', 'seo_url', 'location_type', 'importance', 'is_child_job', 'campaign_id', 'campaign_name', 'publish_to_cws', 'hidden', 'job_classifications', 'easy_apply', 'internal_url', 'internal_description', 'multi_select1', 'multi_select2', 'erp_eligible', 'erp_bonus', 'update_date'])
['Customer Service Representative (Motivator) - Prince George River Point Landing', 'Club Attendant - Winnipeg Grant Park Shopping Centre', 'Sales (Fitness Advisor) - North York Dufferin and Finch', 'Club Attendant - North York Dufferin and Finch', 'Personal Trainer - Regina Victoria Square', 'Customer Service Representative (Motivator) - Mississauga Heartland Town Centre', 'Club Attendant - Brampton Kingspoint Plaza', 'Sales (Fitness Advisor) - Vaughan Milani and Highway 27', 'Sales (Fitness Advisor) - Calgary Richmond Square', 'Sales (Fitness Advisor) - Toronto Yonge and St Clair', 'Customer Service Representative (Motivator) - Burlington Appleby Crossing', 'Personal Trainer - North York Yonge and Finch', 'Sales and Service Representative (Fitness Coach) - Whitby Taunton and Brock For Women', 'Senior Fitness Coach - Whitby Taunton and Brock For Women', 'Club Attendant - Brampton McLaughlin Corners West', 'Personal Trainer - Windsor Tecumseh Mall', 'Fit4Less Host - Saskatoon Circle West Plaza', ...]
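
The script above has to strip the jobsCallback(...) wrapper before the body can be parsed as JSON. A slightly more defensive way to do that step is sketched below. The helper names unwrap_jsonp and job_titles are hypothetical, and the sample body is made up to match the shape the script expects ("queryResult" entries with a "title" key), not copied from the API.

```python
import json
import re

# Matches a callback name, "(", the payload, ")" and an optional trailing ";".
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\(\s*(.*)\s*\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(text):
    """Return the parsed JSON payload of a JSONP response body.

    Falls back to parsing the text as plain JSON when no wrapper is found.
    """
    match = _JSONP_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


def job_titles(data):
    """Collect the "title" of every entry under "queryResult"."""
    return [job["title"] for job in data.get("queryResult", [])]


# Made-up body in the same shape as the script above expects.
sample = (
    'jobsCallback({"queryResult": '
    '[{"title": "Club Attendant"}, {"title": "Personal Trainer"}]})'
)
titles = job_titles(unwrap_jsonp(sample))
print(titles)
```

Since the original question is about Scrapy, the same helper could presumably be called as unwrap_jsonp(response.text) inside a spider callback whose start_urls points at the API URL, so no JavaScript rendering would be needed.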
