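A few months ago I followed this Scrapy shell approach to scrape a real estate listings page, and it worked perfectly. I copied my cookie and user-agent text from Firefox (Developer Tools -> Headers) while the target URL was loading, got a successful response (200), and was able to extract items with response.xpath.
For example: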
url = 'https://www.realtor.com/realestateandhomes-search/McLean_VA/type-single-family-home/pg-1?pos=39.126499,-77.43902,38.685678,-76.779841,11&qdm=true'
cookie = '__fp=7387663eca6ba5161d1c58711dd65164; split=n; split_tcv=105; __vst=742f3db3-c514-4032-8650-21d4ccfdd85f; __ssn=a0587b1b-bc15-4e3d-8738-3fc8757071ab; __ssnstarttime=1656813474; criteria=pg%3D1%26sprefix%3D%252Frealestateandhomes-search%26typ%3D1%26area_type%3Dcity%26search_type%3Dcity%26city%3DMcLean%26state_code%3DVA%26state_id%3DVA%26lat%3D38.9435449%26long%3D-77.1929134%26county_fips%3D51059%26county_fips_multi%3D51059%26loc%3DMcLean%252C%2520VA%26locSlug%3DMcLean_VA%26county_needed_for_uniq%3Dfalse%26p…; _gid=GA1.2.260165497.1656813481; AMCV_AMCV_8853394255142B6A0A4C98A4%40AdobeOrg=-1124106680%7CMCMID%7C79412848632605408421861417717111497169%7CMCIDTS%7C19177%7CMCOPTOUT-1656820680s%7CNONE%7CvVersion%7C5.2.0; _fbp=fb.1.1656813480720.1479123934; AMCVS_AMCV_8853394255142B6A0A4C98A4%40AdobeOrg=1; adcloud={%22_les_v%22:%22y%2Crealtor.com%2C1656815313%22}; _clck=d7aq33|1|f2u|0; _clsk=sechzn|1656871412641|1|0|n.clarity.ms/collect; _uetsid=c56bc620fa7111ecb58a6f841dcc81b4; _uetvid=c56bcb70fa7111eca88d1bd5d241568e'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Firefox/102.0'
(fetch(scrapy.Request(url=url, headers={'cookie': cookie, 'user-agent': user_agent})), response)
import json  # not imported by default in the Scrapy shell
listings = json.loads(response.xpath('/html/body/script[1]/text()').getall()[0])['props']['pageProps']['searchResults']['home_search']['results']
A few months later I tried again (with a refreshed cookie) and got a 403 error: the server understood the request but refused to authorize it:
In [7]: (fetch(scrapy.Request(url=url, headers={'cookie': cookie, 'user-agent':
...: user_agent})), response)
2022-07-03 14:14:43 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.realtor.com/realestateandhomes-search/McLean_VA/type-single-family-home?pos=39.069149,-77.355927,38.742653,-76.862935,11&qdm=true&view=map> (referer: None)
Out[7]: (None, None)
Any ideas on what I can try to get this working again? Thanks.
1 Answer
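The cookie is not the root of the problem (see below). I think the issue here is 'view=map', which makes the server look for a 'referer' key in the headers dict (in addition to the other header keys). I suggest adding a 'referer': url key/value pair to your headers. Or you could try a less heavyweight approach:
Output: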
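A minimal sketch of the referer suggestion above, not a confirmed fix. The helper name build_headers is hypothetical, and defaulting the referer to the target URL itself is an assumption; the site may expect a different value or additional headers.

```python
def build_headers(url, cookie, user_agent, referer=None):
    """Return request headers that include a 'referer' entry.

    Hypothetical helper: if no referer is given, the target URL
    itself is used, which is an assumption rather than a known
    requirement of the site.
    """
    return {
        'cookie': cookie,
        'user-agent': user_agent,
        'referer': referer or url,
    }

# In the Scrapy shell, with url, cookie and user_agent defined as in
# the question, the request could then be issued as:
# fetch(scrapy.Request(url=url, headers=build_headers(url, cookie, user_agent)))
```

If the response is still 403 with the referer set, the block is probably triggered by something other than this header.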