Why do I get a Forbidden status code when I just copy the cURL command to Python?

z4bn682m  posted on 2022-11-13  in  Python

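I am trying to scrape a website with Python requests (url = https://sports.betway.be/nl/sports/grp/soccer/belgium/first-division-a). Through the Network tab I found the request that downloads the JSON data (it is called GetEvents, if you want to try it). I copied it as cURL and converted it to Python (via https://curlconverter.com/), which gave me the following code: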

import requests

cookies = {
    '_gcl_au': '1.1.1661658686.1660753544',
    '_ga': 'GA1.2.1596995754.1660753545',
    'BETWAY_ENSIGHTEN_PRIVACY_Marketing': '1',
    'BETWAY_ENSIGHTEN_PRIVACY_Analytics': '1',
    'bwui_cookieToastDismissed': 'true',
    'ssc_DeviceId': '0bccd9bb-9f38-4ceb-b225-2956e2163d27',
    'ssc_DeviceId_HttpOnly': '0bccd9bb-9f38-4ceb-b225-2956e2163d27',
    'ai_user': 'v0JA/|2022-08-17T16:25:54.617Z',
    'ens_firstVisit': '1660753692684',
    'bw_BrowserId': '19353358536559595037829240027349442244',
    '_sp_srt_id.c606': '543c5eed-4191-4fe4-9a12-2290ac66f159.1660753575.6.1661113080.1661108591.d6222a58-ed29-4651-ac7e-0b69163cf243',
    'userLanguage': 'nl',
    'SpinSportVisitId': '33f030d9-79ab-45de-b0bd-1ef52ae71f37',
    'ssc_btag': '13d5a754-1cb1-45ec-88a5-a4ab3da0a288',
    'TrackingVisitId': '13d5a754-1cb1-45ec-88a5-a4ab3da0a288',
    'bw_SessionId': '7fab38de-b50f-4d35-8333-e86a4a026a6f',
    'ai_session': 'hkBwK|1661929976364.3|1661929976364.3',
    'domainCookie': 'betway.be',
    '_gid': 'GA1.2.2070786832.1661929989',
    '_gat_UA-1515961-1': '1',
    'TimezoneOffset': '120',
    '_gat': '1',
    '_scid': '6e81eb56-b82b-4086-b7a0-783de175b7d8',
    'ens_firstPageView': 'false',
    'AMCVS_74756B615BE2FD4A0A495EB8%40AdobeOrg': '1',
    'AMCV_74756B615BE2FD4A0A495EB8%40AdobeOrg': '359503849%7CMCIDTS%7C19236%7CMCMID%7C25582052571887037000237579884308421467%7CMCAAMLH-1662534800%7C6%7CMCAAMB-1662534800%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCCIDH%7C1868381894%7CMCOPTOUT-1661937200s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C5.0.1',
    '_gat_reg1': '1',
    '_gat_ens': '1',
    'gpv_pn': 'nl%3Asports%3Agrp%3Asoccer%3Abelgium%3Afirst-division-a',
    's_cc': 'true',
    'StaticResourcesVersion': '12.67.0.1',
    '__cf_bm': 'ljcKg_UW1mTQeV80dsFzYz8bMMSdjFYNyXvAegwnB9c-1661930020-0-AZzywRq0XM6JfSMfWMZwpLg7AkUfHOh2ZOqCgf6xeJOmLuat6hyWkt/33xjyKT5yMZAPtLUM/OpG39S+/RtEOUg=',
}

headers = {
    'authority': 'sports.betway.be',
    'accept': 'application/json; charset=UTF-8',
    'accept-language': 'nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7',
    'content-type': 'application/json; charset=UTF-8',
    # Requests sorts cookies= alphabetically
    # 'cookie': '_gcl_au=1.1.1661658686.1660753544; _ga=GA1.2.1596995754.1660753545; BETWAY_ENSIGHTEN_PRIVACY_Marketing=1; BETWAY_ENSIGHTEN_PRIVACY_Analytics=1; bwui_cookieToastDismissed=true; ssc_DeviceId=0bccd9bb-9f38-4ceb-b225-2956e2163d27; ssc_DeviceId_HttpOnly=0bccd9bb-9f38-4ceb-b225-2956e2163d27; ai_user=v0JA/|2022-08-17T16:25:54.617Z; ens_firstVisit=1660753692684; bw_BrowserId=19353358536559595037829240027349442244; _sp_srt_id.c606=543c5eed-4191-4fe4-9a12-2290ac66f159.1660753575.6.1661113080.1661108591.d6222a58-ed29-4651-ac7e-0b69163cf243; userLanguage=nl; SpinSportVisitId=33f030d9-79ab-45de-b0bd-1ef52ae71f37; ssc_btag=13d5a754-1cb1-45ec-88a5-a4ab3da0a288; TrackingVisitId=13d5a754-1cb1-45ec-88a5-a4ab3da0a288; bw_SessionId=7fab38de-b50f-4d35-8333-e86a4a026a6f; ai_session=hkBwK|1661929976364.3|1661929976364.3; domainCookie=betway.be; _gid=GA1.2.2070786832.1661929989; _gat_UA-1515961-1=1; TimezoneOffset=120; _gat=1; _scid=6e81eb56-b82b-4086-b7a0-783de175b7d8; ens_firstPageView=false; AMCVS_74756B615BE2FD4A0A495EB8%40AdobeOrg=1; AMCV_74756B615BE2FD4A0A495EB8%40AdobeOrg=359503849%7CMCIDTS%7C19236%7CMCMID%7C25582052571887037000237579884308421467%7CMCAAMLH-1662534800%7C6%7CMCAAMB-1662534800%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCCIDH%7C1868381894%7CMCOPTOUT-1661937200s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C5.0.1; _gat_reg1=1; _gat_ens=1; gpv_pn=nl%3Asports%3Agrp%3Asoccer%3Abelgium%3Afirst-division-a; s_cc=true; StaticResourcesVersion=12.67.0.1; __cf_bm=ljcKg_UW1mTQeV80dsFzYz8bMMSdjFYNyXvAegwnB9c-1661930020-0-AZzywRq0XM6JfSMfWMZwpLg7AkUfHOh2ZOqCgf6xeJOmLuat6hyWkt/33xjyKT5yMZAPtLUM/OpG39S+/RtEOUg=',
    'origin': 'https://sports.betway.be',
    'referer': 'https://sports.betway.be/nl/sports/grp/soccer/belgium/first-division-a',
    'sec-ch-ua': '"Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
}

json_data = {
    'LanguageId': 8,
    'ClientTypeId': 2,
    'BrandId': 3,
    'JurisdictionId': 3,
    'ClientIntegratorId': 1,
    'ExternalIds': [
        10121019,
        10121020,
        10121021,
    ],
    'MarketCName': 'win-draw-win',
    'ScoreboardRequest': {
        'ScoreboardType': 3,
        'IncidentRequest': {},
    },
    'BrowserId': 3,
    'OsId': 3,
    'ApplicationVersion': '',
    'BrowserVersion': '104.0.0.0',
    'OsVersion': 'NT 10.0',
    'SessionId': None,
    'TerritoryId': 21,
    'CorrelationId': '625f145c-8a5e-40a5-af44-a0a13116961c',
    'VisitId': '33f030d9-79ab-45de-b0bd-1ef52ae71f37',
    'ViewName': 'sports',
    'JourneyId': 'd857f2d0-b79f-4346-9683-f61d1e0c9854',
}

response = requests.post('https://sports.betway.be/api/Events/v2/GetEvents', cookies=cookies, headers=headers, json=json_data)

But when I run this, it gives me a 403 Forbidden.

c6ubokkw #1

Many sites do not want to be scraped, and they check the request headers carefully.
The 403 Forbidden tells you something: they know what you are trying to do, and they are giving you a challenge. I took a quick look at this site and it uses a lot of cookies. I had to write my own code to receive and send back the cookies because the cookie jar in PHP's curl did not work well enough.
Look at your browser's request when you go to the site. Your browser is doing a GET request, so you must do so too.
It looks like you are off to a good start with the cookies, but there may be a timestamp in there.
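To illustrate the timestamp point: some cookie values in the question appear to embed Unix epoch times, so a copied cookie can be stale by the time the script runs. The following is a minimal sketch, assuming that the 10-digit field inside `__cf_bm` is epoch seconds and that the 13-digit `ens_firstVisit` value is epoch milliseconds (the site does not document this). The helper name `epoch_fields` is made up for illustration.

```python
import re
from datetime import datetime, timezone

def epoch_fields(value):
    """Return any standalone 10- or 13-digit runs in a cookie value as UTC datetimes."""
    found = []
    for digits in re.findall(r"(?<!\d)(\d{13}|\d{10})(?!\d)", value):
        # Assumption: 13 digits means milliseconds, 10 digits means seconds.
        seconds = int(digits) / 1000 if len(digits) == 13 else int(digits)
        found.append(datetime.fromtimestamp(seconds, tz=timezone.utc))
    return found

# Values copied from the cookies dict in the question (the __cf_bm one is shortened).
cf_bm = "ljcKg_UW1mTQeV80dsFzYz8bMMSdjFYNyXvAegwnB9c-1661930020-0-AZzywRq0"
first_visit = "1660753692684"

for name, value in [("__cf_bm", cf_bm), ("ens_firstVisit", first_visit)]:
    for moment in epoch_fields(value):
        print(name, moment.isoformat())
```

If the printed dates are well in the past, the copied cookies have probably expired, which would be one possible reason for the 403.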
So you must pay very close attention to your request headers.
Sometimes I will try a very rare User-Agent. Many sites profile the headers, and they can do things you may never think of, like comparing the SSL handshake, because not all browsers do it the same way. That is why I will try a UA whose profile they will not know.
If I use a Firefox UA, I have to be careful that the site cannot tell the difference between my curl request and an actual browser.
Redirects are very common. On the initial response they will store a cookie, send you a 302 redirect back to themselves, and then check for the cookie. They can use JavaScript to do things too.
Many times you must accept their cookies and return them in your request headers.
In general, if I can turn off my browser's JavaScript and still get to the data I want, I know I can scrape the site.
The site you are trying to scrape cannot be navigated without JavaScript, so you need to go directly to the page that has what you want. You may need to first go to the index page to get the cookies that may be needed to enter the page you targeted.
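The "visit a page first, then return its cookies" idea can be sketched with `requests.Session`, which stores cookies from each response and sends them back on later requests. This is only a sketch under assumptions: the two URLs are the ones from the question, the function names `make_session` and `fetch_events` are made up, and nothing here guarantees that the site will accept the request, since it may still rely on JavaScript or on fingerprinting the connection.

```python
import requests

PAGE_URL = "https://sports.betway.be/nl/sports/grp/soccer/belgium/first-division-a"
API_URL = "https://sports.betway.be/api/Events/v2/GetEvents"

def make_session(user_agent):
    """Create a session that keeps received cookies and resends them automatically."""
    session = requests.Session()
    session.headers.update({
        "user-agent": user_agent,
        "accept-language": "nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7",
    })
    return session

def fetch_events(session, payload):
    """GET the page first to collect fresh cookies, then POST to the JSON endpoint."""
    page = session.get(PAGE_URL, timeout=10)
    # page.history holds the 3xx responses that were followed on the way, if any.
    print("page:", page.status_code, "redirects followed:", len(page.history))
    print("cookies received:", sorted(session.cookies.keys()))
    return session.post(
        API_URL,
        json=payload,
        headers={"origin": "https://sports.betway.be", "referer": PAGE_URL},
        timeout=10,
    )
```

Calling `fetch_events(make_session(headers['user-agent']), json_data)` with the values from the question would replace the hard-coded `cookies` dict with whatever the server sets during the initial GET. If the GET itself already returns 403, the block is happening before cookies come into play.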
Sometimes I get lucky and find all the data I want buried as a JSON object inside a script on the page.
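When the data is embedded that way, it can often be pulled out with the standard library alone. Below is a minimal sketch, assuming the page assigns a strict JSON object literal to a variable. The variable name `window.__DATA__`, the sample HTML, and the helper `extract_json_object` are all invented for illustration and are not taken from the site in the question.

```python
import json
import re

def extract_json_object(html, variable_name):
    """Find `variable_name = {...}` in the HTML and decode the JSON value after it."""
    match = re.search(re.escape(variable_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever text follows it.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj

# Made-up sample page; a real page would come from response.text.
sample = """
<script>
  window.__DATA__ = {"events": [{"id": 10121019, "name": "A vs B"}], "count": 1};
  console.log("ready");
</script>
"""

data = extract_json_object(sample, "window.__DATA__")
print(data["events"][0]["id"])
```

This only works when the embedded object is valid JSON (double-quoted keys, no trailing commas). A JavaScript object literal that is not strict JSON would raise `json.JSONDecodeError` and need a more tolerant parser.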
