How to save all network traffic from a website (including request and response headers) using Python

fhg3lkii · asked 12 months ago · in Python
Follow (0) | Answers (2) | Views (119)

I am trying to find an object that is downloaded by the browser while it loads a website.
The website is https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en
I am not very familiar with web technologies.
I am trying to save the request and response headers, and the actual response body, using only the link to the website.
If you look at the network traffic, you can see an object jobsearch.ftl?lang=en that loads last; for it you can see the response and the headers.
Below are screenshots of the network event log, showing the request and response headers, the data, and the actual response.
These are the objects I want to save. How can I do this?
Here is what I have tried:

import json
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException, StaleElementReferenceException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

chromepath = "~/chromedriver/chromedriver"

caps = DesiredCapabilities.CHROME
caps['goog:loggingPrefs'] = {'performance': 'ALL'}
driver = webdriver.Chrome(executable_path=chromepath, desired_capabilities=caps)
driver.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en')

def process_browser_log_entry(entry):
    response = json.loads(entry['message'])['message']
    return response

browser_log = driver.get_log('performance')
events = [process_browser_log_entry(entry) for entry in browser_log]
events = [event for event in events if 'Network.response' in event['method']]
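
To go one step further than the filter above and actually pull the headers out of the filtered events: each log entry's 'message' field is a JSON string wrapping a CDP event, and for Network.responseReceived events the headers sit under params.response.headers. A small sketch using a mocked log entry (the shape mirrors what driver.get_log('performance') returns, but the values here are made up):

```python
import json

def extract_response_headers(browser_log):
    """Collect response headers per requestId from a Chrome performance log."""
    headers_by_request = {}
    for entry in browser_log:
        # each entry['message'] is a JSON string wrapping the CDP event
        event = json.loads(entry['message'])['message']
        if event['method'] == 'Network.responseReceived':
            params = event['params']
            headers_by_request[params['requestId']] = params['response']['headers']
    return headers_by_request

# mocked log entry with the same shape as driver.get_log('performance') output
fake_log = [{'message': json.dumps({'message': {
    'method': 'Network.responseReceived',
    'params': {'requestId': 'abc123',
               'response': {'headers': {'Content-Type': 'text/html'}}}}})}]

print(extract_response_headers(fake_log))
# {'abc123': {'Content-Type': 'text/html'}}
```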

But I can only see some of the headers, and they look like this:

{'method': 'Network.responseReceivedExtraInfo',
  'params': {'blockedCookies': [],
   'headers': {'Cache-Control': 'private',
    'Connection': 'Keep-Alive',
    'Content-Encoding': 'gzip',
    'Content-Security-Policy': "frame-ancestors 'self'",
    'Content-Type': 'text/html;charset=UTF-8',
    'Date': 'Mon, 27 Sep 2021 18:18:10 GMT',
    'Keep-Alive': 'timeout=5, max=100',
    'P3P': 'CP="CAO PSA OUR"',
    'Server': 'Taleo Web Server 8',
    'Set-Cookie': 'locale=en; path=/careersection/; secure; HttpOnly',
    'Transfer-Encoding': 'chunked',
    'Vary': 'Accept-Encoding',
    'X-Content-Type-Options': 'nosniff',
    'X-UA-Compatible': 'IE=edge',
    'X-XSS-Protection': '1'},
   'headersText': 'HTTP/1.1 200 OK\r\nDate: Mon, 27 Sep 2021 18:18:10 GMT\r\nServer: Taleo Web Server 8\r\nCache-Control: private\r\nP3P: CP="CAO PSA OUR"\r\nContent-Encoding: gzip\r\nVary: Accept-Encoding\r\nX-Content-Type-Options: nosniff\r\nSet-Cookie: locale=en; path=/careersection/; secure; HttpOnly\r\nContent-Security-Policy: frame-ancestors \'self\'\r\nX-XSS-Protection: 1\r\nX-UA-Compatible: IE=edge\r\nKeep-Alive: timeout=5, max=100\r\nConnection: Keep-Alive\r\nTransfer-Encoding: chunked\r\nContent-Type: text/html;charset=UTF-8\r\n\r\n',
   'requestId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'resourceIPAddressSpace': 'Public'}},
 {'method': 'Network.responseReceived',
  'params': {'frameId': '1624E6F3E724CA508A6D55D556CBE198',
   'loaderId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'requestId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'response': {'connectionId': 26,


They do not contain all the information I can see in Chrome's web inspector.
I want to get the full request and response headers as well as the actual response body. Is this the right approach? Is there a better way that does not use Selenium and uses only requests?

disho6za · Answer 1

If you want to handle it with Selenium, you can use the selenium-wire library. However, if you only care about one specific API, you can hit that API with the requests library and print the resulting request and response headers, instead of using Selenium.
Given that you are looking for the earlier, Selenium-based approach, one way to achieve this is the selenium-wire library. Note that it will give you the results of all background APIs/requests; you can then easily filter them after piping the results to a text file or to the terminal itself.
Install it with pip install selenium-wire
Install webdriver-manager with pip install webdriver-manager
Install Selenium 4 with pip install selenium==4.0.0.b4
Then use this code:

from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

svc    = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=svc)

driver.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en')

for req in driver.requests:
  if req.response:
    print(
      req.url,
      req.response.status_code,
      req.headers,
      req.response.headers
    )
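
To save the traffic rather than just print it, one option is to collect each request/response pair into plain dicts and dump them to a JSON file. This is only a sketch; the save_traffic helper and the record field names are my own invention, not part of selenium-wire:

```python
import json

def save_traffic(records, path):
    """Write a list of request/response summaries to a JSON file.

    `records` is a list of dicts like
    {'url': ..., 'status': ..., 'request_headers': ..., 'response_headers': ...}.
    """
    with open(path, 'w') as f:
        json.dump(records, f, indent=2)

# With selenium-wire, records could be built as (untested sketch):
# records = [{'url': r.url,
#             'status': r.response.status_code,
#             'request_headers': dict(r.headers),
#             'response_headers': dict(r.response.headers)}
#            for r in driver.requests if r.response]
records = [{'url': 'https://example.com', 'status': 200,
            'request_headers': {'Accept': '*/*'},
            'response_headers': {'Content-Type': 'text/html'}}]
save_traffic(records, 'traffic.json')
```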

It gives detailed output for all the requests; here is the relevant one, copied out:

https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en 200 

Host: epco.taleo.net
Connection: keep-alive
sec-ch-ua: "Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8

Date: Tue, 28 Sep 2021 11:14:14 GMT
Server: Taleo Web Server 8
Cache-Control: private
P3P: CP="CAO PSA OUR"
Content-Encoding: gzip
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Set-Cookie: locale=en; path=/careersection/; secure; HttpOnly
Content-Security-Policy: frame-ancestors 'self'
X-XSS-Protection: 1
X-UA-Compatible: IE=edge
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html;charset=UTF-8
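
The requests-only option mentioned at the top of this answer can be sketched as follows. To keep the example runnable without network access, it only builds and inspects a prepared request; assuming you are online, resp = requests.get(url) then gives you resp.request.headers, resp.headers and resp.text:

```python
import requests

url = 'https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en'

session = requests.Session()
# Build the outgoing request without sending it, so its headers can be inspected offline
prepared = session.prepare_request(requests.Request('GET', url))
print(prepared.method, prepared.url)
print(dict(prepared.headers))  # session defaults such as User-Agent, Accept, ...
```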

brc7rcf0 · Answer 2

You can use JavaScript from within Selenium, so this will be easier:

var req = new XMLHttpRequest();
req.open("get", url_address_string);
req.send();
// when you get your data then:
req.getAllResponseHeaders();

XMLHttpRequest is JavaScript, so you need some Python code to make use of this answer.
OK, here you go:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://stackoverflow.com")
driver.execute_script("""var xhr = new XMLHttpRequest();

xhr.addEventListener('loadend', (ev) => {
    // assign the headers to a property on the window object so we can read them later
    window.rH = xhr.getAllResponseHeaders();
    console.log(window.rH);
})

xhr.open("get", "https://stackoverflow.com/")
xhr.send()
""")
# need to wait because the xhr request is async; this is dirty, don't do this ;)
time.sleep(5)
# now we can extract our 'rH' property from window, with JavaScript
headers = driver.execute_script("""return window.rH""")
# <--- "accept-ranges: bytes\r\ncache-control: private\r\ncontent-encoding: gzip\r\ncontent-security-policy: upgrade-insecure-requests; ....
print(headers)
# headers is just one string, but its parts are separated by \r\n, so you need
# headers.split("\r\n")
# which will give you a list
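
The split that the comments above describe can be wrapped up in a few lines of pure Python; a sketch with a sample raw header string:

```python
def parse_raw_headers(raw):
    """Split an XHR getAllResponseHeaders() string into a dict.

    Lines are separated by \\r\\n and each looks like 'name: value'.
    """
    headers = {}
    for line in raw.split('\r\n'):
        if ': ' in line:
            name, value = line.split(': ', 1)
            headers[name] = value
    return headers

sample = 'accept-ranges: bytes\r\ncache-control: private\r\ncontent-encoding: gzip\r\n'
print(parse_raw_headers(sample))
# {'accept-ranges': 'bytes', 'cache-control': 'private', 'content-encoding': 'gzip'}
```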


Edit 2: You do not actually need the headers. When your browser goes to the desired URL, one of the response variables created for this page is _ftl.
When you open Dev tools -> console and type "_ftl", you will see the object. Now you want to access it, but that is not easy: _ftl is a deeply nested object. So you have to pick its properties and try to access them, e.g. a = driver.execute_script("return window._ftl._acts").
But accessing the data will be hard work; _ftl is a nested object, and the Selenium JS serializer cannot handle it automatically.
So, another answer:

import requests
from bs4 import BeautifulSoup

url = "https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en"

g = requests.get(url)

# an explicit parser avoids BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(g.text, "html.parser")
ftl_script = soup.find_all('script')[-1]
data_you_need = ftl_script.text


But this gives you a raw string; you still need to find a way to process it.
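
Since the exact shape of that script text is unknown, the following is only a hypothetical sketch of one way to process it: pull an embedded object literal out with a regex and parse it as JSON. The variable name payload and the sample string are invented for illustration, and the approach only works if the embedded data happens to be valid JSON:

```python
import json
import re

# invented sample standing in for the real script tag's text
script_text = 'var payload = {"jobs": [{"id": 1, "title": "Engineer"}]};'

# grab the first {...} object literal after the assignment; this assumes
# the embedded data is valid JSON, which may not hold for the real page
match = re.search(r'payload\s*=\s*(\{.*\})\s*;', script_text)
if match:
    data = json.loads(match.group(1))
    print(data['jobs'][0]['title'])
# Engineer
```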
