python-3.x: How to download a web page as .mhtml

flvtvl50  posted on 2023-01-06 in Python

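I can successfully open a URL and save the resulting page as an .html file. However, I cannot work out how to download and save it as .mhtml (Web Page, Single File).
My code is: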

import urllib.parse, time
from urllib.parse import urlparse
import urllib.request

url = ('https://www.example.com')

encoded_url = urllib.parse.quote(url, safe='')

print(encoded_url)

base_url = ("https://translate.google.co.uk/translate?sl=auto&tl=en&u=")

translation_url = base_url+encoded_url

print(translation_url)

req = urllib.request.Request(translation_url, headers={'User-Agent': 'Mozilla/6.0'})

print(req)

response = urllib.request.urlopen(req)

time.sleep(15)

print(response)

webContent = response.read()

print(webContent)

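# Write the raw response bytes to disk; the with block closes the file
# (the original f.close lacked parentheses, so the file was never closed explicitly).
with open('GoogleTranslated.html', 'wb') as f:
    f.write(webContent)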

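I have tried to use wget, following the details captured in this question: but the details are incomplete (or I simply cannot understand them).
At this stage, any suggestion would be helpful.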


guz6ccqo #1

Have you tried using Selenium with the Chrome WebDriver to save the page?

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
FILE_NAME = ''

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)

# wait until body is loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.TAG_NAME, 'body')))
time.sleep(1)
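# open Chrome's 'Save as...' dialog (the browser window must have keyboard focus)
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
# type a file name only if one was given; otherwise keep the dialog's default
if FILE_NAME != '':
    pyautogui.typewrite(FILE_NAME)
# confirm the dialog with a single key press
pyautogui.press('enter')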
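
The script above stops right after pressing Enter, so nothing confirms that the browser actually finished writing the file. A minimal sketch of a polling helper follows; `wait_for_file` and its parameters are hypothetical names, and the "size stopped changing" check is only a heuristic for "the save is done".

```python
import time
from pathlib import Path


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Return True once `path` exists and its size has stopped changing."""
    target = Path(path)
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if target.exists():
            size = target.stat().st_size
            # Two equal, non-zero readings in a row: assume the save finished.
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    # The file never appeared, or kept growing until the timeout.
    return False
```

It could be called after the final key press with the path you expect the dialog to save to. That path depends on the browser's download settings and is not something the script above controls.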

vaj7vani #2

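I have a better solution: it needs no manual steps, and you choose the path where the mhtml file is stored yourself. I learned it from a Chinese blog; the key is to use a Chrome DevTools command.
The code below is an example.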

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.qq.com/')

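# 1. Execute the Chrome DevTools Protocol command to capture the page as MHTML
res = driver.execute_cdp_cmd('Page.captureSnapshot', {})

# 2. Write the file locally; the ./store directory must already exist.
#    newline='' keeps the line endings of the MHTML data unchanged.
with open('./store/qq.mhtml', 'w', newline='') as f:
    f.write(res['data'])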

driver.quit()

Hope this helps! See the Chrome DevTools Protocol documentation for more details.
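
The same idea can be wrapped in a reusable function. This is only a sketch: `write_mhtml` and `save_page_as_mhtml` are hypothetical helper names, and the `'format': 'mhtml'` parameter simply spells out what `Page.captureSnapshot` already does by default. The output directory is created if it is missing, and the driver is closed even when loading the page fails.

```python
from pathlib import Path


def write_mhtml(data, path):
    """Write MHTML text to `path`, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # newline='' stops Python from translating the line endings in the data.
    with open(target, 'w', newline='', encoding='utf-8') as f:
        f.write(data)
    return target


def save_page_as_mhtml(url, path):
    """Open `url` in Chrome and store the Page.captureSnapshot result at `path`."""
    # Imported here so write_mhtml can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        res = driver.execute_cdp_cmd('Page.captureSnapshot', {'format': 'mhtml'})
        return write_mhtml(res['data'], path)
    finally:
        # Always close the browser, even if the page load or snapshot failed.
        driver.quit()
```

For example, `save_page_as_mhtml('https://www.qq.com/', './store/qq.mhtml')` would reproduce the script above without requiring the `store` directory to exist beforehand.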


6jjcrrmo #3

To save as mhtml, you need to add the argument '--save-page-as-mhtml':

from selenium import webdriver

# start Chrome with the switch that enables saving pages as MHTML
options = webdriver.ChromeOptions()
options.add_argument('--save-page-as-mhtml')
driver = webdriver.Chrome(options=options)
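
Whichever way the page ends up on disk, it can be useful to check that the result really is a single-file archive and not plain HTML. An MHTML file is a MIME `multipart/related` message, so the standard-library `email` parser can read it. The sketch below uses a hypothetical helper name, `list_mhtml_parts`, and assumes the file follows that layout.

```python
from email import message_from_bytes, policy


def list_mhtml_parts(raw):
    """Return (content type, Content-Location) for each part of an MHTML archive."""
    msg = message_from_bytes(raw, policy=policy.default)
    if msg.get_content_type() != 'multipart/related':
        raise ValueError('not an MHTML archive: ' + msg.get_content_type())
    parts = []
    for part in msg.iter_parts():
        location = part['Content-Location']
        # Convert the header object to a plain str; keep None when it is absent.
        parts.append((part.get_content_type(),
                      str(location) if location is not None else None))
    return parts
```

Reading the saved file with `open(path, 'rb').read()` and passing the bytes to this function lists the bundled HTML, images and stylesheets. A file saved as ordinary HTML raises `ValueError` instead.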

b1payxdu #4

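This is how I wrote it. I apologize if anything is wrong.
I created a class so that you can reuse it. The three lines at the bottom are an example.
You can also change the sleep durations as needed.
By the way, non-English keyboards such as Japanese and Korean ones are supported as well.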

import time  # needed for the time.sleep() calls below
import uuid

import chromedriver_binary  # adds the bundled chromedriver to PATH
import pyautogui
import pyperclip
from selenium import webdriver

class DonwloadMhtml(webdriver.Chrome):
    def __init__(self):
        super().__init__()
        self._first_save = True
        time.sleep(2)

    
    def save_page(self, url, filename=None):
        self.get(url)

        time.sleep(3)
        # open 'Save as...' to save html and assets
        pyautogui.hotkey('ctrl', 's')
        time.sleep(1)

        if filename is None:
            pyperclip.copy(str(uuid.uuid4()))
        else:
            pyperclip.copy(filename)
            
        time.sleep(1)
        pyautogui.hotkey('ctrl', 'v')
        time.sleep(2)
        
        
        if self._first_save:
            pyautogui.hotkey('tab')
            time.sleep(1)
            pyautogui.press('down')
            time.sleep(1)
            pyautogui.press('up')
            time.sleep(1)
            pyautogui.hotkey('enter')
            time.sleep(1)
            self._first_save = False
            
        pyautogui.hotkey('enter')
        time.sleep(1)

# example
dm = DonwloadMhtml()

dm.save_page('https://en.wikipedia.org/wiki/Python_(programming_language)', 'wikipedia_python')         # create file named "wikipedia_python.mhtml"
dm.save_page('https://www.python.org/')                                                                 # file named randomly based on uuid4

Environment: Python 3.8.10, selenium==4.4.3
