Hashing a file in Python

Posted by txu3uszq on 2022-12-25 in Python

I want Python to read to the EOF so that I can get a proper hash, whether it is sha1 or md5. Please help. Here is what I have so far:

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed

Source: https://stackoverflow.com/questions/22058048/hashing-a-file-in-python

8 Answers

htrmnn0y1#

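TL;DR: use buffers so you don't use tons of memory.

I believe we get to the crux of your problem when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 GB of RAM for a 2 GB file, so, as pasztorpisti points out, we have to deal with those bigger files in chunks!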

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

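What we've done is update our hashes of this bad boy in 64 KB chunks as we go along, using hashlib's handy dandy update method. This way we use a lot less memory than the 2 GB it would take to hash the whole thing at once!
You can test this with: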

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

All of this is also outlined in the linked question listed in the right-hand sidebar: Get MD5 hash of big files in Python
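If you don't have a 2 GB file lying around, a quick sanity check is to confirm that feeding a hash object chunk by chunk gives the same digest as hashing everything at once. This is only a minimal sketch: the helper name `md5_in_chunks` and the sample data are made up for illustration, while `hashlib.md5()`, `update()` and `hexdigest()` are the same calls used above.

```python
import hashlib
import os
import tempfile

BUF_SIZE = 65536  # same arbitrary 64 KiB chunk size as above


def md5_in_chunks(path, buf_size=BUF_SIZE):
    """Hypothetical helper: MD5 of a file, read buf_size bytes at a time."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()


# Build a small throwaway file that is longer than three full chunks.
sample = b'hello world\n' * 20000  # 240,000 bytes
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

try:
    chunked = md5_in_chunks(tmp.name)
    one_shot = hashlib.md5(sample).hexdigest()
    print(chunked == one_shot)  # True: both ways produce the same digest
finally:
    os.remove(tmp.name)
```

The only thing that changes between the two approaches is peak memory use, not the result.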

Addendum!

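In general, when writing Python it helps to get into the habit of following PEP 8. For example, in Python variables are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except the people who have to read bad style... which might be you, reading this code years from now.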

eeq64g8w2#

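To compute the hash of a file correctly and efficiently (in Python 3):

Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
Don't read the whole file into memory, since that wastes memory. Instead, read it sequentially block by block and update the hash for each block.
Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
Use readinto() to avoid buffer churning.

Example: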

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(mv):
            h.update(mv[:n])
    return h.hexdigest()

Note that the while loop uses an assignment expression, which isn't available in Python versions older than 3.8.
With older Python 3 versions you can use an equivalent variant:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
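At the other end of the version range, Python 3.11 added `hashlib.file_digest()`, which does the chunked reading for you. The sketch below is not part of the answer above; it assumes you want one function that uses `file_digest()` when it exists and otherwise falls back to the same `readinto()` loop. The sample payload and temporary file are only there to make it runnable.

```python
import hashlib
import os
import tempfile


def sha256sum(filename):
    """SHA-256 of a file, using hashlib.file_digest() when it exists."""
    with open(filename, 'rb', buffering=0) as f:
        if hasattr(hashlib, 'file_digest'):
            # Python 3.11+: the standard library reads the file in chunks.
            return hashlib.file_digest(f, 'sha256').hexdigest()
        # Older Python 3: same readinto() loop as in the variant above.
        h = hashlib.sha256()
        mv = memoryview(bytearray(128 * 1024))
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
        return h.hexdigest()


# Usage: hash a throwaway file and compare with hashing the bytes directly.
payload = b'some test data' * 100000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    digest = sha256sum(tmp.name)
    print(digest == hashlib.sha256(payload).hexdigest())
finally:
    os.remove(tmp.name)
```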

5ktev3wc3#

I would simply propose:

import hashlib

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

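All the other answers here seem to complicate things too much. Python already buffers reads (in an ideal manner, or you can configure that buffering if you know more about the underlying storage), so it is better to read in chunks of the size the hash function finds ideal, which makes computing the hash faster or at least less CPU intensive. So use Python's buffering and control what you should be controlling: the chunk size the consumer of your data finds ideal, namely the hash block size.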
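To see what `block_size` actually is, you can print it for a few algorithms; for SHA-256 it is 64 bytes. The sketch below also adds a `blocks_per_read` parameter, which is a hypothetical knob of mine and not part of the function above, in case you want to read a multiple of the block size per call. Whether a larger multiple is faster depends on your machine, so measure before settling on a value.

```python
import hashlib
import os
import tempfile

# block_size is the hash algorithm's internal block size in bytes.
for name in ('md5', 'sha1', 'sha256', 'sha512'):
    print(name, hashlib.new(name).block_size)


def get_digest(file_path, blocks_per_read=1):
    """SHA-256 of a file, reading blocks_per_read hash blocks per call.

    blocks_per_read is a hypothetical parameter added for illustration.
    """
    h = hashlib.sha256()
    chunk_size = h.block_size * blocks_per_read
    with open(file_path, 'rb') as file:
        # iter(callable, sentinel) stops when read() returns b'' at EOF.
        for chunk in iter(lambda: file.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


# Usage: both chunk sizes must give the same digest.
content = b'0123456789' * 50000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
try:
    small_reads = get_digest(tmp.name)
    large_reads = get_digest(tmp.name, blocks_per_read=1024)
    print(small_reads == large_reads)
finally:
    os.remove(tmp.name)
```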

0ve6wy6x4#

Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the file into memory.

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
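Two things to be aware of with the function above: `prot=mmap.PROT_READ` is Unix-only, and `mmap` raises `ValueError` when asked to map a zero-length file. A possible variant, as a sketch only, uses `access=mmap.ACCESS_READ` (accepted on both Unix and Windows) and skips the mapping for empty files. The name `sha256sum_mmap` and the test files are made up for this example.

```python
import hashlib
import mmap
import os
import tempfile


def sha256sum_mmap(filename):
    """SHA-256 via mmap; hypothetical variant of the function above."""
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        # mmap refuses to map a zero-length file, so skip mapping then.
        if os.fstat(f.fileno()).st_size > 0:
            # access=ACCESS_READ works on Unix and Windows,
            # unlike prot=PROT_READ, which is Unix-only.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                h.update(mm)
    return h.hexdigest()


# Usage: one empty file and one non-empty file.
body = b'mapped bytes' * 1000
with tempfile.NamedTemporaryFile(delete=False) as empty_tmp:
    pass
with tempfile.NamedTemporaryFile(delete=False) as full_tmp:
    full_tmp.write(body)
try:
    empty_digest = sha256sum_mmap(empty_tmp.name)
    full_digest = sha256sum_mmap(full_tmp.name)
    print(empty_digest == hashlib.sha256(b'').hexdigest())
    print(full_digest == hashlib.sha256(body).hexdigest())
finally:
    os.remove(empty_tmp.name)
    os.remove(full_tmp.name)
```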

pb3skfrl5#

I have written a module that can hash large files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

7kjnsjlb6#

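You don't need to define a function with 5-20 lines of code to do this! Save your time by using the pathlib and hashlib libraries. py_essentials is another solution, but third-party packages are *****.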

from pathlib import Path
import hashlib

filepath = '/path/to/file'
filebytes = Path(filepath).read_bytes()

filehash_sha1 = hashlib.sha1(filebytes)
filehash_md5 = hashlib.md5(filebytes)

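# Call hexdigest() to print the hex string rather than the hash object's repr.
print(f'MD5: {filehash_md5.hexdigest()}')
print(f'SHA1: {filehash_sha1.hexdigest()}')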

I've used a few variables here to show the steps; you know how to avoid them.
What do you think about the function below?

from pathlib import Path
import hashlib

def compute_filehash(filepath: str, hashtype: str) -> str:
    """Computes the requested hash for the given file.

    Args:
        filepath: The path to the file to compute the hash for.
        hashtype: The hash type to compute.

          Available hash types:
            md5, sha1, sha224, sha256, sha384, sha512, sha3_224,
            sha3_256, sha3_384, sha3_512

    Returns:
        A string that represents the hash.

    Raises:
        ValueError: If the hash type is not supported.
    """
    # shake_128/shake_256 are left out: their hexdigest() requires a length.
    if hashtype not in ['md5', 'sha1', 'sha224', 'sha256', 'sha384',
                        'sha512', 'sha3_224', 'sha3_256', 'sha3_384',
                        'sha3_512']:
        raise ValueError(f'Hash type {hashtype} is not supported.')

    return getattr(hashlib, hashtype)(
        Path(filepath).read_bytes()).hexdigest()
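One caveat: `read_bytes()` loads the whole file into memory, which is exactly what the other answers try to avoid for large files. Below is a rough sketch of the same idea with chunked reads. It swaps the hard-coded list for `hashlib.new()`, which raises `ValueError` for unknown names on its own. The function name `compute_filehash_chunked`, the `bufsize` parameter and its default are my own assumptions, and the SHAKE algorithms are rejected explicitly because their `hexdigest()` needs a length argument.

```python
import hashlib
import os
import tempfile


def compute_filehash_chunked(filepath, hashtype, bufsize=64 * 1024):
    """Hypothetical chunked variant: hex digest of a file for hashtype."""
    if hashtype.startswith('shake_'):
        # SHAKE digests are variable-length; hexdigest() needs a length.
        raise ValueError(f'Hash type {hashtype} is not supported.')
    # hashlib.new() raises ValueError if the name is not available.
    h = hashlib.new(hashtype)
    with open(filepath, 'rb') as f:
        # Read bufsize bytes at a time until read() returns b'' at EOF.
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()


# Usage: compare against hashing the same bytes in one go.
payload = b'file contents' * 30000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    result = compute_filehash_chunked(tmp.name, 'sha256')
    print(result == hashlib.sha256(payload).hexdigest())
finally:
    os.remove(tmp.name)
```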

ego6inou7#

FWIW, I prefer this version, which has the same memory and performance characteristics as maxschlepzig's answer but is more readable IMO:

import hashlib

def sha256sum(filename, bufsize=128 * 1024):
    h = hashlib.sha256()
    buffer = bytearray(bufsize)
    # using a memoryview so that we can slice the buffer without copying it
    buffer_view = memoryview(buffer)
    with open(filename, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buffer_view)
            if not n:
                break
            h.update(buffer_view[:n])
    return h.hexdigest()

7gcisfzg8#

import hashlib
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
with open("encrypted.txt","w") as e:
    print(h2,file=e)

with open("encrypted.txt","r") as e:
    p = e.readline().strip()
    print(p)
