How do I compute the hash of a filesystem directory using Python?

ki0zmccv posted on 2023-02-07 in Python

I use this code to compute the hash of a file:

import hashlib

m = hashlib.md5()
with open("calculator.pdf", 'rb') as fh:
    while True:
        # Read the file in 8 KiB chunks so large files are not loaded at once
        data = fh.read(8192)
        if not data:
            break
        m.update(data)
hash_value = m.hexdigest()

print(hash_value)

When I try it on a folder named "folder", I get:

IOError: [Errno 13] Permission denied: folder

How can I compute the hash of a folder?

lvjbypge 1#

Use the checksumdir Python package to compute the checksum/hash of a directory. It is available at https://pypi.python.org/pypi/checksumdir

Usage:

import checksumdir
dir_hash = checksumdir.dirhash("c:\\temp")
print(dir_hash)

jutyujz0 2#

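Here is an implementation that uses pathlib.Path instead of os.walk. It sorts the directory contents before iterating, so the result is repeatable across platforms. It also updates the hash with the names of the files and directories, so adding empty files and directories changes the hash.
Version with type annotations (Python 3.6 or later):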

import hashlib
from _hashlib import HASH as Hash
from pathlib import Path
from typing import Union

def md5_update_from_file(filename: Union[str, Path], hash: Hash) -> Hash:
    assert Path(filename).is_file()
    with open(str(filename), "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash

def md5_file(filename: Union[str, Path]) -> str:
    return str(md5_update_from_file(filename, hashlib.md5()).hexdigest())

def md5_update_from_dir(directory: Union[str, Path], hash: Hash) -> Hash:
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        hash.update(path.name.encode())
        if path.is_file():
            hash = md5_update_from_file(path, hash)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash

def md5_dir(directory: Union[str, Path]) -> str:
    return str(md5_update_from_dir(directory, hashlib.md5()).hexdigest())

Without type annotations:

import hashlib
from pathlib import Path

def md5_update_from_file(filename, hash):
    assert Path(filename).is_file()
    with open(str(filename), "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash

def md5_file(filename):
    return md5_update_from_file(filename, hashlib.md5()).hexdigest()

def md5_update_from_dir(directory, hash):
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir()):
        hash.update(path.name.encode())
        if path.is_file():
            hash = md5_update_from_file(path, hash)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash

def md5_dir(directory):
    return md5_update_from_dir(directory, hashlib.md5()).hexdigest()

A condensed version, if you only need to hash directories:

import hashlib
from pathlib import Path

def md5_update_from_dir(directory, hash):
    assert Path(directory).is_dir()
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        hash.update(path.name.encode())
        if path.is_file():
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash.update(chunk)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash

def md5_dir(directory):
    return md5_update_from_dir(directory, hashlib.md5()).hexdigest()

Usage: md5_hash = md5_dir("/some/directory")
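
To check the claim that names feed into the hash, here is a small sketch that builds a temporary directory, hashes it, adds an empty file, and hashes it again. It repeats the condensed md5_dir from above so that it runs on its own; the temporary layout and the file names a.txt and empty.txt are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_update_from_dir(directory, hash):
    # Same logic as the condensed version above: entry name first, then contents.
    for path in sorted(Path(directory).iterdir(), key=lambda p: str(p).lower()):
        hash.update(path.name.encode())
        if path.is_file():
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash.update(chunk)
        elif path.is_dir():
            hash = md5_update_from_dir(path, hash)
    return hash


def md5_dir(directory):
    return md5_update_from_dir(directory, hashlib.md5()).hexdigest()


def demo():
    # Hypothetical layout, created in a throwaway temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_bytes(b"hello")
        before = md5_dir(root)
        unchanged = md5_dir(root)     # hashing the same tree again
        (root / "empty.txt").touch()  # an empty file contributes only its name
        after = md5_dir(root)
    return before, unchanged, after


before, unchanged, after = demo()
print(before == unchanged, before != after)
```

Because only the entry name (not the full path) is hashed, moving the whole directory elsewhere should leave the result unchanged.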

2eafrhcq 3#

This recipe provides a nice function that does what you are asking. I have modified it to use MD5, as in your original question, instead of SHA1.

def GetHashofDirs(directory, verbose=0):
  import hashlib, os
  SHAhash = hashlib.md5()  # MD5, despite the variable name kept from the SHA1 recipe
  if not os.path.exists(directory):
    return -1

  try:
    for root, dirs, files in os.walk(directory):
      for names in files:
        if verbose == 1:
          print('Hashing', names)
        filepath = os.path.join(root, names)
        try:
          f1 = open(filepath, 'rb')
        except OSError:
          # The file cannot be opened for some reason, so skip it
          continue

        while True:
          # Read the file in small chunks
          buf = f1.read(4096)
          if not buf:
            break
          # hexdigest() returns str; encode it, since update() needs bytes on Python 3
          SHAhash.update(hashlib.md5(buf).hexdigest().encode())
        f1.close()

  except Exception:
    import traceback
    # Print the stack traceback
    traceback.print_exc()
    return -2

  return SHAhash.hexdigest()

You can use it like this:

print(GetHashofDirs('folder_to_hash', 1))

The output looks like this, since it hashes each file:

...
Hashing file1.cache
Hashing text.txt
Hashing library.dll
Hashing vsfile.pdb
Hashing prog.cs
5be45c5a67810b53146eaddcae08a809

The value returned by the function call is the hash.
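
One property of the recipe above is worth knowing: it feeds the hex digest of each 4096-byte chunk into the running hash rather than the raw bytes, so the final value depends on the chunk size and will not match an ordinary MD5 of the file contents. The sketch below compares the two approaches on in-memory data; the helper names and the sample data are invented for illustration.

```python
import hashlib


def digest_raw(data, chunk_size):
    """Feed raw chunks into one running hash (independent of chunk size)."""
    h = hashlib.md5()
    for i in range(0, len(data), chunk_size):
        h.update(data[i:i + chunk_size])
    return h.hexdigest()


def digest_of_chunk_digests(data, chunk_size):
    """Feed the hex digest of each chunk, as the recipe above does."""
    h = hashlib.md5()
    for i in range(0, len(data), chunk_size):
        h.update(hashlib.md5(data[i:i + chunk_size]).hexdigest().encode())
    return h.hexdigest()


data = b"x" * 10000  # made-up sample data

# Raw chunking agrees with a one-shot MD5 regardless of chunk size.
raw_same = (digest_raw(data, 4096)
            == digest_raw(data, 1024)
            == hashlib.md5(data).hexdigest())
# Hashing per-chunk digests gives different results for different chunk sizes.
nested_same = (digest_of_chunk_digests(data, 4096)
               == digest_of_chunk_digests(data, 1024))
print(raw_same, nested_same)
```

If the value needs to be comparable with other tools, feeding the raw bytes is probably the safer choice.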

lf5gs5x2 4#

I did not much like how the recipe referenced in the answer was written. I have a simpler version:

import hashlib
import os

def hash_directory(path):
    digest = hashlib.sha1()

    for root, dirs, files in os.walk(path):
        for names in files:
            file_path = os.path.join(root, names)

            # Hash the path and add to the digest to account for empty files/directories
            digest.update(hashlib.sha1(file_path[len(path):].encode()).digest())

            # Per @pt12lol - if the goal is uniqueness over repeatability, this is an alternative method using 'hash'
            # digest.update(str(hash(file_path[len(path):])).encode())

            if os.path.isfile(file_path):
                with open(file_path, 'rb') as f_obj:
                    while True:
                        buf = f_obj.read(1024 * 1024)
                        if not buf:
                            break
                        digest.update(buf)

    return digest.hexdigest()

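I found that exceptions were usually thrown whenever something like an alias was encountered (it shows up in os.walk(), but you cannot open it directly). The os.path.isfile() check in the code above guards against those entries.
If there is an actual file in the directory I am trying to hash and it cannot be opened, skipping that file and continuing is not a good solution, because it affects the resulting hash. It is better to abort the hashing attempt altogether. In that case, the try statement is wrapped around the call to the hash_directory() function.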

>>> try:
...   print(hash_directory('/tmp'))
... except:
...   print('Failed!')
... 
e2a075b113239c8a25c7e1e43f21e8f2f6762094
>>>
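
In the same fail-fast spirit, note that os.walk() ignores errors raised while listing directories unless an onerror callback is supplied, so an unreadable or missing directory would otherwise be skipped silently. A minimal sketch, assuming you want any such error to abort the hash; hash_directory_strict, _reraise, and try_hash are names made up here and are not part of the answer above.

```python
import hashlib
import os


def _reraise(error):
    # os.walk() passes the OSError instance to this callback; raise it to abort.
    raise error


def hash_directory_strict(path):
    digest = hashlib.sha1()
    for root, dirs, files in os.walk(path, onerror=_reraise):
        for name in files:
            file_path = os.path.join(root, name)
            # Mix in the path relative to the top directory, then the contents.
            digest.update(os.path.relpath(file_path, path).encode())
            with open(file_path, 'rb') as f_obj:  # an open() failure propagates too
                for buf in iter(lambda: f_obj.read(1024 * 1024), b''):
                    digest.update(buf)
    return digest.hexdigest()


def try_hash(path):
    """Return the hex digest, or None if anything could not be read."""
    try:
        return hash_directory_strict(path)
    except OSError:
        return None
```

Catching OSError rather than using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being reported as a failed hash.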

kse8i1jr 5#

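I keep seeing this code propagated through various forums.
The ActiveState recipe answer works but, as Antonio pointed out, it is not guaranteed to be repeatable across filesystems, because it does not present the files in the same order (try it). One fix is to change

for root, dirs, files in os.walk(directory):
  for names in files:

to

for root, dirs, files in os.walk(directory):
  for names in sorted(files):

(Yes, I am being lazy here. This sorts only the file names, not the directories. The same principle applies.)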
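
To cover directories as well, one option is to sort the dirs list in place, since os.walk() in its default top-down mode descends into subdirectories in the order that list holds. A sketch under that assumption; walk_sorted is a hypothetical helper name, not something from the recipe.

```python
import os


def walk_sorted(directory):
    """Yield file paths under directory in a stable, sorted order."""
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # in-place sort controls the order in which os.walk descends
        for name in sorted(files):
            yield os.path.join(root, name)
```

The hashing loop can then iterate over walk_sorted(directory) instead of the two nested for loops.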

vbkedwbf 6#

Use checksumdir: https://pypi.org/project/checksumdir/

from checksumdir import dirhash

directory = '/path/to/directory/'
md5hash = dirhash(directory, 'md5')

xtfmy6hx 7#

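I have optimized Andy's answer further.
Below is a Python 3 (rather than Python 2) implementation. It uses SHA1, handles some cases where encoding is needed, is linted, and includes docstrings.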

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""dir_hash: Return SHA1 hash of a directory.
- Copyright (c) 2009 Stephen Akiki, 2018 Joe Flack
- MIT License (http://www.opensource.org/licenses/mit-license.php)
- http://akiscode.com/articles/sha-1directoryhash.shtml
"""
import hashlib
import os

def update_hash(running_hash, filepath, encoding=''):
    """Update running SHA1 hash, factoring in hash of given file.

    Side Effects:
        running_hash.update()
    """
    if encoding:
        file = open(filepath, 'r', encoding=encoding)
        for line in file:
            hashed_line = hashlib.sha1(line.encode(encoding))
            hex_digest = hashed_line.hexdigest().encode(encoding)
            running_hash.update(hex_digest)
        file.close()
    else:
        with open(filepath, 'rb') as file:
            while True:
                # Read the file in small chunks.
                buffer = file.read(4096)
                if not buffer:
                    break
                # hexdigest() returns str, but update() needs bytes.
                running_hash.update(
                    hashlib.sha1(buffer).hexdigest().encode('ascii'))

def dir_hash(directory, verbose=False):
    """Return SHA1 hash of a directory.

    Args:
        directory (string): Path to a directory.
        verbose (bool): If True, prints progress updates.

    Raises:
        FileNotFoundError: If directory provided does not exist.

    Returns:
        string: SHA1 hash hexdigest of a directory.
    """
    sha_hash = hashlib.sha1()

    if not os.path.exists(directory):
        raise FileNotFoundError

    for root, dirs, files in os.walk(directory):
        for names in files:
            if verbose:
                print('Hashing', names)
            filepath = os.path.join(root, names)
            try:
                update_hash(running_hash=sha_hash,
                            filepath=filepath)
            except TypeError:
                update_hash(running_hash=sha_hash,
                            filepath=filepath,
                            encoding='utf-8')

    return sha_hash.hexdigest()
