PDF file dedupe issue: same content, but generated from DOCX at different time periods

Posted by zzlelutf on 2022-10-22 in Python

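I am working on a PDF deduplication project and have looked at many Python libraries that read each file, compute a hash of its bytes, and compare that hash with the next file's to detect duplicates, similar to the logic below or to Python's `filecmp` module. The problem I found with these approaches is that when PDFs are generated from a source DOCX (saved as PDF), the outputs are not treated as duplicates, even though their content is exactly the same. Why does this happen? Is there another approach that reads the content and then creates a unique hash based on the actual content?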

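import hashlib

def calculate_hash_val(path, blocks=65536):
    """Return the MD5 hex digest of the file's raw bytes."""
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        # Read in chunks of `blocks` bytes so large files are not loaded into memory at once.
        data = file.read(blocks)
        while data:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()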
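
A likely reason the byte-level hash differs: each time a DOCX is saved as PDF, the exporter typically writes fresh metadata into the file (creation and modification timestamps, and usually a new document ID), so two exports of the same document are not byte-identical even when the visible content matches. The sketch below hashes the extracted text instead of the raw bytes. It assumes the third-party `pypdf` package is installed; the helper names (`normalize_text`, `hash_text`, `extract_pdf_text`, `content_hash`) are hypothetical and chosen only for illustration.

```python
import hashlib
import re
import unicodedata


def normalize_text(text):
    """Normalize Unicode and collapse whitespace so layout noise does not change the hash."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def hash_text(text):
    """Return the SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


def extract_pdf_text(path):
    """Concatenate the text of every page (assumption: pypdf is available)."""
    from pypdf import PdfReader  # imported lazily; only needed for PDF input

    reader = PdfReader(path)
    # extract_text() can yield an empty result for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def content_hash(path):
    """Hash a PDF by its extracted text rather than by its raw bytes."""
    return hash_text(extract_pdf_text(path))
```

Two files would then be treated as duplicates when `content_hash(path_a) == content_hash(path_b)`. Note that this compares text only: PDFs that differ solely in images, fonts, or layout would hash the same, and scanned PDFs with no text layer would all produce the hash of an empty string, so a stricter check may be needed for those cases.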

python

Source: https://stackoverflow.com/questions/74159812/pdf-file-dedupe-issue-with-same-content-but-generated-at-different-time-periods