Hash a CSV column and output it as Base64

Asked by jqjz2hbq on 2023-02-17

I'm still getting to grips with Python, but my goal is to read a CSV file, hash a specific column with SHA-256, and then output the hash as Base64.
An example of the conversion I need can be produced with the calculator at https://www.liavaag.org/English/SHA-Generator/.
This is the code I currently have:

import hashlib
import csv
import base64

with open('File1.csv') as csvfile:

    with open('File2.csv', 'w') as newfile:

        reader = csv.DictReader(csvfile)

        for i, r in enumerate(reader):
            #  writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')

            # hashing the 'CardNumber' column
            r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
            
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')

The error I receive is:

r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
TypeError: Strings must be encoded before hashing
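For context, hashlib.sha256() only accepts bytes-like objects, not str, which is exactly what the TypeError is saying; a minimal sketch of the failure and the fix:

```python
import hashlib

# Passing a str raises the same TypeError as in the question.
try:
    hashlib.sha256("1234567890")
except TypeError as e:
    print(e)  # Strings must be encoded before hashing

# Encoding to bytes first makes it work; SHA-256 digests are 32 bytes.
digest = hashlib.sha256("1234567890".encode("utf-8")).digest()
print(len(digest))  # 32
```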
Answer 1 (by JonSG)

You have the right idea; it just helps to take it one step at a time before putting it all together, so you can see how the pieces fit:

import hashlib
import base64

text = "1234567890"
encoded = text.encode('utf-8')
encoded = hashlib.sha256(encoded).digest()
encoded = base64.b64encode(encoded)
print(text, str(encoded, encoding="utf-8"))

That should give you:

1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=

As a "one-liner":

r['consumer_id'] = str(base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8')).digest()), encoding="utf-8")

As you can see, your current attempt is close; you just have some parentheses to sort out.
If you want to use this in a loop, for example while iterating over a list of words or the rows of a CSV, you could do:

import hashlib
import base64

def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

words = "1234567890 Hello World".split()
for word in words:
    print(word, encode_text(word))

which gives you:

1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=
Hello GF+NsyJx/iX1Yab8k4suJkMG7DBO2lGAB9F2SCY4GWk=
World eK5kfcVUTSJxMKBoKlHjC8d3f7ttio8XAHRjo+zR1SQ=

Assuming the rest of your code works as you want, then:

import hashlib
import csv
import base64

def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

with open('File1.csv') as csvfile:

    with open('File2.csv', 'w') as newfile:

        reader = csv.DictReader(csvfile)

        for i, r in enumerate(reader):
            #  writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')

            # hashing the 'CardNumber' column
            r['consumer_id'] = encode_text(r['consumer_id'])
            
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')
Answer 2

In addition to JonSG's answer about getting the hashing and encoding right, I'd like to comment on how you're reading your CSV file.
It took me a minute to understand how you were dealing with the CSV header separately from the body:

with open("File1.csv") as csvfile:
    with open("File2.csv", "w") as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            print(i, r)
            if i == 0:
                newfile.write(",".join(r) + "\n")  # writing csv headers
            newfile.write(",".join(r.values()) + "\n")

At first, I didn't see that calling join() on a dict only gives you the keys, and that you then go on to join the values separately. Clever!
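To illustrate that point with a made-up row (not from the question's data): iterating a dict yields its keys, while .values() yields the values:

```python
# Iterating a dict yields its keys; .values() yields the values.
row = {"ID": "1234", "Phone": "123-456-7890"}
print(",".join(row))           # ID,Phone
print(",".join(row.values()))  # 1234,123-456-7890
```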
Still, I think it would be clearer and easier to use the complementary DictWriter.
For clarity, I'll separate the reading, processing, and writing:

with open("File1.csv", newline="") as f_in:
    reader = csv.DictReader(f_in, skipinitialspace=True)
    rows = list(reader)

for row in rows:
    row["ID"] = encode_text(row["ID"])
    print(row)

with open("File2.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=rows[0])
    writer.writeheader()
    writer.writerows(rows)

In your version, you create your own writer and need to give it the field names. Here, I just pass in the first row, and the DictWriter() constructor uses that dict's keys to establish the header values. You do need to explicitly call the writeheader() method before you can write the (processed) rows.
I started with this File1.csv:

ID, Phone, Email
1234680000000000, 123-456-7890, johnsmith@test.com

and ended up with this File2.csv:

ID,Phone,Email
tO2Knao73NzQP/rnBR5t8Hsm/XIQVnsrPKQlsXmpkb8=,123-456-7890,johnsmith@test.com

This organization means all the rows are read into memory first. You mentioned having "thousands of entries", but for data with these three fields that's only a few hundred KB of RAM, maybe a MB.
If you really do want to "stream" the data, you'd need something like:

reader = csv.DictReader(f_in, skipinitialspace=True)
writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)

writer.writeheader()

for row in reader:
    row["ID"] = encode_text(row["ID"])
    writer.writerow(row)

In this case, I pass reader.fieldnames to the fieldnames= parameter of the DictWriter constructor.
When dealing with multiple files, I open and close them myself, because multiple with open(...) as x blocks can look cluttered to me:

f_in = open("File1.csv", newline="")
f_out = open("File2.csv", "w", newline="")

...

f_in.close()
f_out.close()

For simple utility scripts like these, I don't see much real benefit in the context manager: if the program fails, the files will be closed automatically anyway as the process exits.
That said, the conventional wisdom is to use the with open(...) as x context managers like you did. You can nest them as you did, separate them with a comma, or, if you have Python 3.10+, use grouping parentheses for an even cleaner look (also covered in another Q&A).
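The Python 3.10+ grouping-parentheses form mentioned above can be sketched like this (the sketch writes its own tiny File1.csv first so it is self-contained):

```python
import csv

# Create a small input file so this sketch runs on its own.
with open("File1.csv", "w", newline="") as f:
    f.write("ID, Phone\n1234, 123-456-7890\n")

# Python 3.10+: parentheses group multiple context managers in one
# `with` statement, avoiding deep nesting and long single lines.
with (
    open("File1.csv", newline="") as f_in,
    open("File2.csv", "w", newline="") as f_out,
):
    reader = csv.DictReader(f_in, skipinitialspace=True)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(reader)
```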
