Python: replace and += are abysmally slow

Posted by c9qzyr3d on 2023-01-16 in Python


For a translation project, I have written the following code to turn some byte arrays into "readable" text.

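from pathlib import Path
from string import printable  # assumed origin of `printable`; the original snippet omits its imports

with open(Path(cur_file), mode="rb") as file:
    contents = file.read()  # the with block closes the file on exit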

text = ""
for i in range(0, len(contents), 2): # Since it's encoded in UTF16 or similar, there should always be pairs of 2 bytes
    byte = contents[i]
    byte_2 = contents[i+1]
    if byte == 0x00 and byte_2 == 0x00:
        text+="[0x00 0x00]"
    elif byte != 0x00 and byte_2 == 0x00:
        #print("Normal byte")
        if chr(byte) in printable:
            text+=chr(byte)
        elif byte == 0x00:
            pass
        else:
            text+="[" + "0x{:02x}".format(byte) + "]"
    else:
        #print("Special byte")
        text+="[" + "0x{:02x}".format(byte) + " " + "0x{:02x}".format(byte_2) + "]"
# Some dirty replaces - Probably slow but what do I know - It works
text = text.replace("[0x0e]n[0x01]","[USERNAME_1]") # Your name
text = text.replace("[0x0e]n[0x03]","[USERNAME_3]") # Your name
text = text.replace("[0x0e]n[0x08]","[TOWNNAME_8]") # Town name
text = text.replace("[0x0e]n[0x09]","[TOWNNAME_9]") # Town name
text = text.replace("[0x0e]n[0x0a]","[CHARNAME_A]") # Character name

text = text.replace("[0x0a]","[ENTER]") # Generic enter

lang_dict[emsbt_key_name] = text

While this code works and produces output like:

Cancel[0x00 0x00]

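and more complex strings, I have run into a performance problem when looping over 60,000 files.
I have read some questions about += and large strings, where people say that join is better suited to large strings. However, even with strings of fewer than 1,000 characters, a single file takes about 5 seconds to store, which is a long time.
It almost feels as if it starts fast and then gets slower and slower.
Is there a way to optimize this code? I also feel it is rather ugly.
Any help or pointers are much appreciated.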
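The join approach mentioned above can be sketched as follows. This is only an illustration, not the code the project ended up with: the function name `decode_pairs` is hypothetical, `printable` is assumed to come from the `string` module, and, unlike the original loop, a trailing odd byte is ignored rather than raising an IndexError.

```python
from string import printable


def decode_pairs(contents: bytes) -> str:
    """Decode 2-byte pairs into bracketed text, collecting pieces in a list."""
    parts = []
    # Step through complete pairs only, so a trailing odd byte is ignored.
    for i in range(0, len(contents) - 1, 2):
        byte, byte_2 = contents[i], contents[i + 1]
        if byte == 0x00 and byte_2 == 0x00:
            parts.append("[0x00 0x00]")
        elif byte_2 == 0x00:
            # "Normal" byte: keep printable characters, bracket the rest.
            ch = chr(byte)
            parts.append(ch if ch in printable else "[0x{:02x}]".format(byte))
        else:
            # "Special" pair: show both bytes in hex.
            parts.append("[0x{:02x} 0x{:02x}]".format(byte, byte_2))
    # Build the final string once instead of growing it with +=.
    return "".join(parts)


print(decode_pairs(b"C\x00a\x00n\x00\x00\x00"))
```

The replace calls from the original snippet could then be applied to the returned string unchanged.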
Edit: added cProfile output:

261207623 function calls (261180607 primitive calls) in 95.364 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    284/1    0.002    0.000   95.365   95.365 {built-in method builtins.exec}
        1    0.000    0.000   95.365   95.365 start.py:1(<module>)
        1    0.610    0.610   94.917   94.917 emsbt_to_json.py:21(to_json)
    11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}
 62501129   49.127    0.000   74.146    0.000 pathlib.py:578(__eq__)
125048857   18.401    0.000   18.863    0.000 pathlib.py:569(_cparts)
 63734640    6.822    0.000    6.828    0.000 {built-in method builtins.isinstance}
   160958    0.183    0.000    4.170    0.000 pathlib.py:504(_from_parts)
   160958    0.713    0.000    3.942    0.000 pathlib.py:484(_parse_args)
    68959    0.110    0.000    3.769    0.000 pathlib.py:971(absolute)
   160959    1.600    0.000    2.924    0.000 pathlib.py:56(parse_parts)
    91999    0.081    0.000    1.624    0.000 pathlib.py:868(__new__)
    68960    0.028    0.000    1.547    0.000 pathlib.py:956(rglob)
    68960    0.090    0.000    1.518    0.000 pathlib.py:402(_select_from)
    68959    0.067    0.000    1.015    0.000 pathlib.py:902(cwd)
       37    0.001    0.000    0.831    0.022 __init__.py:1(<module>)
   937462    0.766    0.000    0.798    0.000 pathlib.py:147(splitroot)
    11810    0.745    0.000    0.745    0.000 {method '__exit__' of '_io._IOBase' objects}
   137918    0.143    0.000    0.658    0.000 pathlib.py:583(__hash__)
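A listing like the one above can be produced with the standard library's cProfile and pstats modules. The sketch below is a generic illustration, not the asker's actual invocation; `slow_lookup` is a hypothetical workload that performs one linear list.index search per item.

```python
import cProfile
import io
import pstats


def slow_lookup(n: int) -> int:
    """Hypothetical workload: one linear list.index search per item."""
    items = list(range(n))
    return sum(items.index(x) for x in items)


profiler = cProfile.Profile()
profiler.enable()
result = slow_lookup(2000)
profiler.disable()

# Sort by cumulative time, as in the listing above, and keep the top rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the run, including time spent in their callees, at the top of the report.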

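Edit: after further inspection with line_profiler, it turns out the culprit is not even in the code above. It is in code well beyond it, where I search the index to see whether there is a +1 file (looking ahead in the index). That apparently consumes a great deal of CPU time.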
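The profile shows 11,179 calls to list.index and about 62.5 million calls to pathlib's `__eq__`, roughly 5,590 comparisons per lookup. That would be consistent with a linear scan through a list of about 11,000 paths on every iteration. The sketch below reproduces the effect on a small scale; `CountingKey` is a hypothetical stand-in for a path object, not part of the original code.

```python
class CountingKey:
    """Hypothetical stand-in for a path object that counts == comparisons."""

    comparisons = 0

    def __init__(self, value: int) -> None:
        self.value = value

    def __eq__(self, other: object) -> bool:
        CountingKey.comparisons += 1
        return isinstance(other, CountingKey) and self.value == other.value


n = 100
entries = [CountingKey(i) for i in range(n)]

# Look up a fresh, equal object each time, so every probe calls __eq__.
for i in range(n):
    position = entries.index(CountingKey(i))

# The total grows quadratically: n * (n + 1) // 2 comparisons for n lookups.
print(CountingKey.comparisons)
```

Each lookup is cheap on its own, but repeating it once per list entry makes the total cost grow with the square of the list length.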


Source: https://stackoverflow.com/questions/75010637/replace-and-is-abismally-slow

3 answers


sxissh06 #1

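In case it gives you a direction to look in: if I were in your situation, I would time two things separately over, say, 100 files:

How long it takes to run only the for loop.
How long it takes to do only the six replaces.

If either one accounts for most of the total time, I would look for a solution for that part alone. For raw replacements, there is dedicated software designed for large-scale replacement. I hope this helps in some way.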
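The two measurements suggested above could be taken with time.perf_counter, as in this rough sketch. `build_text` and `apply_replaces` are hypothetical stand-ins for the asker's loop and replace chain, and the "[NUL]" label is invented purely for illustration.

```python
import time


def build_text(n: int) -> str:
    """Hypothetical stand-in for the decoding loop."""
    return "".join("[0x{:02x}]".format(i % 256) for i in range(n))


def apply_replaces(text: str) -> str:
    """Hypothetical stand-in for the chain of str.replace calls."""
    for old, new in (("[0x0a]", "[ENTER]"), ("[0x00]", "[NUL]")):
        text = text.replace(old, new)
    return text


# Time the loop on its own.
start = time.perf_counter()
raw = build_text(100_000)
loop_seconds = time.perf_counter() - start

# Time the replacements on their own.
start = time.perf_counter()
cleaned = apply_replaces(raw)
replace_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, replaces: {replace_seconds:.4f}s")
```

If neither number comes close to the per-file time observed, the bottleneck is likely somewhere else in the program.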

2023-01-16

vh0rcniy #2

You can replace += and + with .format, as follows:

text = ""
for i in range(10):
    text += "[" + "{}".format(i) + "]"
print(text)  # [0][1][2][3][4][5][6][7][8][9]

which is equivalent to

text = ""
for i in range(10):
    text = "{}[{}]".format(text, i)
print(text)  # [0][1][2][3][4][5][6][7][8][9]

Note that other string-formatting methods can be used in the same way; I chose .format because you are already using it.
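Whether either variant is actually faster is best measured rather than assumed. The sketch below checks that the two forms above and a join-based version produce the same string, then times them with timeit. The function names are hypothetical, and the timings will vary by machine and interpreter.

```python
import timeit


def build_plus(n: int) -> str:
    """Grow the string with += and +."""
    text = ""
    for i in range(n):
        text += "[" + str(i) + "]"
    return text


def build_format(n: int) -> str:
    """Rebuild the whole string with .format on every iteration."""
    text = ""
    for i in range(n):
        text = "{}[{}]".format(text, i)
    return text


def build_join(n: int) -> str:
    """Collect the pieces and join them once at the end."""
    return "".join("[{}]".format(i) for i in range(n))


# All three produce the same result.
assert build_plus(10) == build_format(10) == build_join(10)

for func in (build_plus, build_format, build_join):
    seconds = timeit.timeit(lambda: func(2000), number=20)
    print(f"{func.__name__}: {seconds:.4f}s")
```

Note that the .format version still copies the accumulated text into a new string on every iteration.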

2023-01-16

wsxa1bj1 #3

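It turns out that, before this code runs, I was using the index method plus 1 (to see whether the path changes) to look up an entry in a list on every iteration, and that really hurt performance.
In the cProfile output we can see it clearly: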

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}

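Not .replace! The culprit was not even in my question.
What really made me understand what this call was (beyond the fact that it was somehow called index) was another profiler:
I believe that's what Robert Kern's line_profiler is intended for.
Source: https://stackoverflow.com/a/3927671/3525780
It showed me clearly, line by line, which code consumed how much CPU time, far more clearly than cProfile.
Once I had found it, I replaced it with: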

for ind, cur_file in enumerate(to_write):
    # Look ahead by position instead of searching the list with index().
    next_file = None
    if ind < len(to_write) - 1:
        next_file = to_write[ind + 1]

Without the actual code this answer may not make much sense, but I am leaving it here anyway.
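As an alternative to the enumerate-based look-ahead above, each entry can be paired with its successor using itertools.zip_longest. This is a sketch of a different technique, not the author's code; the file names are hypothetical, and it assumes the list itself contains no None entries.

```python
from itertools import zip_longest

to_write = ["a.emsbt", "b.emsbt", "c.emsbt"]  # hypothetical file names

pairs = []
# Pair each entry with its successor; the last entry is paired with None.
for cur_file, next_file in zip_longest(to_write, to_write[1:]):
    pairs.append((cur_file, next_file))

print(pairs)
```

Like the enumerate version, this walks the list once and never searches it, so the cost stays linear in the number of files.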

2023-01-16
