pandas: loop gets slower after each iteration

Posted by epfja78i on 2023-06-04 in Other


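I have a Python script that does the following:
1. I have a list of JSON documents.
2. I create an empty pandas DataFrame.
3. I run a for loop over this list.
4. On each iteration, I create an empty dictionary with the (same) keys I am interested in.
5. On each iteration, I parse the JSON to retrieve the values for those keys.
6. On each iteration, I append the dictionary to the pandas DataFrame.
The problem is that the processing time increases with every iteration. Specifically: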

0-1000 documents -> 5 seconds
1000-2000 documents -> 6 seconds
2000-3000 documents -> 7 seconds
...
10000-11000 documents -> 18 seconds
11000-12000 documents -> 19 seconds
...
22000-23000 documents -> 39 seconds
23000-24000 documents -> 42 seconds
...
34000-35000 documents -> 69 seconds
35000-36000 documents -> 72 seconds

Why does this happen?
My code looks like this:

# 'documents' is the list of jsons

columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']

df_documents = pd.DataFrame(columns=columns)

for index, document in enumerate(documents):

    dict_document = dict.fromkeys(columns)

    # ... parse the JSON, retrieve the values for the keys,
    # and assign them to dict_document ...

    df_documents = df_documents.append(dict_document, ignore_index=True)

P.S.

After applying @eumiro's suggestion, the timings are as follows:

0-1000 documents -> 0.06 seconds
1000-2000 documents -> 0.05 seconds
2000-3000 documents -> 0.05 seconds
...
10000-11000 documents -> 0.05 seconds
11000-12000 documents -> 0.05 seconds
...
22000-23000 documents -> 0.05 seconds
23000-24000 documents -> 0.05 seconds
...
34000-35000 documents -> 0.05 seconds
35000-36000 documents -> 0.05 seconds

After applying @DariuszKrynicki's suggestion, the timings are as follows:

0-1000 documents -> 0.56 seconds
1000-2000 documents -> 0.54 seconds
2000-3000 documents -> 0.53 seconds
...
10000-11000 documents -> 0.51 seconds
11000-12000 documents -> 0.51 seconds
...
22000-23000 documents -> 0.51 seconds
23000-24000 documents -> 0.51 seconds
...
34000-35000 documents -> 0.51 seconds
35000-36000 documents -> 0.51 seconds
...

pandas

Source: https://stackoverflow.com/questions/57757580/loop-gets-slower-after-each-iteration

4 Answers


vh0rcniy1#

Yes, append gets slower with every new row, because it has to copy the entire (growing) contents again and again.
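To make the copying cost concrete, here is a small sketch of a simplified cost model. It is an assumption for illustration, not a measurement of pandas internals: it supposes that each append copies every row already stored. Under that assumption the total work grows quadratically, which is consistent with the per-batch times in the question rising steadily. The function names are hypothetical.

```python
def rows_copied_by_repeated_append(n_rows: int) -> int:
    """Count rows copied if every append first copies all existing rows."""
    copied = 0
    for existing_rows in range(n_rows):
        # The append that adds row number `existing_rows` copies the rows already stored
        copied += existing_rows
    return copied


def rows_copied_by_single_build(n_rows: int) -> int:
    """Count rows copied if the table is built once from a finished list."""
    return n_rows


for n in (1_000, 10_000, 36_000):
    print(n, rows_copied_by_repeated_append(n), rows_copied_by_single_build(n))
```

In this model, appending n rows one at a time copies n(n-1)/2 rows in total, while building the table once copies only n.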
Create a plain list, append to it, and then build the DataFrame in a single step:

records = []

for index, document in enumerate(documents):
    …
    records.append(dict_document)

df_documents = pd.DataFrame.from_records(records)
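For reference, a self-contained version of the same idea might look like the sketch below. The sample `documents`, the three column names, and the key-copying parse step are all hypothetical stand-ins for the parts the question leaves out; only the pattern (collect dictionaries in a list, build the DataFrame once) comes from the answer above.

```python
import json

import pandas as pd

columns = ["column_1", "column_2", "column_3"]  # hypothetical column names

# Hypothetical sample input: a list of JSON strings
documents = [
    json.dumps({"column_1": i, "column_2": str(i), "extra": True})
    for i in range(5)
]

records = []
for document in documents:
    parsed = json.loads(document)
    # Start with every key of interest set to None, then fill in what is present
    dict_document = dict.fromkeys(columns)
    for key in columns:
        if key in parsed:
            dict_document[key] = parsed[key]
    records.append(dict_document)  # cheap list append, no DataFrame copy

# Build the DataFrame once, after the loop
df_documents = pd.DataFrame.from_records(records, columns=columns)
print(df_documents.shape)
```

Passing `columns` to `from_records` keeps the column order fixed even when some keys are missing from every document.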


h7appiyu2#

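The answer probably lies in the pandas.DataFrame.append method, which you call on every iteration. It is very inefficient because it has to allocate new memory each time, i.e. copy the previous result, and that would explain your timings. See also the official pandas.DataFrame.append docs:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
There are two examples:
Less efficient: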

>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
...     df = df.append({'A': i}, ignore_index=True)
>>> df
   A
0  0
1  1
2  2
3  3
4  4

More efficient:

>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4

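You can apply the same strategy: instead of appending to the same DataFrame on every iteration, build a list of DataFrames and call concat once after the for loop has finished.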
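As a minimal sketch of that strategy in loop form (the column names `A` and `B` and the row contents are made up for illustration), each iteration only appends a small DataFrame to a Python list, and the one concatenation happens after the loop. Note also that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions `pd.concat` is the available route.

```python
import pandas as pd

frames = []
for i in range(5):
    # One small single-row DataFrame per iteration; nothing is concatenated yet
    frames.append(pd.DataFrame([{"A": i, "B": i * i}]))

# A single concatenation after the loop
df = pd.concat(frames, ignore_index=True)
print(df)
```

If each iteration produces only a dictionary, collecting the dictionaries and building one DataFrame at the end avoids creating many tiny DataFrames in the first place.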


pxy2qtax3#

I suspect your DataFrame grows with every iteration. How about using an iterator?

import pandas as pd

# documents = ...  # list of JSON documents
def get_df_from_json(document):
    columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']
    # parse the JSON, retrieve the values of the keys, and assign them to the dictionary
    # dict_document =  # use document to parse it and create dictionary
    return pd.DataFrame(list(dict_document.values()), index=dict_document)

# iterate over the documents directly; enumerate() would yield (index, document) tuples
res = (get_df_from_json(document) for document in documents)
res = pd.concat(res).reset_index()

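Edit: I ran a quick comparison on the example below, and it turns out that using an iterator (generator expression) is not faster than using a list comprehension: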

import json
import time

import pandas as pd

def get_df_from_json():
    dd = {'a': [1, 1], 'b': [2, 2]}
    app_json = json.dumps(dd)
    return pd.DataFrame(list(dd.values()), index=dd)

start = time.time()
res = pd.concat((get_df_from_json() for x in range(1,20000))).reset_index()
print(time.time() - start)

start = time.time()
res = pd.concat([get_df_from_json() for x in range(1,20000)]).reset_index()
print(time.time() - start)

Iterator: 9.425999879837036
List comprehension: 8.934999942779541


qc6wkl3g4#

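This will probably get deleted by the good people at Stack Overflow, but every time I see a question about "why is my loop getting slower", nobody actually gives an answer. Yes, you can always speed things up by writing different code, using lists instead of DataFrames and so on, but in my experience it still slows down, even when there is no object whose size you can see growing. I have not been able to find an answer. I find myself resetting variables every x iterations so that long-running jobs finish faster.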
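One way to check whether a particular loop really slows down is to time fixed-size batches, as the question does. The sketch below is only an illustration: `time_batches` is a hypothetical helper, and the workload (appending small dictionaries to a list) is a stand-in for whatever the real loop body does.

```python
import time


def time_batches(items, process, batch_size=1000):
    """Return a list of (start_index, seconds) pairs, one per batch of items."""
    timings = []
    for start in range(0, len(items), batch_size):
        t0 = time.perf_counter()
        for item in items[start:start + batch_size]:
            process(item)
        timings.append((start, time.perf_counter() - t0))
    return timings


records = []
timings = time_batches(list(range(5000)), lambda x: records.append({"value": x}))
for start, seconds in timings:
    print(f"{start}-{start + 1000} items -> {seconds:.4f} seconds")
```

If the per-batch time stays flat for a list-based loop but climbs for the original one, the growing object is the likely cause.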
