How to use NumPy or Pandas to speed up an operation

Posted by kgsdhlau on 2023-10-19 in Other


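I have code like the following, in which an object refers to one or more other objects, each of which has scores associated with it. The pure-Python code is very slow and I would like to speed it up. I assume this is a very common problem, so I do not want to do it the "wrong" way. What is this kind of problem called, so that I can search for examples of efficient implementations?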

import random
import secrets
import time
import math

# generate fake data

NUM_WORDS = 10000
NUM_SENTENCES = 10000
MAX_SENTENCE_LENGTH = 25

# these will be updated between each sampling
word_gain = [math.exp(4*random.random()) for i in range(NUM_WORDS)]
word_cost = [random.random() for i in range(NUM_WORDS)]
sentences = {}

# these will remain the same through each iteration
for i in range(NUM_SENTENCES):
    sentences[i] = [secrets.randbelow(NUM_WORDS) for i in range(3 + secrets.randbelow(MAX_SENTENCE_LENGTH - 3))]

start_time = time.time()

MAX_COST = 6.0

for j in range(50):

    best_index, best_value = -1, 0.0
    for i in range(NUM_SENTENCES):
        cur_cost, cur_gain = 0.0, 0.0

        for j in sentences[i]:
            cur_cost += word_cost[j]
            cur_gain += word_gain[j]

        if cur_cost <= MAX_COST and cur_gain > best_value:
            best_index, best_value = i, cur_gain
    assert best_value > 0.0
    print(f"Best {best_index} {best_value}")

    # adjust scores based on find
    for j in sentences[best_index]:
        word_gain[j] = -0.01
        word_cost[j] /= 2

end_time = time.time()

print(f"Elapsed time {end_time - start_time}")

numpy

Source: https://stackoverflow.com/questions/77010992/how-to-use-numpy-or-pandas-to-speed-up-operation

1 Answer


9jyewag01#

First, you need to convert word_cost, word_gain, and sentences[i] (for every i) to NumPy arrays:

import numpy as np

# Use np.float32 if you do not care about having a high precision
# so it can be faster
word_cost_np = np.array(word_cost, dtype=np.float64)
word_gain_np = np.array(word_gain, dtype=np.float64)
# One index array per sentence; the dict keys are kept as they are
sentences_np = {k: np.array(sentences[k], dtype=np.int32) for k in sentences}

You can then vectorize the inner loop as follows:

cur_cost = word_cost_np[sentences_np[i]].sum()
cur_gain = word_gain_np[sentences_np[i]].sum()
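To make the equivalence concrete, here is a minimal sketch that compares the question's inner loop with the indexed sum. The small arrays are made-up toy values, not the question's random data.

```python
import numpy as np

# Hypothetical toy data standing in for the question's lists
word_cost_np = np.array([0.5, 0.25, 1.0, 2.0], dtype=np.float64)
word_gain_np = np.array([3.0, 1.0, 4.0, 1.5], dtype=np.float64)
sentence = np.array([0, 2, 3], dtype=np.int32)  # word indices of one sentence

# Loop version, as in the question
loop_cost, loop_gain = 0.0, 0.0
for j in sentence:
    loop_cost += word_cost_np[j]
    loop_gain += word_gain_np[j]

# Vectorized version: gather the values with an index array, then sum
vec_cost = word_cost_np[sentence].sum()
vec_gain = word_gain_np[sentence].sum()

print(loop_cost, vec_cost)
print(loop_gain, vec_gain)
```

Indexing with an integer array gathers one value per index, including repeated indices, so the sum matches the loop even when a sentence contains the same word twice.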

The second inner loop can be vectorized with the same strategy:

word_gain_np[sentences_np[best_index]] = -0.01
word_cost_np[sentences_np[best_index]] /= 2

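Note that sentences[best_index] must not contain duplicate items, because NumPy applies an indexed in-place operation only once per distinct index: with duplicates, the vectorized /= 2 halves a repeated word once, whereas the original loop halves it once per occurrence. The bad news is that there actually are duplicates. You can list the affected sentences with the following command: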

[np.sort(sentences_np[k]) for k in sentences_np if np.unique(sentences_np[k]).size != sentences_np[k].size]
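The effect of duplicates can be seen on a tiny made-up example. The sketch below also shows np.divide.at, the unbuffered ufunc.at form, as one possible way to reproduce the loop's behaviour; it is not part of the original answer and is typically slower than plain fancy indexing.

```python
import numpy as np

idx = np.array([0, 1, 1])  # word 1 appears twice in this hypothetical sentence

cost_loop = np.array([8.0, 8.0, 8.0])
cost_fancy = cost_loop.copy()
cost_at = cost_loop.copy()

# Original loop: word 1 is halved twice
for j in idx:
    cost_loop[j] /= 2

# Fancy-indexed in-place division: each distinct index is updated once
cost_fancy[idx] /= 2

# Unbuffered variant: applied once per occurrence, like the loop
np.divide.at(cost_at, idx, 2)

print(cost_loop)   # [4. 2. 8.]
print(cost_fancy)  # [4. 4. 8.]
print(cost_at)     # [4. 2. 8.]
```

The assignment word_gain_np[...] = -0.01 is not affected, since writing the same constant twice gives the same result.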

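Also note that the NumPy version is actually slower (at least on my machine). This is because NumPy introduces a significant per-call overhead that, for small arrays, outweighs the time spent on the computation itself.
In general, the idea is to vectorize the outer loop so that the operations run on much larger arrays. However, this is not possible here because of the dictionary. NumPy also does not support variable-sized (ragged) arrays. It is possible to flatten the dict into one big array and encode the mapping, but this is not a common way to use NumPy.
Finally, I do not think NumPy alone is well suited to this task. Numba can solve the issues above. However, dicts are not easy to use in Numba code. Also note that Pandas has the same problem with variable-sized arrays (which is not surprising, since Pandas is built on top of NumPy). Awkward (Awkward Array) might solve this, since it supports variable-sized arrays. Even so, duplicates remain an issue with Awkward. Combining Numba with Awkward is possible, but it is not simple for a task like this.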
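As a rough sketch of the flattening idea, the sentences can be concatenated into one index array, with a parallel array recording which sentence each entry belongs to. Per-sentence sums then become a single np.bincount call with weights. The toy data and the names flat_words, sentence_id and best_sentence are assumptions made for illustration, not part of the original answer.

```python
import numpy as np

# Hypothetical toy data in the same shape as the question's structures
sentences = {0: [0, 1, 2], 1: [2, 3, 3], 2: [1, 4, 0, 2]}
word_cost = np.array([1.0, 2.0, 0.5, 4.0, 0.25])
word_gain = np.array([3.0, 1.0, 2.0, 5.0, 0.5])
MAX_COST = 6.0

# Built once: the sentences do not change between iterations
keys = list(sentences)
lengths = np.array([len(sentences[k]) for k in keys])
flat_words = np.concatenate([np.asarray(sentences[k], dtype=np.int64) for k in keys])
# sentence_id[p] is the position in `keys` of the sentence owning flat_words[p]
sentence_id = np.repeat(np.arange(len(keys)), lengths)

def best_sentence(word_cost, word_gain):
    """Return (key, gain) of the feasible sentence with the largest gain."""
    n = len(keys)
    cost = np.bincount(sentence_id, weights=word_cost[flat_words], minlength=n)
    gain = np.bincount(sentence_id, weights=word_gain[flat_words], minlength=n)
    gain = np.where(cost <= MAX_COST, gain, -np.inf)  # mask out too-costly sentences
    best = int(np.argmax(gain))
    return keys[best], float(gain[best])

best_key, best_gain = best_sentence(word_cost, word_gain)
print(best_key, best_gain)
```

Unlike the question's loop, this sketch returns a gain of -inf instead of an index of -1 when no sentence fits within MAX_COST, so the caller would need to check for that case.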
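One way to sidestep the dict problem in Numba, sketched here under assumptions, is to pass the sentences as a flat index array plus an offsets array and keep the question's loops unchanged inside a compiled function. The names flat_words, offsets and best_sentence and the toy data are hypothetical. The try/except fallback only exists so that the sketch still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):  # fallback: run as plain Python when Numba is missing
        return func

@njit
def best_sentence(flat_words, offsets, word_cost, word_gain, max_cost):
    # Same search as the question's loops, over a flattened layout:
    # sentence i owns flat_words[offsets[i]:offsets[i + 1]]
    best_index, best_value = -1, 0.0
    for i in range(offsets.size - 1):
        cur_cost, cur_gain = 0.0, 0.0
        for p in range(offsets[i], offsets[i + 1]):
            w = flat_words[p]
            cur_cost += word_cost[w]
            cur_gain += word_gain[w]
        if cur_cost <= max_cost and cur_gain > best_value:
            best_index, best_value = i, cur_gain
    return best_index, best_value

# Hypothetical toy data
sentences = [[0, 1, 2], [2, 3, 3], [1, 4, 0, 2]]
lengths = np.array([len(s) for s in sentences], dtype=np.int64)
offsets = np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(lengths)))
flat_words = np.array([w for s in sentences for w in s], dtype=np.int64)
word_cost = np.array([1.0, 2.0, 0.5, 4.0, 0.25])
word_gain = np.array([3.0, 1.0, 2.0, 5.0, 0.5])

best_index, best_value = best_sentence(flat_words, offsets, word_cost, word_gain, 6.0)
print(best_index, best_value)
```

Because the loops run element by element, duplicates inside a sentence behave exactly as in the original code, and the update step (setting the gain to -0.01 and halving the cost) could be written as a similar loop over the same slice.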

Answered on 2023-10-19
