How to speed up the following pandas DataFrame iteration and loc indexing

tuwxkamq  posted on 2023-02-27 in Other

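Below is only a portion of the full dataset. The full dataset has millions of rows, so the computation needs to be very fast. In any case, the data looks like this:
Link to the h5 file: https://drive.google.com/file/d/16aI3plRFa3M6nSIiT1XioUIgsPYl1Wg8/view?usp=sharing
What I am doing is standard loc indexing: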

import numpy as np
import pandas as pd

filename="look at the h5 file in the link"
new_centroid_trackings = np.array([[0,0,0,0,0,0,0,0]])
model_name="DLC_resnet50_4mice_new_video_no_wheelFeb17shuffle1_220000"
tracking_coords = pd.read_hdf(filename)

for frame in range(tracking_coords.shape[0]):

    centroid_mouse1_x=(tracking_coords.loc[frame, model_name]["mouse1"]["tail1"]["x"]+tracking_coords.loc[frame, model_name]["mouse1"]["tail2"]["x"]+tracking_coords.loc[frame, model_name]["mouse1"]["tail3"]["x"])/3
    centroid_mouse1_y=(tracking_coords.loc[frame, model_name]["mouse1"]["tail1"]["y"]+tracking_coords.loc[frame, model_name]["mouse1"]["tail2"]["y"]+tracking_coords.loc[frame, model_name]["mouse1"]["tail3"]["y"])/3

    if np.isnan(centroid_mouse1_x) or np.isnan(centroid_mouse1_y):
            centroid_mouse1_y = np.nan
            centroid_mouse1_x = np.nan

    centroid_mouse2_x=(tracking_coords.loc[frame, model_name]["mouse2"]["tail1"]["x"]+tracking_coords.loc[frame, model_name]["mouse2"]["tail2"]["x"]+tracking_coords.loc[frame, model_name]["mouse2"]["tail3"]["x"])/3
    centroid_mouse2_y=(tracking_coords.loc[frame, model_name]["mouse2"]["tail1"]["y"]+tracking_coords.loc[frame, model_name]["mouse2"]["tail2"]["y"]+tracking_coords.loc[frame, model_name]["mouse2"]["tail3"]["y"])/3
    
    if np.isnan(centroid_mouse2_x) or np.isnan(centroid_mouse2_y):
            centroid_mouse2_y = np.nan
            centroid_mouse2_x = np.nan

    centroid_mouse3_x=(tracking_coords.loc[frame, model_name]["mouse3"]["tail1"]["x"]+tracking_coords.loc[frame, model_name]["mouse3"]["tail2"]["x"]+tracking_coords.loc[frame, model_name]["mouse3"]["tail3"]["x"])/3
    centroid_mouse3_y=(tracking_coords.loc[frame, model_name]["mouse3"]["tail1"]["y"]+tracking_coords.loc[frame, model_name]["mouse3"]["tail2"]["y"]+tracking_coords.loc[frame, model_name]["mouse3"]["tail3"]["y"])/3
    
    if np.isnan(centroid_mouse3_x) or np.isnan(centroid_mouse3_y):
            centroid_mouse3_y = np.nan
            centroid_mouse3_x = np.nan

    centroid_mouse4_x=(tracking_coords.loc[frame, model_name]["mouse4"]["tail1"]["x"]+tracking_coords.loc[frame, model_name]["mouse4"]["tail4"]["x"]+tracking_coords.loc[frame, model_name]["mouse4"]["tail3"]["x"])/3
    centroid_mouse4_y=(tracking_coords.loc[frame, model_name]["mouse4"]["tail1"]["y"]+tracking_coords.loc[frame, model_name]["mouse4"]["tail4"]["y"]+tracking_coords.loc[frame, model_name]["mouse4"]["tail3"]["y"])/3
    
    if np.isnan(centroid_mouse4_x) or np.isnan(centroid_mouse4_y):
            centroid_mouse4_y = np.nan
            centroid_mouse4_x = np.nan

    # now concatenate the centroids to the previous ones

    new_centroid_trackings=np.concatenate((new_centroid_trackings, np.array([[centroid_mouse1_x,centroid_mouse1_y,centroid_mouse2_x, centroid_mouse2_y, centroid_mouse3_x, centroid_mouse3_y, centroid_mouse4_x, centroid_mouse4_y]])), axis=0)

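With this, 7,500 rows take about 90 seconds.
My idea now is to use a NumPy array instead of the pandas DataFrame. Or is there some other, faster way to speed up the computation?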

balp4ylt  #1

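First, let's simplify the problem. If either x or y is NaN, you set both to NaN. But that situation never actually arises: everywhere in the dataset, if one of the two values is NaN, then both are. So this check can be removed.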
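A minimal sketch of how that claim could be checked, assuming the columns form a four-level MultiIndex of (model, mouse, bodypart, coord). The small table built here is a made-up stand-in, not the real h5 file, and `nan_pattern_matches` is a name chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real tracking table: the column levels are
# (model, mouse, bodypart, coord). All names and values here are made up.
model = "demo_model"
columns = pd.MultiIndex.from_product(
    [[model], ["mouse1", "mouse2"], ["tail1", "tail2", "tail3"], ["likelihood", "x", "y"]]
)
rng = np.random.default_rng(0)
values = rng.random((5, len(columns)))
# Simulate a missed detection in frame 2: x and y of mouse1/tail2 are both missing.
values[2, 4:6] = np.nan
tracking_coords = pd.DataFrame(values, columns=columns)

# Take every x column and every y column (level 3 holds the coordinate name).
x = tracking_coords.xs("x", axis=1, level=3)
y = tracking_coords.xs("y", axis=1, level=3)

# True only if x is missing exactly where y is missing.
nan_pattern_matches = bool((x.isna().to_numpy() == y.isna().to_numpy()).all())
print(nan_pattern_matches)
```

If this prints False on real data, the explicit NaN check from the question would still be needed.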
Next, we'll do the averaging with pandas indexing and NumPy.

model = "DLC_resnet50_4mice_new_video_no_wheelFeb17shuffle1_220000"
mouse = ["mouse1", "mouse2", "mouse3", "mouse4"]
bodyparts = ["tail1", "tail2", "tail3"]
coords = ["x", "y"]
array = tracking_coords.loc[:, (model, mouse, bodyparts, coords)].values

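pandas lets loc index on several column levels and on the row index at the same time. Here I don't filter out any rows (: is the wildcard), but I do filter out some columns; for example, this drops the likelihood columns and the coordinates of the other body parts.
Copying whole columns like this is very fast.
The result is then converted to a NumPy array by the trailing .values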
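To make the column selection concrete, here is a small sketch on a made-up table. The model name, the extra "nose" body part, and the data are assumptions for illustration only. It uses `to_numpy()`, which pandas recommends over `.values` and which gives the same array here:

```python
import numpy as np
import pandas as pd

model = "demo_model"  # hypothetical model name
mouse = ["mouse1", "mouse2"]
bodyparts = ["tail1", "tail2", "tail3"]
coords = ["x", "y"]

# Made-up table with one extra body part and a likelihood column,
# so that the selection really has something to drop.
columns = pd.MultiIndex.from_product(
    [[model], mouse, ["nose"] + bodyparts, ["likelihood"] + coords]
)
df = pd.DataFrame(
    np.arange(4 * len(columns), dtype=float).reshape(4, -1), columns=columns
)

# One label (or list of labels) per column level; ":" keeps every row.
selected = df.loc[:, (model, mouse, bodyparts, coords)]
array = selected.to_numpy()

print(df.shape)     # all columns: 2 mice * 4 body parts * 3 values = 24
print(array.shape)  # kept columns: 2 mice * 3 body parts * 2 coords = 12
```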

nrows = array.shape[0]
ncols = array.shape[1]
# Check that all cols are present
assert ncols == len(mouse) * len(coords) * len(bodyparts)
# Reshape
# Axes are observation, mouse, bodypart, then coordinate
array = array.reshape(nrows, len(mouse), len(bodyparts), len(coords))

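Now the NumPy array has been reshaped. This operation is very fast because it does not modify the data. It only changes the indexing metadata used to address the data, which is a constant-time operation.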
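A short sketch of what the reshape does, on a made-up array rather than the real data. Note that NumPy returns a view only when the memory layout allows it; otherwise `reshape` silently makes a copy, which is still cheap compared with a Python loop:

```python
import numpy as np

nrows, n_mice, n_parts, n_coords = 5, 4, 3, 2

# Made-up flat table: one row per frame, one column per (mouse, part, coord).
flat = np.arange(nrows * n_mice * n_parts * n_coords, dtype=float).reshape(nrows, -1)

reshaped = flat.reshape(nrows, n_mice, n_parts, n_coords)

# For a C-contiguous array like this one, the reshape is a view on the same buffer.
shares = np.shares_memory(flat, reshaped)

# The same element, addressed through the flat column index and through the axes.
mouse_i, part_i, coord_i = 1, 2, 1
value_flat = flat[2, (mouse_i * n_parts + part_i) * n_coords + coord_i]
value_reshaped = reshaped[2, mouse_i, part_i, coord_i]
print(shares, value_flat == value_reshaped)
```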

# Average readings across bodyparts, which is axis 2
array = array.mean(axis=2)

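This is the whole point of the reshape: because the values we want to average now lie along a single axis, they can be processed in one vectorized operation.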
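A tiny worked example of the axis-2 mean, with made-up points for one frame and one mouse. It also shows that a NaN in any of the three tail points turns the whole centroid into NaN, which matches what the arithmetic in the question does:

```python
import numpy as np

# Shape (frames=1, mice=1, bodyparts=3, coords=2); values are made up.
points = np.array([[[[0.0, 0.0], [3.0, 6.0], [6.0, 3.0]]]])

# Averaging over axis 2 collapses the three tail points into one centroid.
centroids = points.mean(axis=2)
print(centroids.shape)  # (1, 1, 2)

# A missing tail point propagates NaN into the centroid.
points_with_gap = points.copy()
points_with_gap[0, 0, 1, :] = np.nan
centroids_gap = points_with_gap.mean(axis=2)
print(centroids_gap)
```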

# Flatten inner dimensions
array = array.reshape(nrows, 8)

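This is needed to put the data in the same shape as in your example. You can skip this line if you like; if you do, the shape will be (?, 4, 2), where the first axis is time, the second is the mouse index, and the third is the coordinate, x or y.
In my tests this takes 10 ms to process the data. I compared it against the results produced by the implementation in the question and they match, with two caveats:
1. Your version starts with a row of zeros; mine does not.
2. I fixed something in your original implementation that looked like a bug to me. (Changed tail4 to tail2.)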
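Putting the steps together, here is a self-contained sketch that compares the vectorized version with a plain loop on a small made-up table. The function names `centroids_vectorized` and `centroids_loop`, the model name, and the data are all assumptions for illustration; the real file is not used:

```python
import numpy as np
import pandas as pd


def centroids_vectorized(df, model, mice, bodyparts, coords=("x", "y")):
    """Return an (n_frames, len(mice) * len(coords)) array of centroids."""
    block = df.loc[:, (model, list(mice), list(bodyparts), list(coords))].to_numpy()
    # Axes: frame, mouse, bodypart, coordinate.
    block = block.reshape(len(df), len(mice), len(bodyparts), len(coords))
    # Average over the bodypart axis, then flatten (mouse, coord) into columns.
    return block.mean(axis=2).reshape(len(df), -1)


def centroids_loop(df, model, mice, bodyparts, coords=("x", "y")):
    """Slow reference version: one scalar loc lookup per value."""
    rows = []
    for frame in range(len(df)):
        row = []
        for m in mice:
            for c in coords:
                total = sum(df.loc[frame, (model, m, b, c)] for b in bodyparts)
                row.append(total / len(bodyparts))
        rows.append(row)
    return np.array(rows)


# Made-up table with the same column layout: (model, mouse, bodypart, coord).
model = "demo_model"
mice = ["mouse1", "mouse2"]
bodyparts = ["tail1", "tail2", "tail3"]
columns = pd.MultiIndex.from_product(
    [[model], mice, ["nose"] + bodyparts, ["likelihood", "x", "y"]]
)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, len(columns))), columns=columns)

fast = centroids_vectorized(df, model, mice, bodyparts)
slow = centroids_loop(df, model, mice, bodyparts)
print(fast.shape, np.allclose(fast, slow))
```

The vectorized function relies on the selected columns coming back ordered by mouse, then body part, then coordinate, so it is worth running a comparison like this once on real data.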

krcsximq  #2

You could try something like the following:

idx = pd.IndexSlice
mouses = ['mouse1', 'mouse2', 'mouse3', 'mouse4']
for frame in range(tracking_coords.shape[0]):
    centroids = np.zeros([4, 2])
    df = tracking_coords.loc[frame, model_name]
    for (n, mouse) in enumerate(mouses):

        # for each mouse calculate the [centroid_x, centroid_y] values

        centroids[n] = [(df.loc[idx[mouse], :, idx['x']])[-4:-1].mean(), 
                        (df.loc[idx[mouse], :,idx['y']])[-4:-1].mean()]
        if np.isnan(np.prod(centroids[n])):
            centroids[n] = [np.nan, np.nan]

        # Do something here with these centroid values

The code above takes about 20 s to iterate over 9000+ rows. I hope this helps.
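If you keep a per-frame loop, one way to "do something" with the centroids is to write them into a preallocated array, since calling np.concatenate in every iteration copies the whole result each time. A minimal sketch under assumed inputs: `points` is a made-up array of shape (frames, mice, 3 tail points, xy) standing in for values already extracted from the DataFrame.

```python
import numpy as np

n_frames, n_mice = 100, 4
rng = np.random.default_rng(0)

# Hypothetical input: (frames, mice, 3 tail points, xy), values are made up.
points = rng.random((n_frames, n_mice, 3, 2))
points[10, 2, 1, :] = np.nan  # simulate one missed detection

# Allocate the result once instead of growing it inside the loop.
results = np.empty((n_frames, n_mice, 2))
for frame in range(n_frames):
    centroids = points[frame].mean(axis=1)  # shape (n_mice, 2)
    # If either coordinate is NaN, mark the whole centroid as NaN.
    bad = np.isnan(centroids).any(axis=1)
    centroids[bad] = np.nan
    results[frame] = centroids
```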
