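I have a DataFrame named zone whose x and y columns are integers that can be interpreted as point positions. For each point I need to count its first and second neighbors, and I wrote this: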
import numpy as np
import pandas as pd

data = np.random.randint(1000,6000,size=(600000,2))
zone = pd.DataFrame(data, columns=['x', 'y']).drop_duplicates()

a=[]
for i,row in zone.iterrows():
    x = row.x
    y = row.y
    # points in the 3x3 box around (x, y), minus the point itself
    num_1st_neigh = len(zone[(zone.x>=(x-1))&(zone.x<=(x+1))&(zone.y>=(y-1))&(zone.y<=(y+1))])-1
    # points in the 5x5 box, minus the point itself and its first neighbors
    num_2nd_neigh = (len(zone[(zone.x>=(x-2))&(zone.x<=(x+2))&(zone.y>=(y-2))&(zone.y<=(y+2))])-1)\
                    -(num_1st_neigh)
    a.append([i,num_1st_neigh,num_2nd_neigh])
a = pd.DataFrame(a, columns = ['index','num_1st_neigh','num_2nd_neigh'])
zzz = zone.reset_index().merge(a,on='index')
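This works fine, but it takes 15 seconds for 3K points. I have 1M points, and it is still running after 2 hours. Any ideas on how to speed it up?
I have read that iterrows is very slow, but I don't know how else to do it.
Edit: I also tried the same thing in SQL, but the execution time exceeded 2 hours and the query returned a timeout: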
SELECT t0.x,
       t0.y,
       count_if(greatest(abs(t0.x-t1.x), abs(t0.y-t1.y)) = 1) num_1_neighbors,
       count_if(greatest(abs(t0.x-t1.x), abs(t0.y-t1.y)) = 2) num_2_neighbors
FROM "table" t0
left join "table" t1 on t1.x between t0.x -2 and t0.x + 2
    and t1.y between t0.y -2 and t0.y + 2
    and (
        t1.x <> t0.x
        or t1.y <> t0.y
    )
group by 1,2
Any ideas using SQL or pandas are very welcome.
2 Answers
Answer 1:
You can use BallTree from sklearn. A small example:
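A minimal sketch of how BallTree might be used for this count, assuming the points are unique (as after drop_duplicates) and that "neighbor" means Chebyshev distance 1 or 2, which matches the square boxes in the question. The function name count_neighbors_balltree and the small random test data are illustrative choices, not taken from the original answer. The radii 1.5 and 2.5 are used instead of 1 and 2 so that integer distances never sit exactly on the boundary.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree


def count_neighbors_balltree(zone: pd.DataFrame) -> pd.DataFrame:
    """Count first and second neighbors of every (x, y) point.

    Assumes the rows of `zone` are unique integer points.
    """
    pts = zone[["x", "y"]].to_numpy(dtype=float)
    # Chebyshev distance = max(|dx|, |dy|), i.e. square neighborhoods.
    tree = BallTree(pts, metric="chebyshev")
    # count_only=True returns, per query point, how many points lie within r.
    within_1 = tree.query_radius(pts, r=1.5, count_only=True)
    within_2 = tree.query_radius(pts, r=2.5, count_only=True)
    out = zone.copy()
    out["num_1st_neigh"] = within_1 - 1          # exclude the point itself
    out["num_2nd_neigh"] = within_2 - within_1   # ring at distance exactly 2
    return out


# Small illustrative data set (hypothetical, much smaller than the question's).
rng = np.random.default_rng(0)
data = rng.integers(0, 50, size=(500, 2))
zone = pd.DataFrame(data, columns=["x", "y"]).drop_duplicates()
result = count_neighbors_balltree(zone)
```

Both queries run once over the whole array, so the Python-level loop over rows disappears; the tree does the range search in compiled code.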
Answer 2:
Then:
This prints:
Benchmark:
Plot:
For data = np.random.randint(1000,6000,size=(1_000_000,2)), the spatial hash map version returns in 44.175860428018495 seconds (on my AMD Ryzen 5700X / Python 3.10 machine).
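As an illustration of the spatial hash idea mentioned above, here is a minimal sketch. It is not necessarily the code that produced the timing quoted in this answer. It assumes unique integer points and stores them in a Python set, so each of the 8 first-ring and 16 second-ring cells around a point is checked with a constant-time lookup. The name count_neighbors_hashed and the sample points are hypothetical.

```python
from itertools import product


def count_neighbors_hashed(points):
    """Return {(x, y): (num_1st_neigh, num_2nd_neigh)} for unique integer points."""
    occupied = set(points)  # hash set: O(1) membership test per grid cell

    # Offsets at Chebyshev distance exactly 1 (8 cells) and exactly 2 (16 cells).
    ring1 = [(dx, dy) for dx, dy in product(range(-1, 2), repeat=2)
             if max(abs(dx), abs(dy)) == 1]
    ring2 = [(dx, dy) for dx, dy in product(range(-2, 3), repeat=2)
             if max(abs(dx), abs(dy)) == 2]

    counts = {}
    for x, y in occupied:
        n1 = sum((x + dx, y + dy) in occupied for dx, dy in ring1)
        n2 = sum((x + dx, y + dy) in occupied for dx, dy in ring2)
        counts[(x, y)] = (n1, n2)
    return counts


# Tiny hand-checkable example.
pts = [(0, 0), (1, 0), (1, 1), (2, 2), (5, 5)]
counts = count_neighbors_hashed(pts)
```

With a DataFrame like the one in the question, the input could be built with something like set(zip(zone.x.tolist(), zone.y.tolist())). The work is 24 lookups per point, so the total cost grows linearly with the number of points instead of quadratically.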