I have a Pandas DataFrame indexed by geohash, plus a column called neighbours that stores a list of each geohash's neighbours. A few other columns hold metadata for each geohash. The DataFrame looks like this:
| geohash (index) | wave_height | normalized_wave_height | speed_factor | neighbours |
| ------ | ------ | ------ | ------ | ------ |
| u4sj9hz | 0.962316 | 0.361604 | 0.757725 | ["u4sj9hy", "u4sj9kb", "u4sj9hx", "u4sj9hw", "u4sj9k8", ...] |
| u4ezqxn | 0.570723 | 0.214457 | 0.856314 | ["u4ezqxj", "u4ezqxp", "u4ezqwy", "u4ezqwv", "u4ezqwz", ...] |
I need to create an edge_list for building a graph. At first I did the following:
```python
def create_edge_list(geohash, speed_factor, neighbours):
    edge_list = []
    for n in neighbours:
        distance = haversine_distance(geohash, n)
        # distance is in km, speed is in m/s.
        speed = 14 * speed_factor
        time = round((distance / (speed * 3.6)) * 60, 1)
        edge_list.append((geohash, n, {"distance": distance, "time": time}))
    return edge_list
```
```python
elist = []
for geohash, row in tqdm(df.iterrows(), desc="Creating edge list", total=len(df.index), colour="green"):
    edge_list = create_edge_list(geohash, row.speed_factor, row.neighbours)
    elist.extend(edge_list)
```
But with more than 7 million rows this is very slow. I then tried multiprocessing and multithreading with ProcessPoolExecutor and ThreadPoolExecutor, but they did not help much. Any suggestions?
EDIT: It turned out I had a bug in my ProcessPoolExecutor code; once I fixed it, it did work and sped things up (from several hours of plain looping down to about 80 minutes). I also made a slightly edited minimal reproducible example (notebook):
```python
# Using Python 3.11.2, but works fine for most other newer Python versions
!pip install geopandas
!pip install geohash
!pip install polygeohasher
!pip install shapely
!pip install pandas
!pip install geopandas
!pip install tqdm

import os
import random
from math import cos, sin, asin, sqrt, radians

import geohash as gh
from polygeohasher import polygeohasher
from shapely.wkt import loads
import pandas as pd
import geopandas as gpd
from tqdm import tqdm
```
```python
def haversine_distance(geohash1, geohash2):
    # geohash2 might be a list of neighbours
    if isinstance(geohash2, list):
        return [round(haversine_distance(geohash1, g), 3) for g in geohash2]
    lat1, lon1 = gh.decode(geohash1)
    lat2, lon2 = gh.decode(geohash2)
    lat1, lon1 = (float(lat1), float(lon1))
    lat2, lon2 = (float(lat2), float(lon2))
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles. Determines return value units.
    return c * r
```
```python
def create_edge_list(geohash, speed_factor, neighbours):
    speed_multiplier = 60 / (3.6 * 14 * speed_factor)
    neighbours = list(neighbours)
    distances = haversine_distance(geohash, neighbours)
    times = [round(d * speed_multiplier, 2) for d in distances]
    edge_list = [(geohash, neighbours[i], {"distance": distances[i], "time": times[i]})
                 for i in range(len(times))]
    return edge_list
```
```python
if __name__ == "__main__":
    GEOHASH_PRECISION = 6
    # Create polygons using: https://clydedacruz.github.io/openstreetmap-wkt-playground/
    polygon_wkt = "POLYGON((9.07196044921875 53.91728101547625,8.25897216796875 52.99495027026802,5.88043212890625 53.20603255157843,5.072937011718749 53.497849543967675,5.913391113281249 53.74221377343122,6.05621337890625 54.004540438503625,8.73687744140625 54.072282655603885,9.07196044921875 53.91728101547625))"
    polygon_gdf = gpd.GeoDataFrame(index=[0], crs="EPSG:4326", geometry=[loads(polygon_wkt)])

    print("Creating geohash list...")
    temp_df = polygeohasher.create_geohash_list(polygon_gdf, GEOHASH_PRECISION, inner=True)
    df = pd.DataFrame(temp_df.geohash_list.values.tolist()[0], columns=["geohash"])
    df.set_index("geohash", inplace=True)
    # just simulate some speed factor for now
    df["speed_factor"] = [random.uniform(0.4, 1.0) for i in range(len(df.index))]
    neighbours = {geohash: gh.neighbors(geohash) for geohash in df.index}
    df["neighbours"] = df.index.map(neighbours)

    elist = []
    MT = False
    print("Creating edge list...")
    if MT:
        from concurrent.futures import ProcessPoolExecutor

        geohash_list = list(df.index)
        speed_factor_list = list(df.speed_factor)
        neighbours_list = list(df.neighbours)
        with tqdm(desc="Creating edge list", total=len(df.index), colour="green") as pbar:
            with ProcessPoolExecutor(os.cpu_count()) as executor:
                result = executor.map(create_edge_list, geohash_list, speed_factor_list,
                                      neighbours_list, chunksize=len(df.index) // os.cpu_count())
                for edge_list in result:
                    elist.extend(edge_list)
                    pbar.update(1)
    else:
        for geohash, row in tqdm(df.iterrows(), desc="Creating edge list", total=len(df.index), colour="green"):
            edge_list = create_edge_list(geohash, row.speed_factor, row.neighbours)
            elist.extend(edge_list)
```
2 Answers
bvn4nwqk1#
Here is a way to speed this up (roughly 9x on your example):
1. Convert the geohashes to 3D (x, y, z) coordinates.
2. Compute the Euclidean distances.
3. Convert these (chord) distances into "great circle" distances.
We also prefer to convert each distinct geohash only once and then map the result, rather than converting duplicated geohashes over and over.

BTW, finding neighbours directly in 3D space would be very convenient, e.g. with the excellent scipy.spatial.KDTree: you can set a maximum Euclidean distance close to the maximum haversine distance you want (for small distances the two are nearly identical, although the Euclidean chord is of course always a bit shorter), and then filter the nearest neighbours found. In any case, here we simply use the neighbours you provided.

First, convert the geohashes:
Example:
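A minimal sketch of this step, assuming numpy and the same `geohash` package as in the MRE (the helper name `geohash_to_xyz` is my own):

```python
import numpy as np
import geohash as gh

def geohash_to_xyz(geohashes):
    """Decode geohashes to lat/lon and project them onto the unit sphere."""
    lat, lon = np.array([gh.decode(g) for g in geohashes], dtype=float).T
    lat, lon = np.radians(lat), np.radians(lon)
    # Cartesian coordinates on a sphere of radius 1
    return np.c_[np.cos(lat) * np.cos(lon),
                 np.cos(lat) * np.sin(lon),
                 np.sin(lat)]

# e.g. a few geohashes from the frame above -> array of shape (3, 3)
xyz = geohash_to_xyz(["u4sj9hy", "u4sj9kb", "u4sj9hx"])
```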
Then, we convert the chord (Euclidean) distances into great-circle, or "haversine", distances:
Example:
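A sketch of that conversion; the helper name is mine, and it reuses the same 6371 km Earth radius as the question's `haversine_distance`:

```python
import numpy as np

EARTH_RADIUS_KM = 6371  # same radius as in haversine_distance above

def chord_to_great_circle(chord, radius=EARTH_RADIUS_KM):
    """Convert unit-sphere chord lengths (Euclidean) to great-circle distances in km."""
    # a chord of length c on the unit sphere subtends a central angle of 2*arcsin(c/2)
    return 2 * radius * np.arcsin(np.asarray(chord) / 2)

# e.g. chords between consecutive points from the previous sketch
chords = np.linalg.norm(xyz[1:] - xyz[:-1], axis=1)
distances_km = chord_to_great_circle(chords)
```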
Now we have all the pieces needed to implement everything in a vectorized way:
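A sketch of how the whole pipeline could be assembled, reusing the two helpers above and the `df` from the MRE; `make_edges` and the `source`/`target` column names are my own choices, and the time formula mirrors the question's (14 m/s times `speed_factor`, result in minutes):

```python
import numpy as np
import pandas as pd

def make_edges(df):
    """Vectorized edges from a frame indexed by geohash with 'speed_factor' and 'neighbours' columns."""
    # one row per (geohash, neighbour) pair
    z = (df[["speed_factor", "neighbours"]]
         .explode("neighbours")
         .reset_index()
         .rename(columns={"geohash": "source", "neighbours": "target"}))
    # convert each distinct geohash once, then map the coordinates back
    hashes = pd.unique(z[["source", "target"]].to_numpy().ravel())
    xyz = dict(zip(hashes, geohash_to_xyz(hashes)))
    a = np.stack(z["source"].map(xyz).to_numpy())
    b = np.stack(z["target"].map(xyz).to_numpy())
    # Euclidean chord on the unit sphere -> great-circle distance in km
    z["distance"] = chord_to_great_circle(np.linalg.norm(a - b, axis=1)).round(3)
    # speed = 14 m/s * speed_factor = 14 * 3.6 km/h * speed_factor; time in minutes
    z["time"] = (z["distance"] / (14 * 3.6 * z["speed_factor"]) * 60).round(2)
    return z[["source", "target", "distance", "time"]]

z = make_edges(df)
```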
On the MRE example you provided:
Comparison with your result:
Addendum: converting the result to a list of tuples
To convert the result above into the exact format you want, we can do the following:
Example:
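One possible way, assuming `z` is the edge DataFrame from the sketch above:

```python
# same (source, target, attributes) tuples as the original create_edge_list produced
elist = [(s, t, {"distance": d, "time": m})
         for s, t, d, m in zip(z["source"], z["target"], z["distance"], z["time"])]
```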
Note, however, that this should probably be the very last step. As a DataFrame, the result is much easier to work with. For example, you can filter rows shorter than a given distance in a one-liner, e.g. `z.loc[z['distance'] < .9]`, or inspect the distribution (e.g. `z['distance'].hist(bins=50)` or `z['distance'].quantile([.1, .5, .9])`), etc.

Speed
On the given MRE:
Converting to the desired output format also takes some time (although this is 3x faster than my initial version, which was more "pandas-like"):
4xy9mtcn2#
A few suggestions for `create_edge_list`: compute the multiplier only once, drop the for loop by working on arrays, and use a list comprehension to build the output. I can't test this; if haversine_distance returns a 2D array, you may have to use `distance[0, i]`.

As for the main loop, you can pre-initialize your list and then fill it in as you go:
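A sketch of that idea, assuming the same `df` and `create_edge_list` as in the MRE:

```python
# pre-allocate one slot per row instead of growing a flat list with extend()
edge_lists = [None] * len(df.index)
for i, (geohash, row) in enumerate(df.iterrows()):
    edge_lists[i] = create_edge_list(geohash, row.speed_factor, row.neighbours)
# flatten once at the end
elist = [edge for lst in edge_lists for edge in lst]
```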