pandas: Optimizing operations on a pandas DataFrame, current solution is extremely slow

k97glaaz · posted 2023-02-27 · in: Other
Follow (0) | Answers (2) | Views (282)

I have a Pandas DataFrame indexed by geohash, with a column called neighbours that stores a list of each geohash's neighbours. Some other columns hold metadata for each geohash. The DataFrame looks like this:
| geohash (index) | wave_height | normalized_wave_height | speed_factor | neighbours |
| ------- | ------- | ------- | ------- | ------- |
| u4sj9hz | 0.962316 | 0.361604 | 0.757725 | ["u4sj9hy", "u4sj9kb", "u4sj9hx", "u4sj9hw", "u4sj9k8", ...] |
| u4ezqxn | 0.570723 | 0.214457 | 0.856314 | ["u4ezqxj", "u4ezqxp", "u4ezqwy", "u4ezqwv", "u4ezqwz", ...] |
I need to create an edge list for graph creation. My first attempt was:

def create_edge_list(geohash, speed_factor, neighbours):
    edge_list = []
    for n in neighbours:
        distance = haversine_distance(geohash, n)
        # distance is in km, speed is in m/s.
        speed = 14 * speed_factor
        time = round((distance/(speed*3.6))*60, 1)
        edge_list.append((geohash, n, {"distance": distance, "time": time}))
    return edge_list

for geohash, row in tqdm(df.iterrows(), desc="Creating edge list", total=len(df.index), colour="green"):
    edge_list = create_edge_list(geohash, row.speed_factor, row.neighbours)
    elist.extend(edge_list)

But since I have more than 7 million rows, this is very slow. I then tried multiprocessing and multithreading via ProcessPoolExecutor and ThreadPoolExecutor, but they did not help much. Any suggestions?
Edit: it turned out I had a bug in my ProcessPoolExecutor code; once I fixed it, it did speed things up (from several hours of plain looping down to about 80 minutes). I also made a slightly edited minimal reproducible example (notebook):

# Using Python 3.11.2, but works fine for most other newer Python versions

!pip install geopandas
!pip install geohash
!pip install polygeohasher
!pip install shapely
!pip install pandas
!pip install tqdm

import os
import random
from math import cos, sin, asin, sqrt, radians

import geohash as gh
from polygeohasher import polygeohasher
from shapely.wkt import loads
import pandas as pd
import geopandas as gpd
from tqdm import tqdm

def haversine_distance(geohash1, geohash2):
    # geohash2 might be a list of neighbours
    if isinstance(geohash2, list):
        return [round(haversine_distance(geohash1, g), 3) for g in geohash2]

    lat1, lon1 = gh.decode(geohash1)
    lat2, lon2 = gh.decode(geohash2)

    lat1, lon1 = (float(lat1), float(lon1))
    lat2, lon2 = (float(lat2), float(lon2))

    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles. Determines return value units.
    return c * r

def create_edge_list(geohash, speed_factor, neighbours):
    speed_multiplier = 60 / (3.6 * 14 * speed_factor)
    neighbours = list(neighbours)
    distances = haversine_distance(geohash, neighbours)
    times = [round(d * speed_multiplier, 2) for d in distances]
    edge_list = [(geohash, neighbours[i], {"distance": distances[i], "time": times[i]}) for i in range(len(times))]
    return edge_list

if __name__ == "__main__":
    GEOHASH_PRECISION = 6
    # Create polygons using: https://clydedacruz.github.io/openstreetmap-wkt-playground/
    polygon_wkt = "POLYGON((9.07196044921875 53.91728101547625,8.25897216796875 52.99495027026802,5.88043212890625 53.20603255157843,5.072937011718749 53.497849543967675,5.913391113281249 53.74221377343122,6.05621337890625 54.004540438503625,8.73687744140625 54.072282655603885,9.07196044921875 53.91728101547625))"
    polygon_gdf = gpd.GeoDataFrame(index=[0], crs="EPSG:4326", geometry=[loads(polygon_wkt)])
    print("Creating geohash list...")
    temp_df = polygeohasher.create_geohash_list(polygon_gdf, GEOHASH_PRECISION, inner=True)
    df = pd.DataFrame(temp_df.geohash_list.values.tolist()[0], columns=["geohash"])
    df.set_index("geohash", inplace=True)

    # just simulate some speed factor for now
    df["speed_factor"] = [random.uniform(0.4, 1.0) for i in range(len(df.index))]

    neighbours = {geohash: gh.neighbors(geohash) for geohash in df.index}
    df["neighbours"] = df.index.map(neighbours)

    elist = []
    MT = False
    print("Creating edge list...")
    if MT:
        from concurrent.futures import ProcessPoolExecutor
        geohash_list = list(df.index)
        speed_factor_list = list(df.speed_factor)
        neighbours_list = list(df.neighbours)
        with tqdm(desc="Creating edge list", total=len(df.index), colour="green") as pbar:
            with ProcessPoolExecutor(os.cpu_count()) as executor:
                result = executor.map(create_edge_list, geohash_list, speed_factor_list, neighbours_list, chunksize=len(df.index)//(os.cpu_count()))
                for edge_list in result:
                    elist.extend(edge_list)
                    pbar.update(1)
    else:
        for geohash, row in tqdm(df.iterrows(), desc="Creating edge list", total=len(df.index), colour="green"):
            edge_list = create_edge_list(geohash, row.speed_factor, row.neighbours)
            elist.extend(edge_list)

bvn4nwqk · #1

Here is a way to speed this up (roughly 9x on your example):
1. Convert the geohashes to 3D (x, y, z) coordinates.
2. Compute Euclidean distances.
3. Convert those (chord) distances to great-circle distances.
We also prefer to convert the distinct geohashes once and then map, rather than converting duplicated geohashes many times.
As a side note, finding neighbours in 3D space would be very convenient, e.g. using the excellent scipy.spatial.KDTree: you can set a max distance close to the max haversine distance you want (for small distances they are nearly identical, though the Euclidean distance is of course always a bit smaller), then filter the nearest neighbours found.
In any case, here we just use the neighbours you provided.
First, convert the geohashes:

import numpy as np  # needed in addition to the question's imports

R_earth = 6371

def geohash_to_xyz(geoh, R=R_earth):
    lat, lon = np.deg2rad(np.array(pd.Series(geoh).apply(gh.decode).to_list())).T
    
    # conversion (latitude, longitude, altitude) to (x, y, z)
    # see https://stackoverflow.com/a/10788250/758174
    alt = 0  # use altitude = 0
    f = 0  # use flattening = 0
    coslat = np.cos(lat)
    sinlat = np.sin(lat)
    FF     = (1.0 - f)**2
    C      = 1 / np.sqrt(coslat**2 + FF * sinlat**2)
    S      = C * FF

    x = (R * C + alt) * coslat * np.cos(lon)
    y = (R * C + alt) * coslat * np.sin(lon)
    z = (R * S + alt) * sinlat
    
    return np.c_[x, y, z]

Example:

>>> geohash_to_xyz(['u4sj9hz', 'u4sj9hy'])
array([[3164.48891562,  314.70029181, 5520.56289062],
       [3164.49645759,  314.62444382, 5520.56289062]])
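As a rough sketch of the KDTree idea mentioned above (an illustration under my own assumptions, using randomly generated points on the sphere in place of the geohash_to_xyz output): a maximum great-circle distance can be converted to the equivalent Euclidean chord length, and scipy.spatial.KDTree can then find all point pairs within that radius.

```python
import numpy as np
from scipy.spatial import KDTree

R_earth = 6371.0

def max_chord(haversine_km, R=R_earth):
    # Euclidean chord corresponding to a great-circle distance:
    # chord = 2 R sin(alpha / 2), with alpha = haversine / R
    return 2 * R * np.sin(haversine_km / (2 * R))

# Hypothetical points on the sphere (in practice, the geohash_to_xyz output)
rng = np.random.default_rng(42)
lat = np.deg2rad(rng.uniform(53.0, 54.0, 200))
lon = np.deg2rad(rng.uniform(5.0, 9.0, 200))
pts = np.c_[np.cos(lat) * np.cos(lon),
            np.cos(lat) * np.sin(lon),
            np.sin(lat)] * R_earth

tree = KDTree(pts)
# All index pairs whose chord distance is <= the chord of a 5 km arc;
# since the chord slightly underestimates the arc, filter afterwards
# with chord_to_haversine if an exact cutoff matters.
pairs = tree.query_pairs(r=max_chord(5.0))
```

The candidate set is tiny compared to all-pairs, so the exact haversine filter afterwards is cheap.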

Then, we convert chord (Euclidean) distances to great-circle, or "haversine", distances:

def chord_to_haversine(d, R=R_earth):
    # say we have two 3D points on a R-radius sphere
    # separated by a Euclidean distance d
    # d = R * crd(alpha)   (the chord times radius)
    # 
    # crd(alpha) = 2 sin(alpha / 2)
    # Thus: alpha = 2 arcsin(1/2 d/R)
    # Haversine (hav) is simply alpha * R
    return R * 2 * np.arcsin(d / (2 * R))

Example:

ha, hb = 'u4sj9hz', 'u4ezqxn'
a, b = geohash_to_xyz([ha, hb])
d = np.linalg.norm(a - b)

>>> chord_to_haversine(d)
36.106372367799615

>>> haversine_distance(ha, hb)
36.106372367799615

Now we have all the pieces to implement everything in a vectorized way:

def new_edge_list(df, R=R_earth):
    z = df.explode('neighbours')
    geoh = pd.concat([
        df.reset_index()['geohash'],
        z['neighbours'],
    ], ignore_index=True).drop_duplicates()
    pmap = pd.DataFrame(
        geohash_to_xyz(geoh, R),
        columns=list('xyz'),
        index=geoh).rename_axis('geohash', axis=0)
    a = pmap.loc[z.index].to_numpy()
    b = pmap.loc[z['neighbours']].to_numpy()
    d = chord_to_haversine(np.linalg.norm(a - b, axis=1), R)
    t = d * 60 / (3.6 * 14 * z['speed_factor'])
    z = z.assign(distance=d, time=t)

    return z

On the MRE you provided:

z = new_edge_list(df)

>>> z
         speed_factor neighbours  distance      time
geohash                                             
u1kz2d       0.474744     u1kz26  0.729742  1.829913
u1kz2d       0.474744     u1kz2f  0.729742  1.829913
u1kz2d       0.474744     u1kz29  0.610812  1.531683
u1kz2d       0.474744     u1kz23  0.951674  2.386433
u1kz2d       0.474744     u1kz2c  0.951674  2.386433
...               ...        ...       ...       ...
u1mrp4       0.761640     u1mrnc  0.952250  1.488407
u1mrp4       0.761640     u1mrp3  0.952250  1.488407
u1mrp4       0.761640     u1mrp5  0.610812  0.954725
u1mrp4       0.761640     u1mrng  0.952178  1.488295
u1mrp4       0.761640     u1mrp7  0.952178  1.488295

[375808 rows x 4 columns]

Compare with your result:

def make_edges(df, show_progress=True):
    elist = []
    it = df.iterrows()
    if show_progress:
        it = tqdm(it, desc="Creating edge list", total=len(df.index), colour="green")
    for geohash, row in it:
        edge_list = create_edge_list(geohash, row['speed_factor'], row['neighbours'])
        elist += edge_list
    return elist

elist = make_edges(df)
>>> elist
[('u1kz2d', 'u1kz26', {'distance': 0.73, 'time': 1.83}),
 ('u1kz2d', 'u1kz2f', {'distance': 0.73, 'time': 1.83}),
 ('u1kz2d', 'u1kz29', {'distance': 0.611, 'time': 1.53}),
 ('u1kz2d', 'u1kz23', {'distance': 0.952, 'time': 2.39}),
 ('u1kz2d', 'u1kz2c', {'distance': 0.952, 'time': 2.39}),
 ...
 ('u1mrp4', 'u1mrnc', {'distance': 0.952, 'time': 1.49}),
 ('u1mrp4', 'u1mrp3', {'distance': 0.952, 'time': 1.49}),
 ('u1mrp4', 'u1mrp5', {'distance': 0.611, 'time': 0.96}),
 ('u1mrp4', 'u1mrng', {'distance': 0.952, 'time': 1.49}),
 ('u1mrp4', 'u1mrp7', {'distance': 0.952, 'time': 1.49})]

Addendum: converting the result to a list of tuples

To convert the result above to the exact desired format, we can do the following:

def to_tuples(z):
    return [
        (ha, hb, {'distance': d, 'time':t})
        for ha, hb, d, t in zip(
            z.index, z['neighbours'],
            np.round(z['distance'].to_numpy(), 3),
            np.round(z['time'].to_numpy(), 2))
    ]

Example:

new_elist = to_tuples(z)
>>> new_elist[:5]
[('u1kz2d', 'u1kz26', {'distance': 0.73, 'time': 1.83}),
 ('u1kz2d', 'u1kz2f', {'distance': 0.73, 'time': 1.83}),
 ('u1kz2d', 'u1kz29', {'distance': 0.611, 'time': 1.53}),
 ('u1kz2d', 'u1kz23', {'distance': 0.952, 'time': 2.39}),
 ('u1kz2d', 'u1kz2c', {'distance': 0.952, 'time': 2.39})]

Note, though, that this should probably be the very last step. As a DataFrame, the result is much easier to manipulate. For example, you can filter rows closer than a given distance in a one-liner, e.g. z.loc[z['distance'] < .9], or look at distributions, e.g. z['distance'].hist(bins=50) or z['distance'].quantile([.1, .5, .9]), etc.
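As an illustration of that kind of DataFrame post-processing (a toy frame with made-up values, just to show the one-liners in action):

```python
import pandas as pd

# Hypothetical toy edge frame mimicking the structure of z
z = pd.DataFrame({
    "neighbours": ["u1kz26", "u1kz2f", "u1kz29"],
    "distance": [0.73, 0.73, 0.95],
    "time": [1.83, 1.83, 2.39],
})

short = z.loc[z["distance"] < 0.9]        # keep only edges shorter than 0.9 km
q = z["distance"].quantile([.1, .5, .9])  # look at the distance distribution
```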
Speed

On the given MRE:

t_orig = %timeit -o make_edges(df, show_progress=False)
# 3.83 s ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

t_new = %timeit -o new_edge_list(df)
# 408 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> t_orig.best / t_new.best
9.40

Converting to the desired output format also takes some time (although this is 3x faster than my initial version, which was more "Pandas-y"):

%timeit to_tuples(z)
# 243 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

4xy9mtcn · #2

A few suggestions for create_edge_list: compute the multiplier only once, remove the for loop by working on arrays, and use a list comprehension to build the output.

def create_edge_list(geohash, speed_factor, neighbours):
    speed_multiplier = 60 / (3.6 * 14 * speed_factor)  # distance is in km, speed is in m/s.
    distance = haversine_distance(geohash, neighbours)
    time = np.round(distance * speed_multiplier, 1)
    edge_list = [(geohash, neighbours[i], {"distance": distance[i], "time": time[i]}) for i in range(len(time))]
    return edge_list

I cannot test this. If haversine_distance outputs a 2D array, you may have to use distance[0, i] instead.
As for the main loop, you can pre-allocate your list and then fill it as you go:

elist = [None for _ in range(len(df.index))]
for i, (geohash, row) in tqdm(enumerate(df.iterrows()), desc="Creating edge list", total=len(df.index), colour="green"):
    elist[i] = create_edge_list(geohash, row.speed_factor, row.neighbours)
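Note that with this pre-allocation, elist becomes a list of lists (one entry per row) rather than the flat edge list produced by the original extend loop; if a flat list is needed, it can be flattened afterwards, e.g. with itertools.chain (a minimal sketch with made-up edge tuples):

```python
from itertools import chain

# Hypothetical per-row results, each the list returned by create_edge_list
elist = [
    [("u1kz2d", "u1kz26", {"distance": 0.73, "time": 1.83})],
    [("u1kz2d", "u1kz2f", {"distance": 0.73, "time": 1.83})],
]

flat = list(chain.from_iterable(elist))  # one flat list of edge tuples
```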
