pandas 如何提高在df中搜索列表的速度?

wlwcrazw  于 2023-08-01  发布在  其他
关注(0)|答案(1)|浏览(105)

我有一个 Dataframe ,其中有一列地理坐标列表(按它们所属的输电线路分组)和 Dataframe 外的另一个坐标列表。如果样本编号为DS_3,我需要找到 Dataframe 中每个列表坐标的位置,但由于几乎有6000个坐标,这需要花费太长的时间。我敢打赌,有一种向量化的方法可以做到这一点,因为我已经了解到循环遍历 Dataframe 是不切实际的,但我对向量方法了解不多。
下面是 Dataframe :

id  coordinates                                 voltage   length    voltage x length
0   36  [[-81.443569, 28.470022], [-81.446726, 28.4740...   230 2788.450481 6.413436e+05
1   69  [[-82.402208, 27.907588], [-82.406592, 27.9084...   69  1486.634968 1.025778e+05
2   87  [[-80.38392, 25.748665], [-80.383758, 25.74358...   69  3395.795388 2.343099e+05
3   128 [[-81.423956, 28.410278], [-81.424811, 28.4053...   69  5231.189711 3.609521e+05
4   138 [[-81.843314, 30.572359], [-81.844404, 30.5685...   230 2716.984353 6.249064e+05
... ... ... ... ... ...
3061    68184   [[-81.251491, 28.53718], [-81.250396, 28.53283...   69  19243.512450    1.327802e+06
3062    68189   [[-82.669886, 28.961533], [-82.664782, 28.9615...   230 27463.901761    6.316697e+06
3063    68196   [[-81.157196, 29.000982], [-81.157041, 28.9958...   500 90524.038042    4.526202e+07
3064    68199   [[-80.549594, 28.481094], [-80.551733, 28.4857...   115 7185.881445 8.263764e+05
3065    68211   [[-80.44025, 25.81403]] 115 673.881802  7.749641e+04

字符串
下面是坐标列表的开始:

[[-81.708274, 31.095992], [-81.708763, 31.090911], [-81.709349, 31.085841], [-81.710002, 31.080779], [-81.710627, 31.075713], [-81.711167, 31.070638], [-81.711649, 31.065557], [-81.712316, 31.060497], [-81.713036, 31.055444], [-81.713757, 31.050391], [-81.714478, 31.045338], [-81.715199, 31.040285], [-82.184384, 31.058297], [-82.188367, 31.061488], [-82.192045, 31.065027], [-82.195735, 31.068554], [-82.199426, 31.072079], [-82.20315, 31.075567], [-82.207127, 31.078767], [-82.211101, 31.081969], [-82.215077, 31.08517], [-82.219057, 31.088366], [-82.223033, 31.091567], [-82.227002, 31.094776], [-82.230978, 31.097977], [-82.234959, 31.101171], [-82.238934, 31.104373], [-82.242912, 31.107571], [-82.24689, 31.110769], [-82.250862, 31.113975], [-82.255483, 31.116065], [-82.260227, 31.117947], [-82.26497, 31.119832], [-82.269711, 31.121722], [-82.274457, 31.123602], [-82.279199, 31.125489], [-82.283947, 31.127364], [-82.28869, 31.129249], [-82.293435, 31.131131], [-82.298182, 31.133006], [-82.30292, 31.134905] ...


这是我目前使用的代码;任何提示将不胜感激!

for index, row in df.iterrows():
    for i in range(len(coordinates)):
        if coordinates[i] in row['coordinates'] and sample[i]['sample_0'] == 'DS_3':
                    if row['id'] in damaged_lines:
                        break
                    else:
                        damaged_lines.append(row['id'])


更新代码:

for i in range(len(coordinates)):
    for index, row in df.iterrows():
        if coordinates[i] in row['coordinates'] and sample[i]['sample_0'] == 'DS_3':
                    if row['id'] in damaged_lines:
                        break
                    else:
                        damaged_lines.append(row['id']) 
                        break

4c8rllxm

4c8rllxm1#

下面是一个如何通过二进制搜索实现这一点的示例。这不会检查'DS_3',但添加起来很容易。我生成了100行随机数量的坐标对。然后,我从中提取一个坐标列表沿着它们来自的行的索引。然后,我从中提取一个随机样本进行查找,然后对提取的样本进行简单的循环,对列表进行二分搜索。

import os
import numpy
import bisect
import random
import pandas as pd

# Generate 100 random rows.

rows = []
for _ in range(100):
    n = random.randrange(2,5)
    rows.append( [[random.random(),random.random()] for _ in range(n)] )

# Turn that into a dataframe.

ids = list(range(100))
df = pd.DataFrame( { 'id': ids, 'coordinates': rows } )

# Now construct a single index.

allvals = []
for idx,row in df.iterrows():
    for v in row['coordinates']:
        allvals.append( (v, row['id']) )

# Pick an assortment for searching.

coordinates = [i[0] for i in random.choices(allvals,k=15)]

# Sort the index.

allvals.sort()

for v in coordinates:
    i = bisect.bisect_left(allvals, (v,0))
    print(v, "found with index", allvals[i] )

字符串
输出量:

[0.11420241950096588, 0.3012171391829337] found with index ([0.11420241950096588, 0.3012171391829337], 26)
[0.5291298973513743, 0.7171475026345056] found with index ([0.5291298973513743, 0.7171475026345056], 46)
[0.37665801783339126, 0.9637006925267728] found with index ([0.37665801783339126, 0.9637006925267728], 38)
[0.470318651764595, 0.645338416806701] found with index ([0.470318651764595, 0.645338416806701], 42)
[0.1058926573158816, 0.5197311227746729] found with index ([0.1058926573158816, 0.5197311227746729], 44)
[0.47475506322839045, 0.5079794380642515] found with index ([0.47475506322839045, 0.5079794380642515], 12)
[0.36731827174930154, 0.27873066502693755] found with index ([0.36731827174930154, 0.27873066502693755], 19)
[0.16012365898638536, 0.9036887843952794] found with index ([0.16012365898638536, 0.9036887843952794], 39)
[0.5443093669600313, 0.014991094298460905] found with index ([0.5443093669600313, 0.014991094298460905], 96)
[0.03875984282281675, 0.13183885631959258] found with index ([0.03875984282281675, 0.13183885631959258], 76)
[0.48366667699166943, 0.40255967639358115] found with index ([0.48366667699166943, 0.40255967639358115], 63)
[0.1743226390424386, 0.7617040167975732] found with index ([0.1743226390424386, 0.7617040167975732], 20)
[0.9862858775797206, 0.5949703957196504] found with index ([0.9862858775797206, 0.5949703957196504], 16)
[0.9475511012671365, 0.27111055784024374] found with index ([0.9475511012671365, 0.27111055784024374], 34)
[0.9686244585578948, 0.8454092861137659] found with index ([0.9686244585578948, 0.8454092861137659], 72)

相关问题