How to improve the performance of a PySpark join

ltqd579y · 2021-05-27 · posted in Spark

I have two DataFrames, as follows:
df1 (20M rows):

+-----------------+--------------------+--------+----------+                    
|              id |        geolocations|     lat|      long|
+-----------------+--------------------+--------+----------+
|4T1BF1FK1HU376566|kkxyDbypwQ????uGs...|30.60   | -98.39   |
|4T1BF1FK1HU376566|i~nyD~~xvQA??????...|30.55   | -98.27   |
|4T1BF1FK1HU376566|}etyDzqxvQb@Sy@zB...|30.58   | -98.27   |
|JTNB11HK6J3000405|kkxyDbypwQ????uGs...|30.60   | -98.39   |
|JTNB11HK6J3000405|i~nyD~~xvQA??????...|30.55   | -98.27   |

df2 (50 rows):

+---------+-----------+--------------------+
|      lat|       long|               state|
+---------+-----------+--------------------+
|63.588753|-154.493062|              Alaska|
|32.318231| -86.902298|             Alabama|
| 35.20105| -91.831833|            Arkansas|
|34.048928|-111.093731|             Arizona|

I want to add a new state column to df1 by comparing the lat/long values in df1 against those in df2. As the DataFrames above show, an exact join on lat/long would return zero records, so I introduced a threshold and joined on it:

from pyspark.sql import functions as F

threshold = F.lit(3)

def lat_long_approximation(col1, col2, threshold):
    return F.abs(col1 - col2) < threshold

df3 = df1.join(F.broadcast(df2),
               lat_long_approximation(df1.lat, df2.lat, threshold) &
               lat_long_approximation(df1.long, df2.long, threshold))

This takes a very long time. Can anyone help me optimize this join, or suggest a better approach that avoids the separate function (lat_long_approximation)?

acruukt9


You can use between. I'm not sure about the performance, though.

threshold = 10  # for testing

df1.join(F.broadcast(df2),
         df1.lat.between(df2.lat - threshold, df2.lat + threshold) &
         df1.long.between(df2.long - threshold, df2.long + threshold),
         "left").show()
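A note on why both versions can be slow: neither abs(a - b) < threshold nor between is an equality predicate, so Spark cannot use a hash join and falls back to a broadcast nested-loop join, comparing every df1 row against every df2 row. A common workaround (not from the original answer, just a sketch) is to discretize lat/long into grid cells of threshold degrees and equi-join on the cell key, also probing the eight neighbouring cells so matches near a cell boundary are not missed. The cell arithmetic itself is plain Python; THRESHOLD, grid_cell, and candidate_cells are names I made up for this sketch:

```python
import math

THRESHOLD = 3.0  # grid cell size in degrees, matching the question's threshold

def grid_cell(lat, long, size=THRESHOLD):
    """Map a coordinate to an integer grid cell of `size` degrees."""
    return (math.floor(lat / size), math.floor(long / size))

def candidate_cells(lat, long, size=THRESHOLD):
    """All cells that could contain a point within `size` degrees of
    (lat, long): the point's own cell plus its 8 neighbours."""
    cx, cy = grid_cell(lat, long, size)
    return {(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}

# Two nearby coordinates from df1/df2 land in each other's candidate set.
assert grid_cell(30.55, -98.27) in candidate_cells(30.60, -98.39)
```

In Spark terms you would add a cell-key column to df2, explode each df1 row into its nine candidate cells, equi-join on the key, and then apply the exact threshold filter; with only 50 states in df2, the broadcast stays tiny and the comparison work drops from 20M x 50 to roughly one candidate check per nearby cell.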
