PySpark: subselect/subquery join using DataFrames

ffscu2ro · asked 2021-07-13 in Spark

I want to join on the closest match at or below a given value. In SQL I can do this easily. Consider the following data:
tblActuals:

|Date       |Temperature
|09/02/2020 |14.1
|10/02/2020 |15.3
|11/02/2020 |12.2
|12/02/2020 |12.4
|13/02/2020 |12.5
|14/02/2020 |11
|15/02/2020 |14.6

tblCoefficients:

|Metric |Coefficient
|10.5   |0.997825593
|11     |0.997825593
|11.5   |0.997663198
|12     |0.997307614
|12.5   |0.996848773
|13     |0.996468537
|13.5   |0.99638519
|14     |0.996726301
|14.5   |0.997435894
|15     |0.998311153
|15.5   |0.999135509
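
For anyone wanting to reproduce this, a minimal sketch building the two tables as DataFrames (this is an assumption for illustration: it presumes an active SparkSession named spark, and the variable names simply mirror the table names above):

tblActuals = spark.createDataFrame(
    [("09/02/2020", 14.1), ("10/02/2020", 15.3), ("11/02/2020", 12.2),
     ("12/02/2020", 12.4), ("13/02/2020", 12.5), ("14/02/2020", 11.0),
     ("15/02/2020", 14.6)],
    ["Date", "Temperature"],
)
tblCoefficients = spark.createDataFrame(
    [(10.5, 0.997825593), (11.0, 0.997825593), (11.5, 0.997663198),
     (12.0, 0.997307614), (12.5, 0.996848773), (13.0, 0.996468537),
     (13.5, 0.99638519), (14.0, 0.996726301), (14.5, 0.997435894),
     (15.0, 0.998311153), (15.5, 0.999135509)],
    ["Metric", "Coefficient"],
)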

In SQL I can achieve the join like this:

Select
    a.Date,
    a.Temperature,
    (select top 1 b.Coefficient from tblCoefficients b where b.Metric <= a.Temperature order by b.Metric desc) as Coefficient
from tblActuals a

Is there any way to achieve the above with the data in two PySpark DataFrames? I can get a similar result in Spark SQL, but I need the flexibility of DataFrames for the process I'm building in Databricks. A sketch of the Spark SQL route is shown below.
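
For reference, one way the Spark SQL route could look (a sketch, assuming an active SparkSession named spark; Spark SQL has no TOP 1, so the closest metric is picked by taking the max over (Metric, Coefficient) structs instead):

tblActuals.createOrReplaceTempView("tblActuals")
tblCoefficients.createOrReplaceTempView("tblCoefficients")

result = spark.sql("""
    SELECT a.Date,
           a.Temperature,
           -- structs compare by their first field, so the max struct
           -- carries the Coefficient of the largest qualifying Metric
           max(struct(b.Metric, b.Coefficient)).Coefficient AS coefficient
    FROM tblActuals a
    JOIN tblCoefficients b ON b.Metric <= a.Temperature
    GROUP BY a.Date, a.Temperature
""")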

rjee0c15 · answer #1

You can do a join and take the coefficient of the largest (i.e. closest) metric:

import pyspark.sql.functions as F

result = tblActuals.join(
    tblCoefficients,
    # non-equi join: keep every metric at or below the temperature
    tblActuals['Temperature'] >= tblCoefficients['Metric']
).groupBy(tblActuals.columns).agg(
    # structs compare by their first field, so max picks the row with
    # the largest Metric; then extract that row's Coefficient
    F.max(F.struct('Metric', 'Coefficient'))['Coefficient'].alias('coefficient')
)

result.show()
+----------+-----------+-----------+
|      Date|Temperature|coefficient|
+----------+-----------+-----------+
|15/02/2020|       14.6|0.997435894|
|12/02/2020|       12.4|0.997307614|
|14/02/2020|       11.0|0.997825593|
|13/02/2020|       12.5|0.996848773|
|11/02/2020|       12.2|0.997307614|
|10/02/2020|       15.3|0.998311153|
|09/02/2020|       14.1|0.996726301|
+----------+-----------+-----------+
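
A possible variation, not from the original answer: on Spark 3.0+ the max_by aggregate expresses the same idea without the struct trick:

import pyspark.sql.functions as F

result = tblActuals.join(
    tblCoefficients,
    tblActuals['Temperature'] >= tblCoefficients['Metric']
).groupBy(tblActuals.columns).agg(
    # max_by returns the Coefficient of the row with the largest Metric
    F.expr("max_by(Coefficient, Metric)").alias('coefficient')
)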
