PySpark left join without duplicates

mtb9vblg · posted 2023-01-16 in Spark

I have a left PySpark DataFrame:

+----------+----------+
|session_id|time      |
+----------+----------+
|1         |10        |
|2         |20        |
|3         |30        |
+----------+----------+

And the right one:

+----------+----------+
|res_id    |sess_id   |
+----------+----------+
|1         |1         |
|2         |2         |  
|1         |1         |
+----------+----------+

I need to get:

+----------+---------+----------+
|res_id    |sess_id  | time     |
+----------+---------+----------+
|1         |1        |  10      |
|2         |2        |  20      |
|1         |1        |  10      |
+----------+---------+----------+

How can I achieve this? A left/inner join duplicates my res_id records....
Thank you,


y1aodyip1#

A left/inner join is duplicating my res_id records....
Maybe sharing your code would help? This seems to do what you need:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame(
    [
        {"session_id": 1, "time": 10},
        {"session_id": 2, "time": 20},
        {"session_id": 3, "time": 30},
    ]
)

right = spark.createDataFrame(
    [
        {"res_id": 1, "sess_id": 1},
        {"res_id": 2, "sess_id": 2},
        {"res_id": 3, "sess_id": 1},
    ]
)

(
    left.join(right, left.session_id == right.sess_id).select(
        "res_id", "sess_id", "time"
    )
).show()

Its output:

+------+-------+----+                                                           
|res_id|sess_id|time|  
+------+-------+----+  
|     1|      1|  10|  
|     3|      1|  10|  
|     2|      2|  20|  
+------+-------+----+

This has the same shape as the desired output above (the sample right-hand data here uses res_id 3 instead of repeating res_id 1, which is why one row differs).
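
If the duplicates you want to avoid come from repeated (res_id, sess_id) rows in the right table (an assumption on my part, since the desired output above actually keeps the repeated row), a minimal sketch would be to deduplicate the right side before joining; deduped_right is just an illustrative name:

# Assumption: repeated (res_id, sess_id) pairs in `right` are unwanted,
# so keep one copy of each pair before joining.
deduped_right = right.dropDuplicates(["res_id", "sess_id"])

(
    left.join(deduped_right, left.session_id == deduped_right.sess_id)
    .select("res_id", "sess_id", "time")
    .show()
)

If instead you also need to keep sessions that have no match in the right table, pass "left" as the join type to keep every row of the left DataFrame.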
