I created a DataFrame with the following code:
df = spark.createDataFrame([("A", "20"), ("B", "30"), ("D", "80"), ("A", "120"), ("c", "20"), ("Null", "20")], ["Let", "Num"])
df.show()
+----+---+
| Let|Num|
+----+---+
|   A| 20|
|   B| 30|
|   D| 80|
|   A|120|
|   c| 20|
|Null| 20|
+----+---+
I want to produce a DataFrame like this:
+----+-------+
| Let|Num    |
+----+-------+
|   A| 20,120|
|   B| 30    |
|   D| 80    |
|   c| 20    |
|Null| 20    |
+----+-------+
How can I do this?
1 Answer
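You can `groupBy` the `Let` column and gather each group's `Num` values into a list with `collect_list`. Wrapping that list in `concat_ws` then joins it into a single comma-separated string.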
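As a rough mental model of those two steps, here is a minimal plain-Python sketch that needs no Spark session. The names `rows`, `collect_by_key`, and `join_groups` are made up for illustration and are not part of any Spark API. One difference to keep in mind: this sketch preserves input order within each group, whereas Spark does not guarantee the order of elements returned by `collect_list` after a shuffle.

```python
from collections import defaultdict

# Same rows as the question's DataFrame, as plain (Let, Num) tuples.
rows = [("A", "20"), ("B", "30"), ("D", "80"),
        ("A", "120"), ("c", "20"), ("Null", "20")]


def collect_by_key(pairs):
    """Group values by key, in input order (roughly what collect_list does per group)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def join_groups(groups, sep=","):
    """Join each group's list into one string (the role concat_ws plays)."""
    return {key: sep.join(values) for key, values in groups.items()}


collected = collect_by_key(rows)
joined = join_groups(collected)
print(collected["A"])  # ['20', '120']
print(joined["A"])     # 20,120
```

The PySpark code that follows does the same thing on the DataFrame itself. If you want the result column to be called `Num`, as in the desired output, appending `.alias("Num")` to the aggregated column expression should do it.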
```python
from pyspark.sql import functions as F

# Step 1: collect every Num value for each Let into an array column
df.groupBy("Let").agg(F.collect_list("Num")).show()
# +----+-----------------+
# | Let|collect_list(Num)|
# +----+-----------------+
# |   B|             [30]|
# |   D|             [80]|
# |   A|        [20, 120]|
# |   c|             [20]|
# |Null|             [20]|
# +----+-----------------+

# Step 2: join each array into one comma-separated string
df.groupBy("Let").agg(F.concat_ws(",", F.collect_list("Num"))).show()
# +----+-------------------------------+
# | Let|concat_ws(,, collect_list(Num))|
# +----+-------------------------------+
# |   B|                             30|
# |   D|                             80|
# |   A|                         20,120|
# |   c|                             20|
# |Null|                             20|
# +----+-------------------------------+
```