I have a dataset
simpleData = [("person1","city1"), \
("person1","city2"), \
("person1","city1"), \
("person1","city3"), \
("person1","city1"), \
("person2","city3"), \
("person2","city2"), \
("person2","city3"), \
("person2","city3") \
]
columns= ["persons_name","city_visited"]
exp = spark.createDataFrame(data = simpleData, schema = columns)
exp.printSchema()
exp.show()
which looks like this -
root
|-- persons_name: string (nullable = true)
|-- city_visited: string (nullable = true)
+------------+------------+
|persons_name|city_visited|
+------------+------------+
| person1| city1|
| person1| city2|
| person1| city1|
| person1| city3|
| person1| city1|
| person2| city3|
| person2| city2|
| person2| city3|
| person2| city3|
+------------+------------+
Now I want to create n new columns, where n is the number of unique values in the column named "city_visited", so that they hold the frequency of each unique value for every person. The output should look like this -
+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
| person1| 3| 1| 1|
| person2| 0| 1| 3|
+------------+-----+-----+-----+
How can I achieve this?
1 Answer
wixjitnu1
Use pivot after groupBy:
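The original code was lost here; a minimal sketch of that step, using the exp DataFrame from the question, should look like this:

# Count visits per (person, city) pair; pivot turns each distinct
# city_visited value into its own column.
exp.groupBy("persons_name").pivot("city_visited").count().show()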
If you want 0 instead of null:
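A sketch of the same query with the missing counts filled in (na.fill(0) replaces the nulls that pivot produces for combinations that never occur; fillna(0) is equivalent):

exp.groupBy("persons_name").pivot("city_visited").count().na.fill(0).show()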
If you want to sort by persons_name, append .orderBy('persons_name') to the query.
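For example, the full chain should then reproduce the table shown in the question:

exp.groupBy("persons_name").pivot("city_visited").count() \
    .na.fill(0) \
    .orderBy("persons_name") \
    .show()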