How to compute and store the frequency of items in a PySpark DataFrame column?

Asked by z31licg0 on 2021-05-16 in Spark

I have a dataset:

simpleData = [("person1","city1"), \
    ("person1","city2"), \
    ("person1","city1"), \
    ("person1","city3"), \
    ("person1","city1"), \
    ("person2","city3"), \
    ("person2","city2"), \
    ("person2","city3"), \
    ("person2","city3") \
  ]
columns= ["persons_name","city_visited"]
exp = spark.createDataFrame(data = simpleData, schema = columns)

exp.printSchema()
exp.show()

which looks like this:

root
 |-- persons_name: string (nullable = true)
 |-- city_visited: string (nullable = true)

+------------+------------+
|persons_name|city_visited|
+------------+------------+
|     person1|       city1|
|     person1|       city2|
|     person1|       city1|
|     person1|       city3|
|     person1|       city1|
|     person2|       city3|
|     person2|       city2|
|     person2|       city3|
|     person2|       city3|
+------------+------------+

Now I want to create n new columns, where n is the number of unique items in the city_visited column, so that each column holds the frequency of that item for every person. The output should look like this:

+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2|    0|    1|    3|
+------------+-----+-----+-----+

How can I achieve this?

Answer by wixjitnu:

Use pivot after groupBy:

exp.groupBy('persons_name').pivot('city_visited').count()
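
With the sample data above, this should give something like the following (row order is not guaranteed, and the count is null for a city a person never visited):

+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2| null|    1|    3|
+------------+-----+-----+-----+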

If you want 0 instead of null:

exp.groupBy('persons_name').pivot('city_visited').count().fillna(0)
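
As a side note (a sketch, not part of the original answer): if the set of cities is known up front, you can pass the values to pivot explicitly, which saves Spark an extra pass over the data to compute the distinct values:

# explicit pivot values: only city1, city2 and city3 columns will be created
exp.groupBy('persons_name').pivot('city_visited', ['city1', 'city2', 'city3']).count().fillna(0)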

If you want the result ordered by persons_name, append .orderBy('persons_name') to the query.
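
Putting the pieces together, one way to write the whole query is:

result = (exp.groupBy('persons_name')
             .pivot('city_visited')   # one column per distinct city
             .count()                 # frequency of each (person, city) pair
             .fillna(0)               # replace null with 0 for unvisited cities
             .orderBy('persons_name'))
result.show()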
