连接名称的值计数

ufj5ltwl  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(307)

我想转换这个PyparkDataframe:

df = spark.createDataFrame([
  ("A", 1),
  ("A", 2), 
  ("A", 3),
  ("B", 1),
  ("B", 2),
  ("B", 4), 
  ("B", 5)
],
  ["name", "connect"]
)

df.show()
+----+-------+
|name|connect|
+----+-------+
|   A|      1|
|   A|      2|
|   A|      3|
|   B|      1|
|   B|      2|
|   B|      4|
|   B|      5|
+----+-------+

转换为以下格式:

df_out = spark.createDataFrame([
  ("A", "A", 3),
  ("B", "B", 4), 
  ("A", "B", 2)
],
  ["name1", "name2", "n_connect"]
)

df_out.show()
+-----+-----+---------+
|name1|name2|n_connect|
+-----+-----+---------+
|    A|    A|        3|
|    B|    B|        4|
|    A|    B|        2|
+-----+-----+---------+

i、 我想知道每个名字有多少个“连接”,我想知道每个名字之间有多少个共享的“连接”。spark中有什么标准函数允许我这样做吗?

wlzqhblo

wlzqhblo1#

你可以做一个自连接,合并相同的组合,即a->b和b->a,然后countdistinct connect 对于每个组合。下面我们用 sort_array(array(d1.name, d2.name)) 要对唯一的名称组合进行分组,请执行以下操作:

from pyspark.sql.functions import countDistinct

df_new = df.alias("d1").join(df.alias("d2"), "connect") \
    .selectExpr("sort_array(array(d1.name, d2.name)) as names", "d1.connect") \
    .groupby("names") \
    .agg(countDistinct("connect").alias("n_connect"))
+------+---------+
| names|n_connect|
+------+---------+
|[A, A]|        3|
|[B, B]|        4|
|[A, B]|        2|
+------+---------+

df_new.selectExpr("names[0] as name1", "names[1] as name2", "n_connect").show()
+-----+-----+---------+
|name1|name2|n_connect|
+-----+-----+---------+
|    A|    A|        3|
|    B|    B|        4|
|    A|    B|        2|
+-----+-----+---------+

你可以用Pandas做类似的事情:

pdf = df.toPandas()
pdf.merge(pdf, on="connect") \
    .assign(names=lambda x: [tuple(sorted(z)) for z in zip(x.name_x, x.name_y)]) \
    .groupby('names')["connect"].nunique()

# Out[*]:

# names

# (A, A)    3

# (A, B)    2

# (B, B)    4

根据@anky的建议,使用np.sort()对名称进行排序:

import numpy as np
names = ["name_x", "name_y"]
pdf1 = pdf.merge(pdf, on="connect")
pdf1[names] = np.sort(pdf1[names],1)
pdf1.groupby(names)["connect"].nunique().reset_index()

# name_x name_y  connect

# 0      A      A        3

# 1      A      B        2

# 2      B      B        4

相关问题