Merging associated items in Spark

ippsafx7 · posted 2022-12-19 in Apache

In Spark, I have a large list (millions of elements) of arrays whose items are associated with one another:
1: ("A", "C", "D") #此数组中的每个项都与数组中的任何其他元素相关联,因此A和C相关联,A和D相关联,C和D相关联。
2: ("F", "H", "I", "P")
3: ("H", "I", "D")
4: ("X", "Y", "Z")
I want to perform an operation that merges these lists wherever associations span lists. In the example above, the items in the first three rows are all associated with one another (rows 1 and 2 should be combined because, per row 3, D and I are associated). The output should therefore be:
("A", "C", "D", "F", "H", "I", "P")
("X", "Y", "Z")
What kind of transformation can I use in Spark to perform this operation? I have looked at various ways of grouping data, but haven't found an obvious way to merge lists that share common elements.
Thanks, everyone!


bprjcwpo1#

As some users have already said, this can be seen as a graph problem: you want to find the connected components of a graph.
Since you are working with Spark, I think this is a good opportunity to show how to use GraphFrames from Python. To run this example you will need the pyspark and graphframes Python packages.

from pyspark.sql import SparkSession
from graphframes import GraphFrame
from pyspark.sql import functions as f

spark = (
    SparkSession.builder.appName("test")
    .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
    .getOrCreate()
)
# GraphFrames' connectedComponents requires a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")
# Let's create a sample DataFrame with the data from the question.
df = spark.createDataFrame(
    [
        (1, ["A", "C", "D"]),
        (2, ["F", "H", "I", "P"]),
        (3, ["H", "I", "D"]),
        (4, ["X", "Y", "Z"]),
    ],
    ["id", "values"],
)

# Use explode to turn each list into one row per (id, node) pair.
df = df.withColumn("node", f.explode("values"))
df.createOrReplaceTempView("temp_table")
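# For the sample data, the exploded DataFrame has one row per (id, item) pair
# (the original "values" column is kept alongside):
#   id | node
#   ---+-----
#    1 | A
#    1 | C
#    1 | D
#    2 | F
#   ... and so on for the remaining ids.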
# Then we can join the table with itself to generate an edge table with source and destination nodes.
edge_table = spark.sql(
    """
    SELECT DISTINCT
        a.node AS src,
        b.node AS dst
    FROM temp_table a
    JOIN temp_table b
      ON a.id = b.id AND a.node != b.node
    """
)
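# For row 1 this produces the edges (A, C), (A, D), (C, A), (C, D), (D, A), (D, C):
# both directions of every pair of items that co-occur in a list.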

# Now we define our graph using a vertex table (a DataFrame with the node ids)
# and our edge table,
# then we use the connectedComponents method to find the components.
cc_df = GraphFrame(
    df.selectExpr("node as id").drop_duplicates(), edge_table
).connectedComponents()

# The cc_df dataframe will have two columns, the node id and the connected component.
# To get the desired result we can group by the component and create a list
cc_df.groupBy("component").agg(f.collect_list("id")).show(truncate=False)

The output you get will look like the following:

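For the sample data the grouping comes out as below. The component ids shown are illustrative: GraphFrames assigns arbitrary long values, so yours will differ.

+----------+---------------------+
|component |collect_list(id)     |
+----------+---------------------+
|0         |[A, C, D, F, H, I, P]|
|1         |[X, Y, Z]            |
+----------+---------------------+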
The dependencies can be installed with:

pip install -q pyspark==3.2.0 graphframes

bf1o4zei2#

There may not be enough information in the question to solve this completely, but I would suggest creating an adjacency matrix/list with GraphX to represent the data as a graph; hopefully you can work out the rest from there. A sketch of the adjacency-list idea follows the links below.
https://en.wikipedia.org/wiki/Adjacency_matrix
https://spark.apache.org/docs/latest/graphx-programming-guide.html
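
GraphX itself only exposes a Scala/JVM API, but the adjacency-list construction this answer suggests is easy to sketch with PySpark's RDD API: emit an edge for every pair of items that appear in the same list. This is a minimal sketch assuming the sample data from the question; rows and edges are illustrative names.

from itertools import combinations

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adjacency").getOrCreate()

# The association lists from the question.
rows = spark.sparkContext.parallelize(
    [["A", "C", "D"], ["F", "H", "I", "P"], ["H", "I", "D"], ["X", "Y", "Z"]]
)

# Every pair of items that co-occurs in a list becomes an undirected edge,
# giving the adjacency-list representation of the graph.
edges = rows.flatMap(lambda items: combinations(sorted(items), 2)).distinct()
print(edges.collect())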


wdebmtf23#

If you are using a PySpark kernel, this solution should work (note that it is plain Python running on the driver, so the list of tuples must fit in memory):

# tuple_list holds the association lists; here, the sample data from the question.
tuple_list = [("A", "C", "D"), ("F", "H", "I", "P"), ("H", "I", "D"), ("X", "Y", "Z")]

iset = set(frozenset(s) for s in tuple_list)  # Convert to a set of frozensets
result = []
while iset:                     # While there are sets left to process:
    nset = set(iset.pop())      # Pop a new set
    check = len(iset)           # Are there more sets to compare against?
    while check:                # Until a full pass makes no merges:
        check = False
        for s in iset.copy():         # For each remaining set:
            if nset.intersection(s):  # If it shares an element with nset:
                check = True          # Must recheck the remaining sets
                iset.remove(s)        # Remove it from the remaining sets
                nset.update(s)        # Merge it into the current set
    result.append(tuple(nset))  # Store the merged set as a tuple
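
If the data starts out in Spark, you first have to collect it to the driver. A minimal sketch, assuming a DataFrame df with an array column named values as in the first answer:

# Pull the array column down to the driver and run the merging loop above on it.
tuple_list = [tuple(row["values"]) for row in df.select("values").collect()]

For the sample data, result ends up as [('A', 'C', 'D', 'F', 'H', 'I', 'P'), ('X', 'Y', 'Z')]; the order of elements within each tuple is arbitrary because sets are unordered.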
