PySpark: sort values

Posted by gywdnpxw on 2021-07-13 in Spark


I have the following data:

[(u'ab', u'cd'),
 (u'ef', u'gh'),
 (u'cd', u'ab'),
 (u'ab', u'gh'),
 (u'ab', u'cd')]

I want to run a MapReduce job over this data to count how often each pair occurs.
The result I get is:

[((u'ab', u'cd'), 2),
 ((u'cd', u'ab'), 1),
 ((u'ab', u'gh'), 1),
 ((u'ef', u'gh'), 1)]

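As you can see, (u'ab', u'cd') should have a count of 3 rather than 2, because (u'cd', u'ab') is the same pair.
My question is: how can I make the program count (u'cd', u'ab') and (u'ab', u'cd') as the same pair? I was thinking of sorting the values in each row, but I could not find a solution.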

mapreduce rdd apache-spark pyspark sorting

Source: https://stackoverflow.com/questions/66197266/pyspark-sort-values

2 Answers


#1 uurity8g

You can sort the values within each pair so that both orderings map to the same key, then use reduceByKey to count the pairs:

rdd1 = rdd.map(lambda x: (tuple(sorted(x)), 1))\
    .reduceByKey(lambda a, b: a + b)

rdd1.collect()

# [(('ab', 'gh'), 1), (('ef', 'gh'), 1), (('ab', 'cd'), 3)]
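The key step above is `tuple(sorted(x))`. The sketch below checks that step locally in plain Python, without Spark; the helper name `normalize` and the use of `collections.Counter` as a stand-in for `reduceByKey` are assumptions made for illustration only. It also shows why the `tuple(...)` wrapper matters: `sorted` returns a list, and lists are not hashable, so they cannot serve as dictionary-style keys.

```python
from collections import Counter

# Sample data from the question; a plain list stands in for the RDD.
pairs = [("ab", "cd"), ("ef", "gh"), ("cd", "ab"), ("ab", "gh"), ("ab", "cd")]


def normalize(pair):
    """Hypothetical helper: order the two elements and return a hashable tuple."""
    return tuple(sorted(pair))


# sorted() alone yields a list, which cannot be hashed and so cannot be a key.
try:
    hash(sorted(pairs[0]))
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Counting normalized pairs locally mirrors map(...) followed by reduceByKey(...).
counts = Counter(normalize(p) for p in pairs)

print(list_is_hashable)
print(counts[("ab", "cd")])
```

Both orderings of a pair collapse to one key, so the local count for ('ab', 'cd') matches the 3 shown in the Spark output above.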

2021-07-13

#2 mrwjdhj3

You can key each element by its sorted values with keyBy, then count the occurrences of each key with countByKey:

result = rdd.keyBy(lambda x: tuple(sorted(x))).countByKey()

print(result)

# defaultdict(<class 'int'>, {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1})

To convert the result into a list, you can do the following:

result2 = sorted(result.items())

print(result2)

# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]
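Since countByKey returns an ordinary dictionary to the driver, any further ordering is plain Python. As a small sketch (the variable `by_count` is a name chosen here, and the dictionary literal simply copies the output printed above), the pairs could also be ranked by frequency rather than by key:

```python
# The counts printed above, copied into a plain dict for illustration.
result = {("ab", "cd"): 3, ("ef", "gh"): 1, ("ab", "gh"): 1}

# Sort by count descending; break ties by the pair itself, ascending.
by_count = sorted(result.items(), key=lambda kv: (-kv[1], kv[0]))

print(by_count)
```

Because the whole dictionary is collected on the driver, this approach is best suited to cases where the number of distinct pairs is small.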

2021-07-13
