The Spark DataFrame API does not support custom partitioners yet, so the connector cannot bring the C* partitioner into the DataFrame engine. The RDD Spark API, on the other hand, does support custom partitioners, so you can load the data into an RDD and then convert it to a DataFrame. Here is the connector documentation on using the C* partitioner: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md

The keyBy() function lets you define which columns are used as the grouping key.

Here is a working example. It is not short, so I hope someone can improve it:

// load data into an RDD and define the grouping key
// (assuming the id column is a C* int and header is text)
import com.datastax.spark.connector._

val rdd = sc.cassandraTable[(Int, String)]("test", "test")
  .select("id" as "_1", "header" as "_2")
  .keyBy[Tuple1[Int]]("id")

// check that the partitioner is a CassandraPartitioner
rdd.partitioner

// call distinct within each group, flatten the result, and build a two-column DF
val df = rdd.groupByKey
  .flatMap { case (key, group) => group.toSeq.distinct }
  .toDF("id", "header")
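For context, here is a minimal sketch of the setup the example assumes when run outside the spark-shell. The app name and connection host below are illustrative assumptions, and the "test"."test" schema (id int, header text) is inferred from the example rather than stated in the original:

import org.apache.spark.sql.SparkSession

// hypothetical setup; point spark.cassandra.connection.host at your cluster
val spark = SparkSession.builder()
  .appName("cassandra-partitioner-example") // illustrative name
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumption: local node
  .getOrCreate()

val sc = spark.sparkContext
import spark.implicits._ // needed for .toDF outside the spark-shell

Because keyBy covers the full partition key, rdd.partitioner should return a CassandraPartitioner, and groupByKey can then group rows without a full shuffle; you can check this by looking for the absence of a ShuffledRDD stage in rdd.groupByKey.toDebugString.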