How to merge two DataFrames in Apache Spark without a common key?

Posted by ybzsozfc on 2021-06-02 in Hadoop

I am trying to join two DataFrames.
data: DataFrame[_1: bigint, _2: vector]
cluster: DataFrame[cluster: bigint]

from pyspark.sql.functions import broadcast
result = data.join(broadcast(cluster))

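The strange thing is that all executors fail at the join step.
I have no idea what I could do.
The data file on HDFS is 2.8 GB, while the cluster data is only 5 MB. Both files are read as Parquet.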

hadoop python apache-spark apache-spark-sql parquet

Source: https://stackoverflow.com/questions/39876536/how-to-merge-two-dataframes-in-apache-spark-without-common-key

1 Answer

xkrw2x1b #1

What works is:

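from pyspark.sql.functions import monotonically_increasing_id

# Give both DataFrames a generated "id" column to join on.
data = sqlContext.read.parquet(data_path)
data = data.withColumn("id", monotonically_increasing_id())

cluster = sqlContext.read.parquet(cluster_path)
cluster = cluster.withColumn("id", monotonically_increasing_id())

result = data.join(cluster, on="id")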

Adding the cluster DataFrame directly to the data DataFrame:

data.withColumn("cluster", cluster.cluster)

does not work, because withColumn can only reference columns of the DataFrame it is called on.

data.join(cluster)

does not work either; the executors fail even though they have enough memory.
No idea why it does not work...

Answered 2021-06-03
