使用scala/spark进行数据治理

omvjsjqw 于 2023-03-09 发布在 Apache

关注(0)|答案(1)|浏览(140)

我有一个ETL来分析大数据，我所有的表都是Spark 2.2.X的DataFrame。现在，我必须添加数据治理，以便了解数据的来源。例如：
表A

| Col1 | Col2 |  
| ---- | ---- |  
| test | hello |  
| test3 | bye |

表B

| Col1 | Col2 |  
| ---- | ---- |  
| test2 | hey |  
| test3 | bye |

现在我有了两个表，我要做的是连接Col1和Col2 + Col2。
最终表

| Col1 | Col2 |  
| ---- | ---- |  
|test3 | byebye|

我的问题是，Spark DataFrame、API或其他东西中是否有任何函数不会让我对代码进行太多更改，并且可以显示DataFrame中的所有转换？

apache-spark

来源：https://stackoverflow.com/questions/51472745/data-governance-with-scala-spark

1条答案

按热度按时间

o0lyfsai1#

如果你想快速解决这个问题，你可以看看RDD#toDebugString，你可以在你的DataFrame上调用rdd方法，然后通过这个方法显示它的血统。
以下是Jacek Laskowski's book "Mastering Apache Spark"的示例：

scala> val wordCount = sc.textFile("README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:24

scala> wordCount.toDebugString
res13: String =
(2) ShuffledRDD[21] at reduceByKey at <console>:24 []
 +-(2) MapPartitionsRDD[20] at map at <console>:24 []
    |  MapPartitionsRDD[19] at flatMap at <console>:24 []
    |  README.md MapPartitionsRDD[18] at textFile at <console>:24 []
    |  README.md HadoopRDD[17] at textFile at <console>:24 []

这个代码片段，沿着关于RDD谱系和toDebugString的详细解释可以在这里找到。

赞(0）回复(0）举报 2023-03-09

我来回答

使用scala/spark进行数据治理

1条答案

相关问题

热门标签

最新问答