Apache Spark: how to get the Parquet output file size and record count

ct3nt3jp, posted 2021-06-03 in Hadoop

I am new to Apache Spark and I want to get the size of the Parquet output file.
My scenario is:
Read a file from CSV and save it as a text file

myRDD.saveAsTextFile("person.txt")

After saving the file, the UI (localhost:4040) shows Input Bytes 15607801 and Output Bytes 13551724.
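
For reference, a minimal sketch of that first step, assuming the CSV is read with sc.textFile (the original post does not show this part, so the file names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup; the post does not show how myRDD was built.
val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))

// Read the CSV as plain text lines and write them back out as text;
// this write path does populate the UI's output metrics.
val myRDD = sc.textFile("person.csv")
myRDD.saveAsTextFile("person.txt")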
But when I save it as a Parquet file

myDF.saveAsParquetFile("person.parquet")

the UI (localhost:4040), on the Stages tab, shows only Input Bytes 15607801 and nothing under Output Bytes.
Can someone help me? Thanks in advance.
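
Since the goal is the size on disk, one workaround (a sketch, not part of the original post; it reuses the SparkContext sc and the output path from above) is to ask the filesystem directly through Hadoop's FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

// Sum the length of every file under the Parquet output directory
// (saveAsParquetFile writes a directory of part files plus metadata).
val fs = FileSystem.get(sc.hadoopConfiguration)
val bytes = fs.getContentSummary(new Path("person.parquet")).getLength
println(s"Parquet output size on disk: $bytes bytes")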
Edit: when I call the REST API, it gives the response below. Note that the Parquet stage (stageId 4) reports outputBytes and outputRecords of 0, while the saveAsTextFile stage (stageId 3) reports 13551724 bytes and 1200540 records.

[ {
  "status" : "COMPLETE",
  "stageId" : 4,
  "attemptId" : 0,
  "numActiveTasks" : 0,
  "numCompleteTasks" : 1,
  "numFailedTasks" : 0,
  "executorRunTime" : 10955,
  "inputBytes" : 15607801,
  "inputRecords" : 1440721,

**"outputBytes" : 0,**
**"outputRecords" : 0,**

  "shuffleReadBytes" : 0,
  "shuffleReadRecords" : 0,
  "shuffleWriteBytes" : 0,
  "shuffleWriteRecords" : 0,
  "memoryBytesSpilled" : 0,
  "diskBytesSpilled" : 0,
  "name" : "saveAsParquetFile at ParquetExample.scala:82",
      "details" : "org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:1494)\ncom.spark.sql.ParquetExample$.main(ParquetExample.scala:82)\ncom.spark.sql.ParquetExample.main(ParquetExample.scala)",
  "schedulingPool" : "default",
  "accumulatorUpdates" : [ ]
}, {
  "status" : "COMPLETE",
  "stageId" : 3,
  "attemptId" : 0,
  "numActiveTasks" : 0,
  "numCompleteTasks" : 1,
  "numFailedTasks" : 0,
  "executorRunTime" : 2091,
  "inputBytes" : 15607801,
  "inputRecords" : 1440721,

**"outputBytes" : 13551724,**
**"outputRecords" : 1200540,**

  "shuffleReadBytes" : 0,
  "shuffleReadRecords" : 0,
  "shuffleWriteBytes" : 0,
  "shuffleWriteRecords" : 0,
  "memoryBytesSpilled" : 0,
  "diskBytesSpilled" : 0,
  "name" : "saveAsTextFile at ParquetExample.scala:77",
      "details" : "org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1379)\ncom.spark.sql.ParquetExample$.main(ParquetExample.scala:77)\ncom.spark.sql.ParquetExample.main(ParquetExample.scala)",
  "schedulingPool" : "default",
  "accumulatorUpdates" : [ ]
} ]
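
The zeros for stage 4 are consistent with older Spark releases simply not recording output metrics for the DataFrame Parquet write path, so the UI has nothing to show there. For the record count, one fallback (again a sketch; sqlContext is assumed to be the SQLContext used to create myDF) is to read the written Parquet directory back and count:

// Read the written Parquet files back and count the rows.
// Slower than reading a metric, but independent of what the UI reports.
val written = sqlContext.parquetFile("person.parquet")
println(s"Parquet output records: ${written.count()}")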
