我有大约7gb记录的巨大Dataframe。我试图得到的Dataframe计数,并下载它作为csv两个结果在下面的错误。有没有其他方法下载Dataframe而不需要多个分区
print(df.count())
df.coalesce(1).write.option("header", "true").csv('/user/ABC/Output.csv')
Error:
java.io.IOException: Stream is corrupted
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/26 18:15:44 ERROR scheduler.TaskSetManager: Task 8 in stage 360.0 failed 1 times; aborting job
[Stage 360:=======> (8 + 1) / 60]
Py4JJavaError: An error occurred while calling o18867.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 360.0 failed 1 times, most recent failure: Lost task 8.0 in stage 360.0 (TID 13986, localhost, executor driver): java.io.IOException: Stream is corrupted
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:202)
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:228)
at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
at org.apache.spark.io.ReadAheadInputStream$1.run(ReadAheadInputStream.java:168)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
暂无答案!
目前还没有任何答案,快来回答吧!