Unable to write data files with PySpark?

brccelvz · posted 2021-05-27 in Spark
Follow (0) | Answers (1) | Views (462)

I'm trying to write my DataFrame out to other file formats. Judging by online tutorials, this should work:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)

df = spark.read.csv("table.csv")

df.write.orc("tests/file.orc")

But the write.orc call produces the following long error:

20/06/01 13:54:32 ERROR Executor: Exception in task 0.0 in stage 63.0 (TID 63)
java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
    at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
    at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:177)
    at org.apache.orc.OrcFile.createWriter(OrcFile.java:860)
    at org.apache.orc.mapreduce.OrcOutputFormat.getRecordWriter(OrcOutputFormat.java:50)
    at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:43)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:121)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

20/06/01 13:54:32 WARN TaskSetManager: Lost task 0.0 in stage 63.0 (TID 63, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc
    (same stack trace as above)

20/06/01 13:54:32 ERROR TaskSetManager: Task 0 in stage 63.0 failed 1 times; aborting job

20/06/01 13:54:32 ERROR FileFormatWriter: Aborting job 55384644-29f4-4a2c-8a40-def2a2e2da73.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 63.0 failed 1 times, most recent failure: Lost task 0.0 in stage 63.0 (TID 63, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc
    (same stack trace as above)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at org.apache.spark.sql.DataFrameWriter.orc(DataFrameWriter.scala:588)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc
    (same stack trace as above)
    ... 1 more

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\t-aldouc\AppData\Local\Programs\Python\Python37\lib\site-packages\pyspark\sql\readwriter.py", line 960, in orc
    self._jwrite.orc(path)
  File "C:\Users\t-aldouc\AppData\Local\Programs\Python\Python37\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\t-aldouc\AppData\Local\Programs\Python\Python37\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a,**kw)
  File "C:\Users\t-aldouc\AppData\Local\Programs\Python\Python37\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o1488.orc.
: org.apache.spark.SparkException: Job aborted.

    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
    (same frames as in the driver stacktrace above, from InsertIntoHadoopFsRelationCommand.run onward)

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 63.0 failed 1 times, most recent failure: Lost task 0.0 in stage 63.0 (TID 63, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\t-aldouc\PycharmProjects\SE-DAT-EAIB-RecSysSpark\tests\test_data\test_orc.orc\_temporary\0\_temporary\attempt_20200601135431_0063_m_000000_63\part-00000-1b6b5b32-a036-414b-bb33-07e39b81c5d3-c000.snappy.orc
    (same stack trace as above)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
.....can't enter this many characters

I got similar errors when trying to write Parquet and CSV files. I did manage to write the Parquet file by converting the df to a pandas DataFrame and using to_parquet(), but I can't find a comparable workaround for .orc files. How can I fix this? I have already tried adding the Hadoop path variable, with no effect.
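For reference, a minimal sketch of the pandas workaround described above (it assumes pandas plus a Parquet engine such as pyarrow is installed, and the output file name is illustrative; note that toPandas() collects the entire DataFrame onto the driver, so it only works for data that fits in memory):

# Bypass Spark's writer: collect to pandas, then write with pandas itself.
pdf = df.toPandas()                    # brings all rows to the driver
pdf.to_parquet("tests/file.parquet")   # requires pyarrow or fastparquet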

czfnxgou 1#

I had to run PyCharm as administrator, and then it ran fine.
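On Windows, the "(null) entry in command string: null chmod" error usually means Hadoop cannot locate winutils.exe. If running the IDE as administrator is not enough, a common alternative is to point HADOOP_HOME at a directory whose bin folder contains winutils.exe before the SparkContext is created. A minimal sketch (the C:\hadoop path is an assumed install location, adjust as needed):

import os

# Assumption: winutils.exe has been placed in C:\hadoop\bin.
# Both variables must be set before SparkContext() is constructed.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"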
