无法通过pyspark读取parquet中的一列,而其他列的读取是正确的

bkhjykvo  于 2021-05-29  发布在  Spark
关注(0)|答案(0)|浏览(261)

我有一些带有字符串ID的列。其中一个读对了,但另一个不管我做什么都不能是红色的。我设置了限制(20),但仍然没有结果。我的Parquet文件是相当大的事实上,但为什么我不能读只有一列,没有它一切顺利。

java --version
openjdk 11.0.7 2020-04-14
OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04)
OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)

pyspark --version
20/06/19 15:56:20 WARN Utils: Your hostname, anton-G5-5587 resolves to a loopback address: 127.0.1.1; using 192.168.0.25 instead (on interface wlp0s20f3)
20/06/19 15:56:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/anton/Documents/libs/spark/jars/spark-unsafe_2.12-3.0.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.7
Branch HEAD
Compiled by user ubuntu on 2020-06-06T13:05:28Z

我使用为Apache3.2和更高版本预先构建的spark

spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

df = spark.read.load("data/purchases.parquet")

df.select('client_id', 'product_id').limit(20).write.csv('c_id_to_p_id.csv')

在我运行了上面的代码之后,我得到了以下信息(完整的消息太长,无法发布)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anton/.local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1115, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:46655)
Traceback (most recent call last):
  File "/home/anton/.local/lib/python3.6/site-packages/py4j/java_gateway.py", line 977, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anton/.local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1115, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused

我的Parquet文件架构如下所示

StructType( 
 List( 
  StructField(client_id,StringType,true),
  ...
  StructField(product_id,StringType,true),
  ...
))

有人遇到过这样的问题吗?谢谢你的帮助

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题