When converting a PySpark DataFrame to a pandas DataFrame with toPandas, I get the following error:
File "/usr/local/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2121, in toPandas batches = self._collectAsArrow()
File "/usr/local/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2179, in _collectAsArrow return list(_load_from_socket(sock_info, ArrowStreamSerializer()))
File "/usr/local/lib/python3.7/site-packages/pyspark/rdd.py", line 144, in _load_from_socket (sockfile, sock) = local_connect_and_auth(*sock_info)
TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
I am using pyarrow. For this, I added the following settings to my SparkConf:
.set("spark.sql.execution.arrow.enabled", "true")
.set("spark.sql.execution.arrow.fallback.enabled", "true")
.set("spark.sql.execution.arrow.maxRecordsPerBatch", 5000)
In this process, I am reading about 80 million rows.
Note: this error only appears when I enable Arrow.
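A quick way to confirm this, assuming the spark session and df from the sketch above, is to toggle the Arrow flag at runtime:

# Disabling Arrow makes toPandas() fall back to the plain collect() path,
# which avoids the TypeError (at the cost of a much slower conversion).
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = df.toPandas()  # completes without the error, just slowly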