我意识到,在Google Colab环境中缓存PySpark Dataframe 后,使用任何方法(如show()或任何其他方法)都会出现以下错误:
例如:
df.show(5)
---------------------------------------------------------------------------
ConnectionRefusedError Traceback (most recent call last)
/tmp/ipykernel_26/1842469281.py in <module>
----> 1 df.show(5)
/opt/conda/lib/python3.7/site-packages/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
604
605 if isinstance(truncate, bool) and truncate:
--> 606 print(self._jdf.showString(n, 20, vertical))
607 else:
608 try:
/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
1318 proto.END_COMMAND_PART
1319
-> 1320 answer = self.gateway_client.send_command(command)
1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py in send_command(self, command, retry, binary)
1034 if `binary` is `True`.
1035 """
-> 1036 connection = self._get_connection()
1037 try:
1038 response = connection.send_command(command)
/opt/conda/lib/python3.7/site-packages/py4j/clientserver.py in _get_connection(self)
282
283 if connection is None or connection.socket is None:
--> 284 connection = self._create_new_connection()
285 return connection
286
/opt/conda/lib/python3.7/site-packages/py4j/clientserver.py in _create_new_connection(self)
289 self.java_parameters, self.python_parameters,
290 self.gateway_property, self)
--> 291 connection.connect_to_java_server()
292 self.set_thread_connection(connection)
293 return connection
/opt/conda/lib/python3.7/site-packages/py4j/clientserver.py in connect_to_java_server(self)
436 self.socket = self.ssl_context.wrap_socket(
437 self.socket, server_hostname=self.java_address)
--> 438 self.socket.connect((self.java_address, self.java_port))
439 self.stream = self.socket.makefile("rb")
440 self.is_connected = True
ConnectionRefusedError: [Errno 111] Connection refused
我是Spark/PySpark的新手,不明白为什么会发生这种情况。是因为我没有使用合适的集群吗?
1条答案
按热度按时间hc2pp10m1#
内存中似乎没有足够的空间来缓存 Dataframe !这是由于JVM中的内存溢出而导致的rdd长链接错误。
我不确定你是否可以增加google collab的内存,所以要么在google collab中使用较小的文件,要么在本地测试是否有足够的内存。