Azure Databricks throws an error when parallelizing over a pandas DataFrame. The code creates the RDD, but breaks when .collect() is executed
Setup:
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
my_df = pd.DataFrame(data, columns = ['Name', 'Age'])
def testfn(i):
    return my_df.iloc[i]
test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
print (test_var)
Error:
Py4JJavaError Traceback (most recent call last)
<command-2941072546245585> in <module>
1 def testfn(i):
2 return my_df.iloc[i]
----> 3 test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
4 print (test_var)
/databricks/spark/python/pyspark/rdd.py in collect(self)
901 # Default path used in OSS Spark / for non-credential passthrough clusters:
902 with SCCallSiteSync(self.context) as css:
--> 903 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
904 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
905
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
125 def deco(*a, **kw):
126 try:
--> 127 return f(*a, **kw)
128 except py4j.protocol.Py4JJavaError as e:
129 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 3845.0 failed 4 times, most recent failure: Lost task 16.3 in stage 3845.0 : org.apache.spark.api.python.PythonException: 'AttributeError: 'DataFrame' object has no attribute '_data'', from <command-2941072546245585>, line 2. Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 654, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 646, in process
serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 279, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/databricks/spark/python/pyspark/util.py", line 109, in wrapper
return f(*args, **kwargs)
File "<command-2941072546245585>", line 2, in testfn
File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 1767, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
self._validate_integer(key, axis)
File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2060, in _validate_integer
len_axis = len(self.obj._get_axis(axis))
File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 424, in _get_axis
return getattr(self, name)
File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'
Version details:
Spark: "3.0.0", Python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
2 Answers
uoifb46i1#
I've seen errors like this when the driver and the executors have different versions of pandas installed. In my case the driver had pandas 1.1.0 (via databricks-connect), while the executors were running on Databricks Runtime 7.3 with pandas 1.0.1. pandas 1.1.0 changed a lot internally (among other things, the internal `_data` attribute was renamed to `_mgr`, which is why a DataFrame pickled under 1.1.0 has no `_data` when unpickled under 1.0.x), so the objects the driver serialized and sent to the executors broke on arrival. Check that your executors and driver have the same pandas version (the release notes list which pandas version each Databricks Runtime ships with). You can compare the Python library versions on the executors and the driver with a script like the sketch below.
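A minimal sketch of such a check, assuming `sc` is the SparkContext that Databricks notebooks provide (`executor_pandas_version` is an illustrative name, not part of any API). It only ships plain version strings back, so it works even when the installed versions disagree:
import pandas as pd
def executor_pandas_version(_):
    # Import inside the function so the lookup happens on the executor, not the driver
    import pandas
    return pandas.__version__
# One task per default-parallelism slot, so several executors get probed
executor_versions = set(
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(executor_pandas_version)
      .collect()
)
print("driver:   ", pd.__version__)
print("executors:", executor_versions)
If the two printed versions differ, align them before retrying the original job.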
mwecs4sa2#
I ran into the same problem. I also think it was a difference in pandas versions: I fixed the bug by upgrading pandas from 1.0.1 to 1.0.5.
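One way to apply that kind of fix on Databricks (a sketch, not necessarily the answerer's exact steps) is a notebook-scoped install via the %pip magic, which Databricks Runtime 7.x supports; the version pinned here just mirrors this answer:
# Run in a cell at the top of the notebook; installs pandas for this notebook only
%pip install pandas==1.0.5
Pin whichever version matches what the cluster's runtime actually ships, per the release notes mentioned in the first answer.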