Is there an Apache Arrow equivalent of the Spark Pandas UDF?

Posted by irlmq6kh on 2022-11-05 in Spark


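Spark provides several different ways to implement UDFs that consume and return Pandas DataFrames. I am currently using the co-grouped version, which takes two (co-grouped) Pandas DataFrames as input and returns a third.
To convert efficiently between Spark DataFrames and Pandas DataFrames, Spark uses the Apache Arrow memory layout, but going back and forth between Arrow and Pandas still requires a conversion. I would really like to access the Arrow data directly, since that is how I ultimately process the data inside the UDF (using Polars).
Going Spark -> Arrow -> Pandas -> Arrow (Polars) on the way in, and the reverse on the way out, seems wasteful.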

pandas

Source: https://stackoverflow.com/questions/71606278/is-there-an-apache-arrow-equivalent-of-the-spark-pandas-udf