Load large Excel files in Databricks using PySpark from an ADLS mount

Posted by q5lcpyga on 2023-02-09 in Apache


We are trying to load a large Excel file from a mounted Azure Data Lake location using PySpark on Databricks.
We have tried loading it with both pyspark.pandas and spark-excel, without much success.

PySpark.Pandas

import pyspark.pandas as ps
df = ps.read_excel("dbfs:/mnt/aadata/ds/data/test.xlsx", engine="openpyxl")

We get a conversion error, shown below:

ArrowTypeError: Expected bytes, got a 'int' object
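
An ArrowTypeError like the one above typically indicates that a column holds mixed value types (for example strings and integers), which Arrow cannot convert to a single type. One possible workaround, sketched here, is to ask the reader to treat every cell as text via `dtype=str` and cast columns afterwards. The helper names `excel_read_kwargs` and `read_excel_as_text` are hypothetical, and nothing here is confirmed to fix the file in question.

```python
def excel_read_kwargs(sheet_name=0):
    """Keyword arguments that make read_excel return every column as text.

    Reading everything as str sidesteps mixed int/str columns, which are a
    common cause of "Expected bytes, got a 'int' object" in Arrow conversion.
    """
    return {"sheet_name": sheet_name, "engine": "openpyxl", "dtype": str}


def read_excel_as_text(path, sheet_name=0):
    """Read an Excel sheet with pyspark.pandas, keeping all columns as strings.

    Assumes a Spark environment (such as Databricks) with openpyxl installed.
    """
    # Imported lazily so this module can be loaded without Spark available.
    import pyspark.pandas as ps

    return ps.read_excel(path, **excel_read_kwargs(sheet_name))
```

On a cluster this might be called as `read_excel_as_text("dbfs:/mnt/aadata/ds/data/test.xlsx")`, with numeric columns cast explicitly afterwards. The workbook is still parsed by openpyxl in one piece, so memory pressure on very large files may remain.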

Spark-excel

df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .option("inferSchema", "false") \
        .load("dbfs:/mnt/aadata/ds/data/test.xlsx")

We can load smaller files, but larger files fail with the following error:

org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 185,568,653, but the maximum length for this record type is 100,000,000.
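
The RecordFormatException above comes from a safety limit in Apache POI on how large a single record may be. The spark-excel README documents two reader options that may be relevant here: `maxByteArraySize`, which raises that limit, and `maxRowsInMemory`, which switches to a streaming reader. Whether they are available depends on the installed spark-excel version, so treat the following as a sketch under that assumption. The helper names `build_excel_options` and `read_large_excel` are hypothetical.

```python
def build_excel_options(header=True, max_rows_in_memory=None,
                        max_byte_array_size=None):
    """Build the option map for the spark-excel reader.

    maxRowsInMemory and maxByteArraySize are only set when requested, since
    older spark-excel releases may not support them.
    """
    options = {"header": str(header).lower(), "inferSchema": "false"}
    if max_rows_in_memory is not None:
        # Streaming reader: keeps only this many rows in memory at a time.
        options["maxRowsInMemory"] = str(max_rows_in_memory)
    if max_byte_array_size is not None:
        # Raises the POI array-size limit named in the error message.
        options["maxByteArraySize"] = str(max_byte_array_size)
    return options


def read_large_excel(spark, path, **kwargs):
    """Load an Excel file through spark-excel using the options above."""
    reader = spark.read.format("com.crealytics.spark.excel")
    return reader.options(**build_excel_options(**kwargs)).load(path)
```

For the file in the question, a call might look like `read_large_excel(spark, "dbfs:/mnt/aadata/ds/data/test.xlsx", max_rows_in_memory=1000, max_byte_array_size=2147483647)`. The specific values are illustrative, not tuned.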

Is there any other way to load Excel files in Databricks with PySpark?

apache-spark

Source: https://stackoverflow.com/questions/75393832/load-large-excel-files-in-databricks-using-pyspark-from-an-adls-mount