python-3.x: How to read xlsx or xls files as a Spark DataFrame

Posted by e7arh2l6 on 2023-02-01 in Python


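Can anyone tell me how to read xlsx or xls files as a Spark DataFrame without converting them first?
I have already tried reading the file with pandas and then converting it to a Spark DataFrame, but I got the error below.
Error: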

Cannot merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>

Code:

import pandas
import os
df = pandas.read_excel('/dbfs/FileStore/tables/BSE.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)

python-3.x

Source: https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe

7 Answers


bnlyeluc1#

Building on the answers from @matkurek and @Peter Pan, here is a general version updated as of April 2021.

Spark

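Install the following 2 libraries on your Databricks cluster:
1. Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5
2. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd
You will then be able to read your Excel file as follows: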

sparkDF = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load(filePath)
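
The `dataAddress` value above combines a quoted sheet name with a starting cell. As a minimal sketch, a small helper can build that string so the quoting does not have to be retyped by hand. The name `excel_data_address` is hypothetical and is not part of spark-excel; it only formats a string in the same shape as the option value used in this answer.

```python
def excel_data_address(sheet_name, start_cell="A1"):
    """Build a dataAddress string such as 'Sheet1'!A1.

    Hypothetical helper for illustration only: it formats the string in the
    same shape as the option value used in the answer above.
    """
    if not sheet_name:
        raise ValueError("sheet_name must not be empty")
    # The sheet name is wrapped in single quotes, followed by '!' and the cell.
    return f"'{sheet_name}'!{start_cell}"


address = excel_data_address("NameOfYourExcelSheet")
print(address)  # 'NameOfYourExcelSheet'!A1
```

The result can then be passed along as `.option("dataAddress", address)` in the reader call.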

Pandas

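Install the following 2 libraries on your Databricks cluster:
1. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd
2. Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: openpyxl
You will then be able to read your Excel file as follows: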

import pandas as pd
pandasDF = pd.read_excel(io=filePath, engine='openpyxl', sheet_name='NameOfYourExcelSheet')

Note that you end up with two different objects: a Spark DataFrame in the first scenario and a pandas DataFrame in the second.


hs1ihplo2#

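As @matkurek mentioned, you can read the file directly from Excel. This is in fact a better practice than going through pandas, because with pandas the benefits of Spark are lost.
You can run the same code sample as defined above; you only need to add the required package to the SparkSession configuration.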

from pyspark.sql import SparkSession

# Pull the spark-excel package when the session starts
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

Then you can read your Excel file.

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load("your_file")
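
The package coordinate has the form group:artifact_<Scala version>:<library version>, and the Scala suffix needs to match the Scala build of your Spark installation (2.11 in this answer, 2.12 in the Databricks answers). A hedged sketch of assembling it; the helper name `spark_excel_coordinate` is hypothetical, and the group and artifact names are simply those quoted in this thread.

```python
def spark_excel_coordinate(scala_version, library_version):
    """Return a Maven coordinate like com.crealytics:spark-excel_2.12:0.13.5.

    Hypothetical helper for illustration; the group and artifact names are
    taken from the coordinates quoted in this thread.
    """
    return f"com.crealytics:spark-excel_{scala_version}:{library_version}"


coordinate = spark_excel_coordinate("2.11", "0.12.2")
print(coordinate)  # com.crealytics:spark-excel_2.11:0.12.2
```

The resulting string is what would go into `.config("spark.jars.packages", coordinate)`.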


vohkndzv3#

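Your post does not include your Excel data, but I reproduced the same problem.
My sample Excel file, test.xlsx, has a column B that contains different data types: the double value 2.2 and the string value C.
If I run the code below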

import pandas

df = pandas.read_excel('test.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)

it returns the same error as yours.
TypeError: field B: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>

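If we check the column dtypes with df.dtypes, column B shows up with dtype object. The spark.createDataFrame function cannot infer the real data type of column B from the actual data, so the fix is to pass a schema that tells it the data type of column B, as in the code below.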

from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([StructField("A", DoubleType(), True), StructField("B", StringType(), True)])
sdf = spark.createDataFrame(df, schema=schema)

This forces column B to StringType, which resolves the data type conflict.
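
To find out which columns need an explicit type before calling spark.createDataFrame, it can help to list the columns that hold more than one Python type. The sketch below is an illustration only: `find_mixed_type_columns` is a hypothetical helper that works on a plain dict of column name to list of values, and the sample data merely mimics the test.xlsx described above.

```python
def find_mixed_type_columns(columns):
    """Return {column name: sorted type names} for columns with mixed types.

    `columns` maps each column name to a list of cell values.
    None values are ignored; everything else is grouped by its Python type.
    """
    mixed = {}
    for name, values in columns.items():
        type_names = {type(v).__name__ for v in values if v is not None}
        if len(type_names) > 1:
            mixed[name] = sorted(type_names)
    return mixed


# Sample data shaped like the sheet described above (values are illustrative).
sample = {"A": [1.1, 3.3], "B": [2.2, "C"]}
mixed = find_mixed_type_columns(sample)
print(mixed)  # {'B': ['float', 'str']}
```

With a pandas DataFrame, the same check could presumably be fed from `df.to_dict("list")`; any column it reports is a candidate for StringType in the schema.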


baubqpgj4#

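You can read Excel files through Spark's read function. This requires a Spark plugin; to install it on Databricks, go to:
Clusters > your cluster > Libraries > Install New > select Maven and paste com.crealytics:spark-excel_2.12:0.13.5 into "Coordinates"
After that, you can read the file as follows: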

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load(filePath)


kg7wmglp5#

Just open the xlsx or xlsm file in pandas first, and then in Spark:
import pandas as pd
pdf = pd.read_excel('file.xlsx', engine='openpyxl')
# spark_session is your SparkSession; casting every column to str avoids mixed-type columns
df = spark_session.createDataFrame(pdf.astype(str))


cwxwcias6#

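The configuration and code below let me read an Excel file into a PySpark DataFrame. Prerequisites before running the Python code:
Install the Maven library on your Databricks cluster.
Maven library name and version: com.crealytics:spark-excel_2.12:0.13.5
Databricks Runtime: 9.0 (includes Apache Spark 3.1.2 and Scala 2.12)
Run the following code in a Python notebook to load the Excel file into a PySpark DataFrame: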

sheetAddress = "'<enter sheetname>'!A1"
filePath = "<enter excel file full path>"
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("dataAddress", sheetAddress) \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "true") \
    .load(filePath)


bvn4nwqk7#

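Steps to read .xls/.xlsx files from Azure Blob Storage into a Spark DF

You can read Excel files located in Azure Blob Storage into a PySpark DataFrame with the help of a library called spark-excel (also referred to as com.crealytics.spark.excel).
1. Install the library using either the UI or the Databricks CLI (Cluster settings page > Libraries > Install New; make sure to select Maven).
2. After installing the library, you need proper credentials to access Azure Blob Storage. You can provide the access key under Cluster settings page > Advanced options > Spark config.
Example: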

spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net <access key>

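Note: if you are the cluster owner, you can provide the access key as a secret instead of in plain text, as mentioned in the docs.
3. Restart the cluster. You can then use the code below to read an Excel file located in Blob Storage.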

filePath = "wasbs://<container-name>@<storage-account>.blob.core.windows.net/MyFile1.xls"

DF = spark.read.format("excel").option("header", "true").option("inferSchema", "true").load(filePath)

display(DF)

PS: spark.read.format("excel") is the V2 approach, whereas spark.read.format("com.crealytics.spark.excel") is the V1 approach; the original answer links to further reading on the difference.
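
The storage path and the config entry in the steps above follow fixed string patterns, so they can be assembled from the container and account names. This is a minimal sketch under stated assumptions: the helper names `wasbs_path` and `account_key_conf_name` are hypothetical, and "mycontainer" and "myaccount" are placeholder values rather than real resources.

```python
def wasbs_path(container, storage_account, blob_path):
    """Build a wasbs:// URL in the shape used in the answer above."""
    return (
        f"wasbs://{container}@{storage_account}"
        f".blob.core.windows.net/{blob_path.lstrip('/')}"
    )


def account_key_conf_name(storage_account):
    """Name of the Spark config entry that holds the storage access key."""
    return (
        "spark.hadoop.fs.azure.account.key."
        f"{storage_account}.blob.core.windows.net"
    )


# Placeholder names for illustration only.
file_path = wasbs_path("mycontainer", "myaccount", "MyFile1.xls")
print(file_path)
print(account_key_conf_name("myaccount"))
```

The resulting `file_path` is what would be passed to `.load(...)`, and the config name is the key that goes into the cluster's Spark config next to the access key or secret reference.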

