Apache Spark: using a tuple of tuples as DataFrame data results in AttributeError: 'tuple' object has no attribute 'encode'

Posted by nhjlsmyf on 2022-12-04 in Apache


Suppose I have this sample data:

sdata = [(1,(10,20,30)),
         (2,(100,20)),
         (3,(100,200,300))]

columns = [('Sn','Products')]
df1 = spark.createDataFrame(([x[0],*x[1]] for x in sdata), schema=columns)

This raises the error:
AttributeError: 'tuple' object has no attribute 'encode'
How can I load this variable-length data?

apache-spark

Source: https://stackoverflow.com/questions/74670430/tuple-of-tuple-as-data-in-dataframe-results-in-attributeerror-tuple-object-ha

2 Answers

vc6uscn91#

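A tuple can be represented as a StructType, but a StructType has a fixed set of fields, so I am not sure how it would handle "variable-length" tuples. If what you need is a collection type that supports a variable number of elements, you can define an explicit schema with an ArrayType column: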

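from pyspark.sql.types import ArrayType, LongType, StructField, StructType

sdata = [(1,(10,20,30)),
         (2,(100,20)),
         (3,(100,200,300))]

# Products is declared as an array, so each row may hold a different number of elements
schema = StructType([
  StructField('Sn', LongType()),
  StructField('Products', ArrayType(LongType())),
])

df1 = spark.createDataFrame(sdata, schema=schema)
df1.show()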

[Out]:
+---+---------------+
| Sn|       Products|
+---+---------------+
|  1|   [10, 20, 30]|
|  2|      [100, 20]|
|  3|[100, 200, 300]|
+---+---------------+

Or supply the field as a list directly, so that it is loaded as an array column:

sdata = [(1,[10,20,30]),
         (2,[100,20]),
         (3,[100,200,300])]

columns = ['Sn','Products']

df1 = spark.createDataFrame(sdata, schema=columns)

[Out]:
+---+---------------+
| Sn|       Products|
+---+---------------+
|  1|   [10, 20, 30]|
|  2|      [100, 20]|
|  3|[100, 200, 300]|
+---+---------------+
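If the data already arrives as nested tuples, as in the question, the inner tuples can be turned into lists before using this second variant. The sketch below is plain Python and needs no Spark session; the name `rows` is only illustrative. It rests on the assumption that, without an explicit schema, PySpark infers a Python list as an array column but a tuple as a struct, which is why the conversion matters here.

```python
# Sample data from the question: the second element of each row is a tuple
sdata = [(1, (10, 20, 30)),
         (2, (100, 20)),
         (3, (100, 200, 300))]

# Convert each inner tuple to a list so it can be loaded as an array column
rows = [(sn, list(products)) for sn, products in sdata]

columns = ['Sn', 'Products']

for row in rows:
    print(row)

# With an active SparkSession named `spark`, the rows could then be loaded as:
# df1 = spark.createDataFrame(rows, schema=columns)
```

The outer tuples are left as they are, since each one still represents a single row with two fields.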

Answered on 2022-12-04

siv3szwd2#

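To load variable-length data into a PySpark DataFrame, you can use the ArrayType class from the pyspark.sql.types module when defining the DataFrame's schema. ArrayType takes the data type of the array's elements, and a column declared with it can hold a different number of elements in each row.
The following example shows how to use ArrayType to define the schema of a DataFrame that contains variable-length data: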

# Import the types needed for an explicit schema
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

# Define the sample data
sdata = [(1,(10,20,30)),
         (2,(100,20)),
         (3,(100,200,300))]

# Use ArrayType to declare Products as a variable-length array column
schema = StructType([
    StructField('Sn', IntegerType()),
    StructField('Products', ArrayType(IntegerType())),
])

# Create the DataFrame with the defined schema; pass the rows unchanged
# (unpacking x[1] would give each row a different number of columns)
df1 = spark.createDataFrame(sdata, schema=schema)

# Print the schema of the DataFrame
df1.printSchema()
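
To see why the attempt in the question fails, the two suspicious pieces can be inspected in plain Python, without Spark. This is a rough diagnostic sketch; the names `flattened` and `row_lengths` are illustrative. The explanation of the error message assumes that PySpark treats every entry of a list schema as a column-name string and calls `.encode` on entries that are not `str`, which matches the message but is an assumption about internals.

```python
sdata = [(1, (10, 20, 30)),
         (2, (100, 20)),
         (3, (100, 200, 300))]

# Schema exactly as written in the question: a list holding ONE tuple,
# not a list of two column-name strings
columns = [('Sn', 'Products')]
print(len(columns), type(columns[0]).__name__)
print(hasattr(columns[0], 'encode'))  # tuples have no .encode method

# The generator in the question unpacks the inner tuple into the row,
# so the rows end up with different numbers of fields
flattened = [[x[0], *x[1]] for x in sdata]
row_lengths = [len(row) for row in flattened]
print(row_lengths)
```

So there are two separate problems: the schema entry is a tuple rather than a string, and the flattened rows no longer share a common number of columns. Passing `sdata` unchanged with an ArrayType column avoids both.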

Answered on 2022-12-04
