How to convert a list of dictionaries to a PySpark DataFrame

o4tp2gmn · published 2021-07-14 in Spark

I want to convert a list of dictionaries into a DataFrame. Here is the list:

mylist = 
[
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

Here is my code:

from pyspark.sql.types import StringType

df = spark.createDataFrame(mylist, StringType())

df.show(2,False)

+-----------------------------------------+
|                                    value|
+-----------------------------------------+
|{type_activity_id=1,type_activity_name=xxx}|
|{type_activity_id=2,type_activity_name=yyy}|
|{type_activity_id=3,type_activity_name=zzz}|
+-----------------------------------------+

I assume I should provide some mapping and types for each column, but I don't know how to do it.
Update:
I also tried this:

from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

schema = ArrayType(
    StructType([StructField("type_activity_id", IntegerType()),
                StructField("type_activity_name", StringType())
                ]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))

But then I get null values:

+-----+
|value|
+-----+
| null|
| null|
+-----+
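A likely reason for the nulls (my reading, not stated in the original post): `createDataFrame(mylist, StringType())` stores each dictionary's stringified form, which is not valid JSON (single quotes, `=` separators, unquoted keys), so `from_json` cannot parse it and falls back to null. A pure-Python sketch of the mismatch:

```python
import json

d = {"type_activity_id": 1, "type_activity_name": "xxx"}

# What lands in the 'value' column is close to str(d), a map-style
# rendering rather than a JSON document:
as_text = str(d)  # "{'type_activity_id': 1, 'type_activity_name': 'xxx'}"
try:
    json.loads(as_text)   # single quotes -> rejected by the JSON parser
    parsed = True
except json.JSONDecodeError:
    parsed = False        # this mirrors why from_json produces null

# Serializing with json.dumps first would round-trip cleanly:
assert json.loads(json.dumps(d)) == d
```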

toe95027 · answer 1

You can do it like this. You will get a DataFrame with 2 columns.

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]

myJson = sc.parallelize(mylist)
myDf = sqlContext.read.json(myJson)

Output:

+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+
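Note that this relies on Spark stringifying each RDD element; on some versions an RDD of Python dicts is rendered with single quotes and ends up as corrupt records. A safer variant (a sketch, assuming a running SparkSession named `spark`) serializes each dict to a proper JSON string first:

```python
import json

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# Spark's JSON reader expects one JSON document per RDD element,
# so serialize each dict explicitly before parallelizing.
json_strings = [json.dumps(d) for d in mylist]

# With a live session this would be:
# df = spark.read.json(spark.sparkContext.parallelize(json_strings))
```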

nkcskrwz · answer 2

In the past, you could just pass a list of dictionaries to spark.createDataFrame(), but this is now deprecated:

mylist = [
  {"type_activity_id":1,"type_activity_name":"xxx"},
  {"type_activity_id":2,"type_activity_name":"yyy"},
  {"type_activity_id":3,"type_activity_name":"zzz"}
]
df = spark.createDataFrame(mylist)

# UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead
#   warnings.warn("inferring schema from dict is deprecated,"

As the warning message says, you should use pyspark.sql.Row instead:

from pyspark.sql import Row
spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)

# +----------------+------------------+
# |type_activity_id|type_activity_name|
# +----------------+------------------+
# |1               |xxx               |
# |2               |yyy               |
# |3               |zzz               |
# +----------------+------------------+

Here I used ** (keyword argument unpacking) to pass each dictionary to the Row constructor.
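The ** unpacking is plain Python and works for any callable that accepts keyword arguments, not just Row; a minimal pure-Python sketch (make_row is a hypothetical stand-in for the Row constructor):

```python
def make_row(type_activity_id, type_activity_name):
    # Stand-in for pyspark.sql.Row: just capture the keyword arguments.
    return (type_activity_id, type_activity_name)

d = {"type_activity_id": 1, "type_activity_name": "xxx"}

# **d expands to type_activity_id=1, type_activity_name="xxx",
# so the dict keys must match the callable's parameter names.
row = make_row(**d)
assert row == (1, "xxx")
```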


jpfvwuh4 · answer 3

In Spark version 2.4, you can do this directly with df = spark.createDataFrame(mylist):

>>> mylist = [
...   {"type_activity_id":1,"type_activity_name":"xxx"},
...   {"type_activity_id":2,"type_activity_name":"yyy"},
...   {"type_activity_id":3,"type_activity_name":"zzz"}
... ]
>>> df1=spark.createDataFrame(mylist)
>>> df1.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|               1|               xxx|
|               2|               yyy|
|               3|               zzz|
+----------------+------------------+

zpf6vheq · answer 4

I also ran into the same problem when creating a dataframe from a list of dictionaries. I solved it with namedtuple.
Below is my code, using the provided data.

from collections import namedtuple
final_list = []
mylist = [{"type_activity_id":1,"type_activity_name":"xxx"},
          {"type_activity_id":2,"type_activity_name":"yyy"}, 
          {"type_activity_id":3,"type_activity_name":"zzz"}
         ]
ExampleTuple = namedtuple('ExampleTuple', ['type_activity_id', 'type_activity_name'])

for my_dict in mylist:
    namedtupleobj = ExampleTuple(**my_dict)
    final_list.append(namedtupleobj)

sqlContext.createDataFrame(final_list).show(truncate=False)

Output:

+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1               |xxx               |
|2               |yyy               |
|3               |zzz               |
+----------------+------------------+

My version information is as follows:

spark: 2.4.0
python: 3.6

It is not necessary to have the mylist variable; since it was available I used it to create the namedtuple objects, but the namedtuple objects could also be created directly.
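The loop above can also be collapsed into a list comprehension, with the field names derived from the first dict so the tuple and the dictionaries cannot drift apart; a pure-Python sketch (only the final DataFrame call needs Spark):

```python
from collections import namedtuple

mylist = [
    {"type_activity_id": 1, "type_activity_name": "xxx"},
    {"type_activity_id": 2, "type_activity_name": "yyy"},
    {"type_activity_id": 3, "type_activity_name": "zzz"},
]

# Field names come straight from the dict keys (insertion order
# is preserved on Python 3.7+).
ExampleTuple = namedtuple("ExampleTuple", list(mylist[0]))
final_list = [ExampleTuple(**d) for d in mylist]

# With a live session: sqlContext.createDataFrame(final_list).show(truncate=False)
```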
