PySpark: why are all extracted JSON values null?

0x6upsns  posted on 2022-11-28  in  Spark

I have a CSV file with a column named "jsonColumn". Here is a sample of the data.

jsonColumn
{"page":"mainpage","_timestamp":"2022-11-22T10:47:45.8060+01:00","object":"object1","destination":"destination1","subObject":"subObject1","type":"event"}
...

Now I want to extract a few fields from jsonColumn. The expected result is:

_timestamp,page,object,subObject
2022-11-22T10:47:45.8060+01:00,mainpage,object1,subObject1
...

This is the code I am using, but why are the values of all extracted fields null?

%python
from pyspark.sql import SparkSession 
from pyspark.sql.functions import get_json_object

spark=SparkSession.builder.appName('practice').getOrCreate()

df2 = spark.read.csv('/FileStore/test1.csv', header=True)

df2_extractJSON = df2.withColumn("_timestamp", get_json_object(df2.jsonColumn, "$._timestamp"))\
                     .withColumn("page", get_json_object(df2.jsonColumn, "$.page"))\
                     .withColumn("object", get_json_object(df2.jsonColumn, "$.object"))\
                     .withColumn("subObject", get_json_object(df2.jsonColumn, "$.subObject"))

                     
df2_extractJSON.show()

All of the extracted values come back as null.

The original DataFrame is not empty: see jsonColumn in the screenshot below, which is populated.

iqjalb3h1#

Well, I ran your code and it works fine for me:

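from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

# Reuse the active session (notebooks usually predefine it as `spark`)
spark = SparkSession.builder.getOrCreate()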

df2 = spark.createDataFrame([('''{"page":"mainpage","_timestamp":"2022-11-22T10:47:45.8060+01:00","object":"object1","destination":"destination1","subObject":"subObject1","type":"event"}''',)], "jsonColumn: string")

df2_extractJSON = df2.withColumn("_timestamp", get_json_object(df2.jsonColumn, "$._timestamp"))\
                     .withColumn("page", get_json_object(df2.jsonColumn, "$.page"))\
                     .withColumn("object", get_json_object(df2.jsonColumn, "$.object"))\
                     .withColumn("subObject", get_json_object(df2.jsonColumn, "$.subObject"))

                     
df2_extractJSON.show()
+--------------------+--------------------+--------+-------+----------+
|          jsonColumn|          _timestamp|    page| object| subObject|
+--------------------+--------------------+--------+-------+----------+
|{"page":"mainpage...|2022-11-22T10:47:...|mainpage|object1|subObject1|
+--------------------+--------------------+--------+-------+----------+

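Have you checked that df2 actually read the data in correctly? It may be that the initial DataFrame is empty, which would explain the NULLs you get when extracting the fields.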
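One way df2 could be non-empty and still produce nulls (an assumption, not something confirmed in this thread): the JSON in the CSV file is not quoted, so a CSV parser cuts it at the first comma and jsonColumn ends up holding only a fragment that is not valid JSON. The sketch below illustrates the idea with Python's standard `csv` module as a stand-in for Spark's CSV reader, whose behavior may differ in detail. `raw_csv` and `is_valid_json` are illustrative names, and the file contents are reconstructed from the sample in the question.

```python
import csv
import io
import json

# Hypothetical file contents mirroring the question: a header row followed by
# one unquoted JSON object per line.
raw_csv = (
    'jsonColumn\n'
    '{"page":"mainpage","_timestamp":"2022-11-22T10:47:45.8060+01:00",'
    '"object":"object1","destination":"destination1",'
    '"subObject":"subObject1","type":"event"}\n'
)


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A CSV parser treats every comma in the unquoted JSON as a field separator,
# so the first cell holds only the beginning of the object.
rows = list(csv.reader(io.StringIO(raw_csv)))
first_cell = rows[1][0]
print("first CSV cell:", first_cell)
print("valid JSON?", is_valid_json(first_cell))

# Treating each line as a single string keeps the JSON object whole.
whole_line = raw_csv.splitlines()[1]
record = json.loads(whole_line)
print("subObject from whole line:", record["subObject"])
```

In PySpark, `df2.show(truncate=False)` and `df2.printSchema()` show what was actually loaded. If jsonColumn turns out to be cut off, reading the file with `spark.read.text` (which yields one string column named `value` per line, header line included) is one possible way to keep each JSON object intact before calling `get_json_object`.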
