Read each JSON object as a single row in a DataFrame using PySpark?

dphi5xsq  posted on 2021-05-27  in  Spark

I have the following JSON file:

{"name":"John", "age":31, "city":"New York"}
{"name":"Henry", "age":41, "city":"Boston"}
{"name":"Dave", "age":26, "city":"New York"}

So I need each JSON line to be read as a single row in a DataFrame.
This is the expected output:

I tried the following code:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Read Json') \
    .getOrCreate()

df = spark.read.format('json').load('sample_json')
df.show()

But the only output I get is the following:

Please help. Thanks in advance.


ajsxfq5m1#

Read the file as JSON, then use the `to_json` function to create `Json_column`.

1. Using the `to_json` function:
```
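from pyspark.sql.functions import col, struct, to_json

# Read the JSON Lines file, then serialize the three columns back into one JSON string column.
spark.read.json("sample.json") \
    .withColumn("Json_column", to_json(struct(col("age"), col("city"), col("name")))) \
    .show(10, False)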

# +---+--------+-----+------------------------------------------+
# |age|city    |name |Json_column                               |
# +---+--------+-----+------------------------------------------+
# |31 |New York|John |{"age":31,"city":"New York","name":"John"}|
# |41 |Boston  |Henry|{"age":41,"city":"Boston","name":"Henry"} |
# |26 |New York|Dave |{"age":26,"city":"New York","name":"Dave"}|
# +---+--------+-----+------------------------------------------+

# Or, more dynamically, build the struct from every column in the DataFrame:

df=spark.read.json("sample.json")
df.withColumn("Json_column",to_json(struct([col(c) for c in df.columns]))).show(10,False)

# +---+--------+-----+------------------------------------------+
# |age|city    |name |Json_column                               |
# +---+--------+-----+------------------------------------------+
# |31 |New York|John |{"age":31,"city":"New York","name":"John"}|
# |41 |Boston  |Henry|{"age":41,"city":"Boston","name":"Henry"} |
# |26 |New York|Dave |{"age":26,"city":"New York","name":"Dave"}|
# +---+--------+-----+------------------------------------------+

```

2. Another approach, using the `get_json_object` function: read the JSON file as text, then create the `name`, `age`, and `city` columns by extracting them from the JSON object in each line.

from pyspark.sql.functions import *
spark.read.text("sample.json").\
withColumn("name",get_json_object(col("value"),"$.name")).\
withColumn("city",get_json_object(col("value"),"$.city")).\
withColumn("age",get_json_object(col("value"),"$.age")).\
withColumnRenamed("value","Json_column").\
select("age","city","name","Json_column").\
show(10,False)

# +---+--------+-----+--------------------------------------------+
# |age|city    |name |Json_column                                 |
# +---+--------+-----+--------------------------------------------+
# |31 |New York|John |{"name":"John", "age":31, "city":"New York"}|
# |41 |Boston  |Henry|{"name":"Henry", "age":41, "city":"Boston"} |
# |26 |New York|Dave |{"name":"Dave", "age":26, "city":"New York"}|
# +---+--------+-----+--------------------------------------------+
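
To make the row shape of the second approach concrete, here is a minimal plain-Python sketch of the same idea, without Spark: treat each line of the file as one record, keep the raw line, and pull the fields out of it. It uses only the standard `json` module, and `SAMPLE_LINES` and `line_to_row` are hypothetical names introduced for illustration. One difference to keep in mind: `json.loads` preserves the integer type of `age`, whereas `get_json_object` returns string values.

```python
import json

# The three sample records from the question, one JSON object per line.
SAMPLE_LINES = [
    '{"name":"John", "age":31, "city":"New York"}',
    '{"name":"Henry", "age":41, "city":"Boston"}',
    '{"name":"Dave", "age":26, "city":"New York"}',
]


def line_to_row(line):
    """Hypothetical helper: parse one JSON line and keep the raw text next to the fields."""
    record = json.loads(line)
    return {
        "age": record.get("age"),
        "city": record.get("city"),
        "name": record.get("name"),
        "Json_column": line,  # the untouched original line
    }


# One row per non-empty line, mirroring one DataFrame row per JSON object.
rows = [line_to_row(line) for line in SAMPLE_LINES if line.strip()]

for row in rows:
    print(row)
```

Because the raw line is kept as-is, `Json_column` here preserves the original key order and spacing, just as in the text-based Spark output above.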
