JSON to Avro to JSON

ntjbwcob  posted on 2021-05-19  in  Spark

I am trying to convert a JSON file to Avro and then back to JSON.
My input file is:

[
  {
    "userId": 1,
    "firstName": "Krish",
    "lastName": "Lee",
    "phoneNumber": "123456",
    "emailAddress": "krish.lee@abc.com"
  },
  {
    "userId": 2,
    "firstName": "racks",
    "lastName": "jacson",
    "phoneNumber": "123456",
    "emailAddress": "racks.jacson@abc.com"
  }
]

My output file is:

{"emailAddress":"krish.lee@abc.com","firstName":"Krish","lastName":"Lee","phoneNumber":"123456","userId":1}
{"emailAddress":"racks.jacson@abc.com","firstName":"racks","lastName":"jacson","phoneNumber":"123456","userId":2}

Here is my source code.
JSON to Avro:

val df = spark.read.option("multiLine", true).json("src\\main\\resources\\user.json")
df.printSchema()
df.show()

//convert to avro
df.write.mode("append").format("com.databricks.spark.avro").save("src\\main\\resources\\user1")

Avro to JSON:

val jsonDF = spark.read
  .format("com.databricks.spark.avro").load("src\\main\\resources\\user")

jsonDF.show()
jsonDF.printSchema()
jsonDF.write.mode(SaveMode.Overwrite).json("src\\main\\resources\\output\\json")

Can you help?

7vhp5slm  #1

Check the code below.
Input data:

scala> import sys.process._

scala> "cat /root/spark-examples/data.json".!
[
  {
    "userId": 1,
    "firstName": "Krish",
    "lastName": "Lee",
    "phoneNumber": "123456",
    "emailAddress": "krish.lee@abc.com"
  },
  {
    "userId": 2,
    "firstName": "racks",
    "lastName": "jacson",
    "phoneNumber": "123456",
    "emailAddress": "racks.jacson@abc.com"
  }
]

Loading the JSON file content into a DataFrame:
scala> val df = spark
.read
.option("multiline","true")
.json("/root/spark-examples/data.json")

df: org.apache.spark.sql.DataFrame = [emailAddress: string, firstName: string ... 3 more fields]

Once the JSON file is loaded into a DataFrame, the `array of object` is converted into `multiple objects or rows`, as shown below.

scala> df.show(false)
+--------------------+---------+--------+-----------+------+
|emailAddress        |firstName|lastName|phoneNumber|userId|
+--------------------+---------+--------+-----------+------+
|krish.lee@abc.com   |Krish    |Lee     |123456     |1     |
|racks.jacson@abc.com|racks    |jacson  |123456     |2     |
+--------------------+---------+--------+-----------+------+

When you write the `DataFrame` back out, Spark writes it as multiple lines, one JSON object per row.

scala> df.repartition(1).write.mode("overwrite").json("/tmp/dataa/")

scala> "ls -ltr /tmp/dataa/".!
total 4
-rw-r--r-- 1 root root 222 Oct 22 12:19 part-00000-fa9e79f6-2689-4385-b3ee-fd19cf291a31-c000.json
-rw-r--r-- 1 root root 0 Oct 22 12:19 _SUCCESS

scala> "cat /tmp/dataa/part-00000-fa9e79f6-2689-4385-b3ee-fd19cf291a31-c000.json".!
{"emailAddress":"krish.lee@abc.com","firstName":"Krish","lastName":"Lee","phoneNumber":"123456","userId":1}
{"emailAddress":"racks.jacson@abc.com","firstName":"racks","lastName":"jacson","phoneNumber":"123456","userId":2}
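The difference between the two layouts can be illustrated without Spark. The following is a minimal sketch in plain Scala, using the two records from the output above; the object name `JsonLayouts` and its helper methods are hypothetical names introduced only for illustration.

```scala
// Hypothetical standalone sketch: plain Scala strings, no Spark required.
object JsonLayouts {
  // Two already-serialized records, copied from the writer output above.
  val records: Seq[String] = Seq(
    """{"emailAddress":"krish.lee@abc.com","firstName":"Krish","lastName":"Lee","phoneNumber":"123456","userId":1}""",
    """{"emailAddress":"racks.jacson@abc.com","firstName":"racks","lastName":"jacson","phoneNumber":"123456","userId":2}"""
  )

  // One object per line: the layout Spark's json writer produces.
  def asJsonLines(rows: Seq[String]): String = rows.mkString("\n")

  // A single JSON array: the layout of the original input file.
  def asJsonArray(rows: Seq[String]): String = rows.mkString("[", ",", "]")
}
```

Both strings carry the same records; only the framing around them differs, which is why the round trip looks different from the input even though no data is lost.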

If you want the output to have the same shape as your input data, use the following code.

scala> df
.select(to_json(collect_list(struct($"*"))).as("data"))
.write
.format("text") // Use the text format: the json writer would wrap this string as {"data":"[...]"} with escaped quotes.
.mode("overwrite")
.save("/tmp/datab/")

scala> "ls -ltr /tmp/datab/".!
total 4
-rw-r--r-- 1 root root 224 Oct 22 12:19 part-00000-0896730e-51e1-4728-bd6b-cdfabc03978e-c000.txt
-rw-r--r-- 1 root root 0 Oct 22 12:19 _SUCCESS

scala> "cat /tmp/datab/part-00000-0896730e-51e1-4728-bd6b-cdfabc03978e-c000.txt".!
[{"emailAddress":"krish.lee@abc.com","firstName":"Krish","lastName":"Lee","phoneNumber":"123456","userId":1},{"emailAddress":"racks.jacson@abc.com","firstName":"racks","lastName":"jacson","phoneNumber":"123456","userId":2}]
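The steps above skip the Avro stage from the question, so here is a rough sketch of the whole round trip as a standalone application. It rests on several assumptions: the short format name `avro` is used in place of the question's `com.databricks.spark.avro`, which requires Spark 2.4 or later with the external `spark-avro` module on the classpath; the paths under `/tmp/roundtrip/` and the object name `JsonAvroRoundTrip` are hypothetical. Note also that the question writes Avro to `user1` but reads it back from `user`; this sketch uses one directory for both steps.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct, to_json}

// Hypothetical sketch of JSON -> Avro -> JSON array; paths are placeholders.
object JsonAvroRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-avro-roundtrip")
      .master("local[*]")
      .getOrCreate()

    val inputJson = "/tmp/roundtrip/user.json"  // assumed: the array-style input file
    val avroDir   = "/tmp/roundtrip/avro"       // assumed: same directory for write and read
    val outputDir = "/tmp/roundtrip/json-array" // assumed output location

    // multiLine is needed because the input is one JSON array spanning several lines.
    val df = spark.read.option("multiLine", "true").json(inputJson)

    // JSON -> Avro (assumes the spark-avro module is available).
    df.write.mode("overwrite").format("avro").save(avroDir)

    // Avro -> DataFrame.
    val fromAvro = spark.read.format("avro").load(avroDir)

    // Collapse all rows into one JSON array string and write it as plain text.
    fromAvro
      .select(to_json(collect_list(struct(col("*")))).as("data"))
      .write
      .mode("overwrite")
      .text(outputDir)

    spark.stop()
  }
}
```

Because `collect_list` gathers every row into a single value, this approach is only reasonable for small datasets like the one in the question; for larger data the one-object-per-line output is the safer choice.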
