from_json output saved as null when defined in schema as Int for Spark DataFrame

Posted by piok6c0g on 2023-04-08 in Spark

I am using from_json with a schema that is created via Encoders from a case class, but I only work with a DataFrame, not a Dataset, as follows:

case class MyProducts(PRODUCT_ID: Option[String], DESCRIPTION: Option[String], PRICE: Option[Int], OLD_FIELD_1: Option[String]) 
val ProductsSchema = Encoders.product[MyProducts].schema

val df_products_output_final = df_products_output.withColumn("parsedProducts", from_json(col("afterImage"), ProductsSchema))

1. When PRICE is defined as Int, I get a null value in that field.
2. When PRICE is defined as String, I get the String value in that field.

The DataFrame schema correctly shows PRICE as an integer.
What is the problem here?
The code:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import spark.implicits._
import org.apache.spark.sql.functions.{col, lit, when, from_json, map_keys, map_values, regexp_replace, coalesce}
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{MapType, StringType, StructType, IntegerType}

case class MyMeta(op: String, table: String)
val metaSchema = Encoders.product[MyMeta].schema
case class MySales(NUM: Option[Integer], PRODUCT_ID: Option[String], DESCRIPTION: Option[String], OLD_FIELD_1: Option[String]) 
val salesSchema = Encoders.product[MySales].schema
case class MyProducts(PRODUCT_ID: Option[String], DESCRIPTION: Option[String], PRICE: Option[Int], OLD_FIELD_1: Option[String]) 
val ProductsSchema = Encoders.product[MyProducts].schema

def getAfterImage (op: String, data: String, key: String, jsonOLD_TABLE_FIELDS: String) : String = {   
  val jsonOLD_FIELDS = parse(jsonOLD_TABLE_FIELDS)   
  val jsonData = parse(data)                         
  val jsonKey = parse(key)                           
   
  // The match is the last expression of the function, so its value is returned without an explicit return.
  op match {
    case "ins" =>
      compact(render(jsonData merge jsonOLD_FIELDS))
    case _ =>
      val Diff(changed, added, deleted) = jsonKey diff jsonData
      compact(render(changed merge deleted merge jsonOLD_FIELDS))
  }
}
val afterImage = spark.udf.register("callUDFAI", getAfterImage _)

val path = "/FileStore/tables/json_0006_file.txt"  
val df = spark.read.text(path)  // String.
val df2 = df.withColumn("value", from_json(col("value"), MapType(StringType, StringType)))    
val df3 = df2.select(map_values(col("value")))  
val df4 = df3.select($"map_values(value)"(0).as("meta"), $"map_values(value)"(1).as("data"), $"map_values(value)"(2).as("key")).withColumn("parsedMeta", from_json(col("meta"), metaSchema)).drop("meta").select(col("parsedMeta.*"), col("data"), col("key")).withColumn("key2", coalesce(col("key"), lit(""" { "DUMMY_FIELD_XXX": ""} """) )).toDF().cache()
// DataFrame at this stage, not a Dataset.

val df_sales    = df4.filter('table === "BILL.SALES") 
val df_products = df4.filter('table === "BILL.PRODUCTS")
val df_sales_output = df_sales.withColumn("afterImage", afterImage(col("op"), col("data"), col("key2") , lit(""" { "OLD_FIELD_1": ""} """)))
                              .select("afterImage") 
val df_products_output = df_products.withColumn("afterImage", afterImage(col("op"), col("data"), col("key2") , lit(""" { "OLD_FIELD_A":"", "OLD_FIELD_B":""} """)))
                                    .select("afterImage")                          
val df_sales_output_final = df_sales_output.withColumn("parsedSales", from_json(col("afterImage"), salesSchema)) 
// Same from_json call as in the snippet at the top of the question; needed before show() and printSchema() below.
val df_products_output_final = df_products_output.withColumn("parsedProducts", from_json(col("afterImage"), ProductsSchema))
df_products_output_final.show(false)
df_products_output_final.printSchema()

Source: https://stackoverflow.com/questions/75953660/from-json-output-saved-as-null-when-defined-in-schema-as-int-for-spark-dataframe

1 Answer

gopyfrb31#

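The quotes around the values of the PRICE field are what break it. A quoted value such as "4099" is a JSON string, and from_json does not convert a JSON string into an integer field, so that field comes back as null.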
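A minimal sketch of this behavior, reduced to two fields. The hand-written schema, the column names kind and json, and the local SparkSession are assumptions made for illustration; they are not part of the original script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Local session for illustration; in a notebook getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("price-repro").getOrCreate()
import spark.implicits._

// Reduced, hand-written schema (assumption): PRICE is declared as an integer.
val schema = StructType(Seq(
  StructField("PRODUCT_ID", StringType),
  StructField("PRICE", IntegerType)
))

// The two rows differ only in the quotes around the PRICE value.
val rows = Seq(
  ("quoted",   """{"PRODUCT_ID":"230117","PRICE":"4099"}"""),
  ("unquoted", """{"PRODUCT_ID":"230117","PRICE":4099}""")
).toDF("kind", "json")

val result = rows
  .withColumn("parsed", from_json(col("json"), schema))
  .select(col("kind"), col("parsed.PRICE").as("PRICE"))

result.show()
```

With the default parser settings, the quoted row should show a null PRICE and the unquoted row an integer PRICE.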
If you change the input data from:

{ "meta":{ "op":"upd", "table":"BILL.PRODUCTS" }, "data":{ "DESCRIPTION":"XXX" }, "key":{ "PRODUCT_ID":"230117", "DESCRIPTION":"Hamsberry vintage tee, cherry", "PRICE":"4099" }}
{ "meta":{ "op":"upd", "table":"BILL.PRODUCTS" }, "data":{ "PRICE":"4000" }, "key":{ "PRODUCT_ID":"230117", "DESCRIPTION":"Hamsberry vintage tee, cherry", "PRICE":"3599" }}

to

{ "meta":{ "op":"upd", "table":"BILL.PRODUCTS" }, "data":{ "DESCRIPTION":"XXX" }, "key":{ "PRODUCT_ID":"230117", "DESCRIPTION":"Hamsberry vintage tee, cherry", "PRICE":4099 }}
{ "meta":{ "op":"upd", "table":"BILL.PRODUCTS" }, "data":{ "PRICE":4000 }, "key":{ "PRODUCT_ID":"230117", "DESCRIPTION":"Hamsberry vintage tee, cherry", "PRICE":3599 }}

(the only difference is the quotes around the PRICE values),
then you get this output from your script:

+--------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+
|afterImage                                                                                                          |parsedProducts                                     |
+--------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+
|{"DESCRIPTION":"XXX","PRODUCT_ID":"230117","PRICE":4099,"OLD_FIELD_A":"","OLD_FIELD_B":""}                          |{230117, XXX, 4099, null}                          |
|{"PRICE":4000,"PRODUCT_ID":"230117","DESCRIPTION":"Hamsberry vintage tee, cherry","OLD_FIELD_A":"","OLD_FIELD_B":""}|{230117, Hamsberry vintage tee, cherry, 4000, null}|
+--------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+

root
 |-- afterImage: string (nullable = true)
 |-- parsedProducts: struct (nullable = true)
 |    |-- PRODUCT_ID: string (nullable = true)
 |    |-- DESCRIPTION: string (nullable = true)
 |    |-- PRICE: integer (nullable = true)
 |    |-- OLD_FIELD_1: string (nullable = true)

No more null values for PRICE!
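If the input data cannot be changed, one possible workaround (not part of the answer above) is to parse PRICE as a string and cast it to an integer afterwards. The sketch below assumes a hypothetical case class MyProductsRaw that mirrors MyProducts but declares PRICE as Option[String]; the sample row and the local SparkSession are also illustrative.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}

// Hypothetical variant of MyProducts: PRICE is read as a string first.
case class MyProductsRaw(PRODUCT_ID: Option[String], DESCRIPTION: Option[String],
                         PRICE: Option[String], OLD_FIELD_1: Option[String])

// Local session for illustration; in a notebook getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("price-cast").getOrCreate()
import spark.implicits._

val rawSchema = Encoders.product[MyProductsRaw].schema

// Illustrative input row with a quoted PRICE, as in the original data.
val input = Seq(
  """{"PRODUCT_ID":"230117","DESCRIPTION":"XXX","PRICE":"4099"}"""
).toDF("afterImage")

val parsed = input
  .withColumn("parsedProducts", from_json(col("afterImage"), rawSchema))
  // Cast the string to an integer after parsing; non-numeric strings become null.
  .withColumn("PRICE", col("parsedProducts.PRICE").cast("int"))

parsed.printSchema()
parsed.select("PRICE").show()
```

This keeps the JSON untouched and moves the type conversion into an explicit cast, at the cost of one extra column expression per numeric field.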

Answered 2023-04-08