April 7, 2023 04:49:29

from_json output saved as null when defined in schema as Int for Spark Dataframe

Question

I am using from_json with a schema derived from a case class via Encoders, but working only with a DataFrame (DF), not a Dataset (DS), as shown below:

case class MyProducts(PRODUCT_ID: Option[String], DESCRIPTION: Option[String], PRICE: Option[Int], OLD_FIELD_1: Option[String]) 
val ProductsSchema = Encoders.product[MyProducts].schema

val df_products_output_final = df_products_output.withColumn("parsedProducts", from_json(col("afterImage"), ProductsSchema))

> 1. When PRICE is defined as Int, I get a null value in the field.
> 2. When PRICE is defined as String, I get the value as a String.
> 3. The DF schema shows PRICE as an integer, as expected.

What is the issue here?

Code:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import spark.implicits._
import org.apache.spark.sql.functions.{col, lit, when, from_json, map_keys, map_values, regexp_replace, coalesce}
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{MapType, StringType, StructType, IntegerType}

case class MyMeta(op: String, table: String)
val metaSchema = Encoders.product[MyMeta].schema
case class MySales(NUM: Option[Integer], PRODUCT_ID: Option[String], DESCRIPTION: Option[String], OLD_FIELD_1: Option[String]) 
val salesSchema = Encoders.product[MySales].schema
case class MyProducts(PRODUCT_ID: Option[String], DESCRIPTION: Option[String], PRICE: Option[Int], OLD_FIELD_1: Option[String]) 
val ProductsSchema = Encoders.product[MyProducts].schema

// Builds the "after image" of a row as a JSON string from one change record.
def getAfterImage(op: String, data: String, key: String, jsonOLD_TABLE_FIELDS: String): String = {
  val jsonOLD_FIELDS = parse(jsonOLD_TABLE_FIELDS)
  val jsonData = parse(data)
  val jsonKey = parse(key)

  op match {
    case "ins" =>
      // Insert: data already holds the full row.
      compact(render(jsonData merge jsonOLD_FIELDS))
    case _ =>
      // Otherwise: new values from data, plus the key fields that data does not mention.
      val Diff(changed, _, deleted) = jsonKey diff jsonData
      compact(render(changed merge deleted merge jsonOLD_FIELDS))
  }
}
val afterImage = spark.udf.register("callUDFAI", getAfterImage _)

val path = "/FileStore/tables/json_0006_file.txt"
val df = spark.read.text(path)  // One String column named "value".
val df2 = df.withColumn("value", from_json(col("value"), MapType(StringType, StringType)))
val df3 = df2.select(map_values(col("value")))
val df4 = df3
  .select($"map_values(value)"(0).as("meta"), $"map_values(value)"(1).as("data"), $"map_values(value)"(2).as("key"))
  .withColumn("parsedMeta", from_json(col("meta"), metaSchema))
  .drop("meta")
  .select(col("parsedMeta.*"), col("data"), col("key"))
  .withColumn("key2", coalesce(col("key"), lit(""" { "DUMMY_FIELD_XXX": ""} """)))
  .toDF()
  .cache()
// Still a DF at this stage, not a DS.

val df_sales    = df4.filter('table === "BILL.SALES")
val df_products = df4.filter('table === "BILL.PRODUCTS")
val df_sales_output = df_sales.withColumn("afterImage", afterImage(col("op"), col("data"), col("key2"), lit(""" { "OLD_FIELD_1": ""} """)))
                              .select("afterImage")
val df_products_output = df_products.withColumn("afterImage", afterImage(col("op"), col("data"), col("key2"), lit(""" { "OLD_FIELD_A":"", "OLD_FIELD_B":""} """)))
                                    .select("afterImage")
val df_sales_output_final = df_sales_output.withColumn("parsedSales", from_json(col("afterImage"), salesSchema))
val df_products_output_final = df_products_output.withColumn("parsedProducts", from_json(col("afterImage"), ProductsSchema))
df_products_output_final.show(false)
df_products_output_final.printSchema()

Answer 1

Score: 1

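The quotes around the values of your PRICE field are the problem: they make PRICE a JSON string, and from_json gives you null for a string value when the schema declares that field as an integer.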

If you change your input data from:

{ "meta":{ "op":"upd", "table":"BILL.PRODUCTS" }, "data":{ "DESCRIPTION":"XXX" }, "key":{ "PRODUCT_ID":"230117", "DESCRIPTION":"Hamsberry vintage tee, cherry", "PRICE":"4099" }}
{ "meta":{ "op":"upd", "table":"BILL.PRODUCTS" }, "data":{ "PRICE":"4000" }, "key":{ "PRODUCT_ID":"230117", "DESCRIPTION":"Hamsberry vintage tee, cherry", "PRICE":"3599" }}

to:

{ "meta":{ "op":"upd", "table":"BILL.PRODUCTS" }, "data":{ "DESCRIPTION":"XXX" }, "key":{ "PRODUCT_ID":"230117", "DESCRIPTION":"Hamsberry vintage tee, cherry", "PRICE":4099 }}
{ "meta":{ "op":"upd", "table":"BILL.PRODUCTS" }, "data":{ "PRICE":4000 }, "key":{ "PRODUCT_ID":"230117", "DESCRIPTION":"Hamsberry vintage tee, cherry", "PRICE":3599 }}

(the only difference is the quotes around the PRICE values).

Then you get this output from your script:

+--------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+
|afterImage                                                                                                          |parsedProducts                                     |
+--------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+
|{"DESCRIPTION":"XXX","PRODUCT_ID":"230117","PRICE":4099,"OLD_FIELD_A":"","OLD_FIELD_B":""}                          |{230117, XXX, 4099, null}                          |
|{"PRICE":4000,"PRODUCT_ID":"230117","DESCRIPTION":"Hamsberry vintage tee, cherry","OLD_FIELD_A":"","OLD_FIELD_B":""}|{230117, Hamsberry vintage tee, cherry, 4000, null}|
+--------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+

root
 |-- afterImage: string (nullable = true)
 |-- parsedProducts: struct (nullable = true)
 |    |-- PRODUCT_ID: string (nullable = true)
 |    |-- DESCRIPTION: string (nullable = true)
 |    |-- PRICE: integer (nullable = true)
 |    |-- OLD_FIELD_1: string (nullable = true)

No more null values for PRICE!
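
If you can't change the input data, one possible workaround is to declare PRICE as a string in the schema and cast it to an integer after parsing. This is only a minimal sketch under assumptions: it creates its own local SparkSession, and the two sample records, the explicit schemas intSchema and stringSchema, and the column names priceFromIntSchema and priceCast are made up for illustration and are not part of your script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session for the sketch; in a notebook this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("price-cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample: one record with PRICE quoted (as in the original input), one unquoted.
val input = Seq(
  """{"PRODUCT_ID":"230117","PRICE":"4099"}""",
  """{"PRODUCT_ID":"230117","PRICE":4000}"""
).toDF("afterImage")

// Schema that declares PRICE as an integer (the case that produced null in the question).
val intSchema = StructType(Seq(
  StructField("PRODUCT_ID", StringType),
  StructField("PRICE", IntegerType)))

// Schema that declares PRICE as a string; the value is cast to an integer after parsing.
val stringSchema = StructType(Seq(
  StructField("PRODUCT_ID", StringType),
  StructField("PRICE", StringType)))

val result = input
  .withColumn("priceFromIntSchema", from_json(col("afterImage"), intSchema).getField("PRICE"))
  .withColumn("priceCast", from_json(col("afterImage"), stringSchema).getField("PRICE").cast(IntegerType))

// Compare the two columns side by side.
result.select("afterImage", "priceFromIntSchema", "priceCast").show(false)
```

With your case classes the same idea would mean changing PRICE to Option[String] in MyProducts and adding a cast on parsedProducts.PRICE afterwards. Note that a value that is not a valid integer ends up as null after the cast, so this hides bad data rather than reporting it.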
