How to cast nested struct in Java Spark (Unsupported NullType)
Question
I am loading MongoDB data into a Hive table and am trying to resolve the Unsupported NullType error thrown by saveAsTable.
Sample data schema:
root
|-- level1: struct (nullable = true)
| |-- level2: struct (nullable = true)
| | |-- level3_1: null (nullable = true)
| | |-- level3_2: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- level4: null (nullable = true)
I tried functions.lit, for example:
df = df.withColumn("level1.level2.level3_1", functions.lit("null").cast("string"))
       .withColumn("level1.level2.level3_2.level4", functions.lit("null").cast("string"));
but the result looks like this:
root
|-- level1: struct (nullable = true)
| |-- level2: struct (nullable = true)
| | |-- level3_1: null (nullable = true)
| | |-- level3_2: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- level4: null (nullable = true)
|-- level1.level2.level3_1: string (nullable = false)
|-- level1.level2.level3_2.level4: string (nullable = false)
I also checked df.na().fill(), but it does not seem to change the schema, since it only replaces null values rather than altering column types.
The desired result is:
root
|-- level1: struct (nullable = true)
| |-- level2: struct (nullable = true)
| | |-- level3_1: string (nullable = true)
| | |-- level3_2: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- level4: string (nullable = true)
so that I can save the loaded MongoDB data as a table in Hive.
Has anyone worked on this and could give me some advice on how to cast a nested NullType, or how to handle NullType in Java? Please consider a systematic/general solution that can scale to more complex data.
Many thanks.
Answer 1
Score: 1
One idea is to create a schema with StringType and read the data with that schema.
import static org.apache.spark.sql.types.DataTypes.*;
import java.util.ArrayList;
import java.util.Arrays;

StructType schema = createStructType(Arrays.asList(
    createStructField("level1", createStructType(Arrays.asList(
        createStructField("level2", createStructType(Arrays.asList(
            createStructField("level3_1", StringType, true),
            createStructField("level3_2", createArrayType(createStructType(Arrays.asList(
                createStructField("level4", StringType, true)))), true)
        )), true))), true)));

// Replace new ArrayList<>() with your dataset.
Dataset<Row> df = ss.createDataFrame(new ArrayList<>(), schema);
df.printSchema();
root
|-- level1: struct (nullable = true)
| |-- level2: struct (nullable = true)
| | |-- level3_1: string (nullable = true)
| | |-- level3_2: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- level4: string (nullable = true)
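As a side note, withColumn treats a dotted name like "level1.level2.level3_1" as a literal top-level column name, which is why the attempt in the question appended new columns instead of replacing the nested fields; supplying the full schema up front sidesteps that problem. Below is a minimal sketch of the "read data with the schema" step, assuming the MongoDB Spark connector; the format name, URI, and table name are placeholder assumptions, not part of the original answer.

// Sketch under assumptions: the format and option names depend on your
// mongo-spark-connector version; the URI and "my_hive_table" are placeholders.
Dataset<Row> df = ss.read()
    .format("mongo")
    .option("uri", "mongodb://localhost/mydb.mycollection") // hypothetical URI
    .schema(schema) // the explicit StringType schema above, so nothing is inferred as NullType
    .load();
df.write().saveAsTable("my_hive_table"); // should no longer fail with Unsupported NullType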
EDIT:
I added a more intuitive example here to convey my idea. I hope it helps you.
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.types.DataTypes.*;

@Test
public void test() {
    SparkSession ss = SparkSession.builder().master("local").appName("test").getOrCreate();

    // Step 1) Read your MongoDB data. (I added the NullType field 'level' manually for explanation.)
    // https://docs.mongodb.com/spark-connector/master/python/read-from-mongodb/
    Dataset<Row> data = ss.read().json("test.json").withColumn("level", lit(null));
    data.printSchema();

    StructType schema = createStructType(Arrays.asList(
        createStructField("_id", LongType, true),
        createStructField("level", StringType, true)));

    // Step 2) Create newData using the schema you defined.
    Dataset<Row> newData = ss.createDataFrame(data.collectAsList(), schema);
    newData.printSchema();

    // Step 3) Load newData into Hive.
}
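Regarding the question's request for a systematic/general solution: the same idea can be automated by rewriting the inferred schema programmatically instead of spelling it out by hand. The sketch below is my own illustration, not part of the original answer; replaceNullType is a hypothetical helper that walks a DataType and swaps every NullType for StringType, after which the corrected schema is re-applied with createDataFrame.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.*;

// Hypothetical helper: recursively rebuild a DataType, replacing every
// NullType with StringType so saveAsTable no longer sees NullType.
static DataType replaceNullType(DataType dt) {
    if (dt instanceof NullType) {
        return DataTypes.StringType;
    }
    if (dt instanceof StructType) {
        StructField[] src = ((StructType) dt).fields();
        StructField[] dst = new StructField[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = new StructField(src[i].name(),
                replaceNullType(src[i].dataType()),
                src[i].nullable(), src[i].metadata());
        }
        return new StructType(dst);
    }
    if (dt instanceof ArrayType) {
        ArrayType at = (ArrayType) dt;
        return new ArrayType(replaceNullType(at.elementType()), at.containsNull());
    }
    if (dt instanceof MapType) {
        MapType mt = (MapType) dt;
        return new MapType(replaceNullType(mt.keyType()),
            replaceNullType(mt.valueType()), mt.valueContainsNull());
    }
    return dt;
}

// Usage sketch: null values remain valid for nullable string fields,
// so the existing rows can be reused with the rewritten schema.
// StructType fixed = (StructType) replaceNullType(df.schema());
// Dataset<Row> fixedDf = ss.createDataFrame(df.javaRDD(), fixed);

Unlike collectAsList() in the example above, going through df.javaRDD() does not pull all data to the driver, and the recursion covers arbitrarily nested structs, arrays, and maps.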