Explode a nested array into new columns using Java Spark

Question

I have a nested array, and I want to put each of its elements into its own new column. This is what I have so far. I tried writing two approaches, but neither worked. The uncommented code currently fails with this error:

> cannot resolve 'split(response.indicator, ',')' due to data type mismatch: argument 1 requires string type, however, 'response.indicator' is of array<struct<_VALUE:string,_number:bigint>> type.;;

    File.withColumn("response.indicator", explode(col("response.ind")))
        .withColumn("response.indicator", split(col("response.indicator"), ","))
        .withColumn("key", col("response.indicator").getItem(1))
        .withColumn("value", col("response.indicator").getItem(0))
        .groupBy("ID")
        .pivot("key")
        .agg(first("value"))
        .show(true);

    /*File.select("response.indicator").collectAsList().forEach(row -> {
        String name = String.valueOf(row.getList(0).get(1));
        String value = String.valueOf(row.getList(0).get(0));
        File.withColumn(name, col(value));
    });*/

Here is the schema:

     |-- ID: integer (nullable = true)
     |-- response: struct (nullable = true)
     |    |-- indicator: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- _VALUE: string (nullable = true)
     |    |    |    |-- _number: long (nullable = true)

What my data looks like:

    +--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |ID                  |response                                                                                                                                                                                               |
    +--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | 1                  |[WrappedArray([N,7], [N,8], [N,9], [N,19], [N,20], [N,22], [N,12], [N,1], [N,2], [N,3], [N,4], [N,5], [N,6], [N,10], [N,11], [N,13], [N,14], [N,15], [N,16], [N,17], [N,18], [N,21], [N,25], [N,26])]  |
    | 2                  |[WrappedArray([Y,1], [N,8], [N,9], [N,19], [N,22], [Y,22], [N,20], [Y,7], [Y,23], [N,3], [Y,4], [N,11], [N,6], [Y,27], [N,5], [N,13], [N,14], [N,15], [Y,16], [N,17], [Y,18], [N,21], [N,25], [N,26])] |
    +--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

What I want it to look like:

    +----+---+------+---+-----+
    | ID | 1 | 2    | 3 | etc |
    +----+---+------+---+-----+
    | 1  | N | N    | N | etc |
    | 2  | Y | NULL | N | etc |
    +----+---+------+---+-----+

Answer 1

Score: 0

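Your problem is that `split(col("response.indicator"), ",")` expects a string column, whereas `response.indicator` is actually an array of structs, and each element is still a struct after `explode`. To "unfold" a struct named `s` into one column per field, you can select `s.*` as follows: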

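    // I use the names provided in the schema, not the ones from your code.
    // Assumes: import static org.apache.spark.sql.functions.*;
    File.withColumn("indicator", explode(col("response.indicator")))  // one row per array element
        .select("ID", "indicator.*")  // expand the struct into _VALUE and _number columns
        .groupBy("ID")
        .pivot("_number")             // one output column per distinct _number
        .agg(first("_VALUE"))         // field name spelled as in the schema
        .show();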
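
To make the reshaping easier to picture, here is a minimal plain-Java sketch of what an explode followed by `groupBy("ID").pivot("_number").agg(first(...))` computes, run on in-memory data without Spark. It is only an illustration: `PivotSketch`, `Indicator`, `Row` and `pivot` are hypothetical names invented for this example, not part of any Spark API, and the nested maps stand in for the pivoted columns.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PivotSketch {
    /** Hypothetical stand-in for one element of response.indicator. */
    public static class Indicator {
        final String value;   // corresponds to _VALUE
        final long number;    // corresponds to _number

        public Indicator(String value, long number) {
            this.value = value;
            this.number = number;
        }
    }

    /** Hypothetical stand-in for one input row: an ID plus its indicator array. */
    public static class Row {
        final int id;
        final List<Indicator> indicators;

        public Row(int id, List<Indicator> indicators) {
            this.id = id;
            this.indicators = indicators;
        }
    }

    /**
     * Returns ID -> (_number -> _VALUE). A _number that never occurs for an ID
     * is simply absent from that ID's map, which plays the role of NULL in the
     * pivoted table.
     */
    public static Map<Integer, Map<Long, String>> pivot(List<Row> rows) {
        Map<Integer, Map<Long, String>> result = new TreeMap<>();
        for (Row row : rows) {
            // "groupBy": one map of columns per ID
            Map<Long, String> columns =
                    result.computeIfAbsent(row.id, k -> new TreeMap<>());
            // "explode": visit every array element separately
            for (Indicator ind : row.indicators) {
                // "first": keep the earliest value seen for this _number
                columns.putIfAbsent(ind.number, ind.value);
            }
        }
        return result;
    }
}
```

Calling `PivotSketch.pivot` on rows shaped like the sample data yields, for ID 2, a map with no entry for `2`, matching the NULL in the desired output. Note that ID 2 contains `_number` 22 twice (`[N,22]` and `[Y,22]`): this sketch keeps the one that comes first in the array, whereas Spark's `first` makes no ordering guarantee after a shuffle, so duplicates like this may need an explicit rule.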

huangapple
  • Posted on March 7, 2023, 13:07:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75658224.html