Convert an array of arrays of strings to a Spark DataFrame of arrays of strings in Java

Question

I'm trying to convert a String[][] into a Dataset<Row> column consisting of String[].
I have gone through the docs and the examples available online, but could not find anything similar. I don't know whether this is possible, as I'm a complete beginner with Spark.

Sample input:

String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};

Sample output:

Dataset<Row> test_df
test_df.show()
+-------------+
|          foo|
+-------------+
|      [test1]|
|[test2,test3]|
|[test4,test5]|
+-------------+

I'm probably defining the StructType for String[][] incorrectly; I've tried different ways as well.
Here's what I'm trying to do:

String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};

List<String[]> test1 = Arrays.asList(test);

StructType structType = DataTypes.createStructType(
    new StructField[] {
        DataTypes.createStructField(
            "foo",
            DataTypes.createArrayType(DataTypes.StringType),
            true)
    });

Dataset<Row> t = spark.createDataFrame(test1, structType);
t.show();

Answer 1

Score: 1

The problem with your code is that you are trying to use a method, spark.createDataFrame(List<Row>, StructType), which takes a list of Row objects, but you are calling it with a list of arrays.

There are several ways to work around this:

  • Create a Row from each array and then apply the method you have been using (see the sketch after this list).
  • Create a Dataset of string arrays using a bean encoder and then convert it to a Dataset of Row using a row encoder.
  • Create the DataFrame using a Java Bean.
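
If you prefer the first option, here is a minimal sketch; the class name RowApproach and the local SparkSession setup are illustrative assumptions, not part of the original answer. The key step is wrapping each String[] in a one-field Row so that the createDataFrame(List<Row>, StructType) overload from the question gets the element type it expects.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RowApproach {
    public static void main(String[] args) {
        // Assumption: a local session just for this example.
        SparkSession spark = SparkSession.builder()
                .appName("row-approach")
                .master("local[*]")
                .getOrCreate();

        String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};

        // Wrap each String[] in a one-field Row; the (Object) cast keeps the
        // varargs RowFactory.create from spreading the array into several fields.
        List<Row> rows = Arrays.stream(test)
                .map(arr -> RowFactory.create((Object) arr))
                .collect(Collectors.toList());

        // Schema: a single nullable column "foo" holding an array of strings.
        StructType structType = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("foo",
                        DataTypes.createArrayType(DataTypes.StringType), true)
        });

        Dataset<Row> testDF = spark.createDataFrame(rows, structType);
        testDF.show();

        spark.stop();
    }
}

This yields the same single foo column as the bean-based version below, but you have to build the schema by hand.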

I think the last method is the easiest, so here is how to do it. Define a small Java Bean whose only instance variable is a String array.

public static class ArrayWrapper {
    private String[] foo;

    public ArrayWrapper(String[] foo) {
        this.foo = foo;
    }

    public String[] getFoo() {
        return foo;
    }

    public void setFoo(String[] foo) {
        this.foo = foo;
    }
}

Make sure the Java Bean has a constructor that accepts a String array.

Then, to create the DataFrame, first build a list of ArrayWrapper (your Java Bean) from the array of arrays, and then call the createDataFrame(List<?>, Class<?>) method.

String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};
List<ArrayWrapper> list = Arrays.stream(test).map(ArrayWrapper::new).collect(Collectors.toList());
Dataset<Row> testDF = spark.createDataFrame(list, ArrayWrapper.class);
testDF.show();

The name of the column is determined by the name of the instance variable in the Java Bean.
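
As a quick check (assuming the sample data above and a running Spark session), testDF.show() should print a single column named foo, matching the expected output from the question; note that Spark separates array elements with ", ", so it looks roughly like this:

+--------------+
|           foo|
+--------------+
|       [test1]|
|[test2, test3]|
|[test4, test5]|
+--------------+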

