Convert an array of arrays of strings to a Spark DataFrame of string arrays in Java
Question
I'm trying to convert a String[][] into a Dataset<Row> with a single column whose values are String[].
I have gone through the docs and the examples available online but could not find anything similar. I don't know whether it's possible, as I'm a complete beginner in Spark.
Sample input:
String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};
Sample output:
Dataset<Row> test_df
test_df.show()
+-------------+
|          foo|
+-------------+
|      [test1]|
|[test2,test3]|
|[test4,test5]|
+-------------+
I'm probably defining the StructType wrong for String[][]; I've tried several different ways.
Here's what I'm trying to do:
String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};
List<String[]> test1 = Arrays.asList(test);
StructType structType = DataTypes.createStructType(
DataTypes.createStructField(
"foo",
DataTypes.createArrayType(DataTypes.StringType),
true));
Dataset<Row> t = spark.createDataFrame(test1, structType);
t.show();
Answer 1
Score: 1
The problem with your code is that you are calling spark.createDataFrame(List<Row>, StructType), which takes a list of Row objects, but you are passing it a list of arrays.
There are several ways to overcome it:
- Create a Row from each of the arrays, and then apply the method you have been using.
- Create a dataset of string arrays using a bean encoder and then convert it to a dataset of Row using a row encoder.
- Create the dataframe using a Java Bean.
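The first option in the list above can be sketched roughly as follows. This is only an illustration, not code from the original answer: the class name RowsFromArrays, the helper methods schema() and toRows(), and the local[*] master setting are assumptions made for the example. It wraps each inner array in a Row with RowFactory.create and passes an array of StructField to createStructType.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper class, named for this example only.
public class RowsFromArrays {

    // One nullable column named "foo" holding an array of strings.
    static StructType schema() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField(
                "foo", DataTypes.createArrayType(DataTypes.StringType), true)
        });
    }

    // Wrap each inner array in a single-column Row.
    static List<Row> toRows(String[][] arrays) {
        List<Row> rows = new ArrayList<>();
        for (String[] inner : arrays) {
            // The cast to Object stops varargs from spreading the array
            // across several columns.
            rows.add(RowFactory.create((Object) inner));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed local session for the sketch; reuse your own SparkSession instead.
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("rows-from-arrays")
            .getOrCreate();

        String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};
        Dataset<Row> df = spark.createDataFrame(toRows(test), schema());
        df.show();

        spark.stop();
    }
}
```

The (Object) cast is the detail most easily missed: without it, a String[] is treated as the varargs array itself, and each element becomes its own column.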
I think the last method is the easiest, so here is how you do it. You have to define a small Java bean whose only instance variable is a String array.
public static class ArrayWrapper {
    private String[] foo;
    public ArrayWrapper(String[] foo) {
        this.foo = foo;
    }
    public String[] getFoo() {
        return foo;
    }
    public void setFoo(String[] foo) {
        this.foo = foo;
    }
}
Make sure the Java Bean has a constructor that accepts a String array.
Then, to create the dataframe, you first create a list of ArrayWrapper (your Java Bean) from the array of arrays, and then make a dataframe using the createDataFrame(List<?>, Class<?>) method.
// Wrap each inner array in the bean, then let Spark infer the schema from the bean class.
String[][] test = {{"test1"}, {"test2", "test3"}, {"test4", "test5"}};
List<ArrayWrapper> list = Arrays.stream(test)
        .map(ArrayWrapper::new)
        .collect(Collectors.toList());
Dataset<Row> testDF = spark.createDataFrame(list, ArrayWrapper.class);
testDF.show();
The name of the column is determined by the bean property name, which comes from the getter and setter (getFoo/setFoo give foo); here it matches the name of the instance variable.
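Strictly speaking, JavaBeans conventions derive property names from the getter and setter methods rather than from the private field; in ArrayWrapper both happen to give foo. The sketch below uses only the standard java.beans.Introspector to list the properties a bean exposes. The class names BeanPropertyNames and Wrapper are made up for the example, and it is an assumption here that Spark's bean-based schema inference reads property names in the same way.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, named for this example only.
public class BeanPropertyNames {

    // Same shape as ArrayWrapper, but the field is deliberately named "values"
    // while the accessors are still getFoo/setFoo.
    public static class Wrapper {
        private String[] values;

        public Wrapper(String[] values) {
            this.values = values;
        }

        public String[] getFoo() {
            return values;
        }

        public void setFoo(String[] foo) {
            this.values = foo;
        }
    }

    // List the JavaBeans property names of a class.
    static List<String> propertyNames(Class<?> beanClass) throws IntrospectionException {
        // Stopping at Object.class skips the inherited "class" property.
        BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            names.add(pd.getName());
        }
        return names;
    }

    public static void main(String[] args) throws IntrospectionException {
        System.out.println(propertyNames(Wrapper.class)); // prints [foo]
    }
}
```

So if you want a different column name, renaming the getter and setter is what matters under these conventions; renaming only the private field would not change the reported property.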