Apache Spark having Dataset of a parameterised/generic class in Java
Question
I've always wondered whether having a Dataset of a parameterised/generic class is possible in Java.
To be more clear, what I am looking to achieve is something like this:
Dataset<MyClass<Integer>> myClassInteger;
Dataset<MyClass<String>> myClassString;
Please let me know if this is possible. If you could also show me how to achieve this, I would be very appreciative. Thanks!
Answer 1
Score: 1
Sorry this question is old, but I wanted to put some notes down, since I was able to work with generic/parameterized classes for Datasets in Java by creating a generic class that takes a type parameter and then putting methods inside that parameterized class. I.e., class MyClassProcessor<T1>, where T1 could be Integer or String.
Unfortunately, you will not enjoy the full benefits of generic types in this case, and you will have to perform some workarounds:
- I had to use Encoders.kryo(); otherwise the generic types became Object with some operations and could not be cast correctly to the generic type.
- This introduces some other annoyances, e.g. you can't join. I had to use tricks such as Tuples to allow for some join operations.
- I haven't tried reading generic types; my parameterized classes were introduced later using map. For example, I read TypeA and later worked with Dataset<MyClass<TypeA>>.
- I was able to use more complex, custom types in the generics, not just Integer, String, etc.
- There were some annoying details, like having to pass along Class literals, i.e. TypeA.class, and using raw types for certain map functions, etc.
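The Class-literal workaround from the last bullet can be sketched without any Spark dependencies. This is a minimal illustration, not the original author's code; the MyClassProcessor fields and methods here are hypothetical names chosen to show the pattern: because Java erases type arguments at runtime, the caller passes a Class<T1> token so the processor can cast erased values back to T1.

```java
// Spark-free sketch of the "pass along Class literals" workaround.
// MyClassProcessor and its members are illustrative, not a Spark API.
public class TypeTokenSketch {
    static class MyClassProcessor<T1> {
        private final Class<T1> type; // the Class literal mentioned above

        MyClassProcessor(Class<T1> type) {
            this.type = type;
        }

        // Safely cast an erased Object back to T1, as needed after kryo decoding.
        T1 castBack(Object erased) {
            return type.cast(erased);
        }

        String describe() {
            return "processor for " + type.getSimpleName();
        }
    }

    public static void main(String[] args) {
        MyClassProcessor<Integer> ints = new MyClassProcessor<>(Integer.class);
        MyClassProcessor<String> strs = new MyClassProcessor<>(String.class);

        Object erased = 42; // what a generic field looks like after erasure
        Integer recovered = ints.castBack(erased);

        System.out.println(ints.describe()); // processor for Integer
        System.out.println(strs.describe()); // processor for String
        System.out.println(recovered + 1);   // 43
    }
}
```

The same token would then be handed to Encoders.kryo(...) when building the Dataset, since the compiler cannot supply the erased type on its own.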
Answer 2
Score: -1
Yes, you can have a Dataset of your own class. It would look like Dataset<MyOwnClass>.
In the code below I have tried to read file content and put it in a Dataset of the class we have created. Please check the snippet below.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;

public class FileDataset {
    public static class Employee implements Serializable {
        public int key;
        public int value;
    }

    public static void main(String[] args) {
        // configure spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Reading JSON File into DataSet")
                .master("local[2]")
                .getOrCreate();

        final Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
        final String jsonPath = "/Users/ajaychoudhary/Documents/student.txt";

        // read JSON file to Dataset
        Dataset<Employee> ds = spark.read()
                .json(jsonPath)
                .as(employeeEncoder);
        ds.show();
    }
}
The content of my student.txt file is:
{ "key": 1, "value": 2 }
{ "key": 3, "value": 4 }
{ "key": 5, "value": 6 }
It produces the following output on the console:
+---+-----+
|key|value|
+---+-----+
| 1| 2|
| 3| 4|
| 5| 6|
+---+-----+
I hope this gives you an initial idea of how you can have a Dataset of your own custom class.
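Tying this back to the original question: a plain bean like Employee works with Encoders.bean, but a generic bean runs into type erasure, which is why the first answer falls back to Encoders.kryo. A small Spark-free sketch of the problem (MyClass here is an illustrative bean, not from the code above):

```java
import java.util.Arrays;

// Illustrates why a bean encoder cannot be handed a parameterized type
// directly: Java erases the type argument at runtime, so MyClass<Integer>
// and MyClass<String> share a single runtime Class object.
public class ErasureDemo {
    public static class MyClass<T> {
        public T value;
        public MyClass(T value) { this.value = value; }
    }

    public static void main(String[] args) {
        MyClass<Integer> asInt = new MyClass<>(1);
        MyClass<String> asStr = new MyClass<>("one");

        // Both instances report the same runtime class: the parameter is erased.
        System.out.println(asInt.getClass() == asStr.getClass()); // true
        System.out.println(asInt.getClass().getSimpleName());     // MyClass

        // There is no MyClass<Integer>.class literal to give to a bean encoder;
        // only the raw type parameter survives in the class metadata.
        System.out.println(Arrays.toString(MyClass.class.getTypeParameters()));
    }
}
```

This is the erasure limitation the first answer works around with Encoders.kryo and explicit Class literals.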