Apache Spark having Dataset of a parameterised/generic class in Java

Question

I've always wondered whether having a Dataset of a parameterised/generic class is possible in Java.
To be more clear, what I am looking to achieve is something like this:

Dataset<MyClass<Integer>> myClassInteger;
Dataset<MyClass<String>> myClassString;

Please let me know if this is possible. If you could also show me how to achieve this, I would be very appreciative. Thanks!

Answer 1

Score: 1

Sorry this question is old, but I wanted to put some notes down, since I was able to work with generic/parameterized classes for Datasets in Java by creating a generic class that took a type parameter and then putting methods inside that parameterized class, i.e. class MyClassProcessor<T1>, where T1 could be Integer or String.

Unfortunately, you will not enjoy the full benefits of generic types in this case, and you will have to perform some workarounds:

  • I had to use Encoders.kryo(); otherwise the generic type became Object under some operations and could not be cast correctly back to the generic type (see the sketch after this list).
    • This introduces some other annoyances, e.g. you can't join directly. I had to use tricks like Tuples to allow for some join operations.
  • I haven't tried reading generic types directly; my parameterized classes were introduced later using map. For example, I read TypeA first and then worked with Dataset<MyClass<TypeA>>.
  • I was able to use more complex, custom types in the generics, not just Integer, String, etc.
  • There were some annoying details, like having to pass along Class literals, e.g. TypeA.class, and using raw types for certain map functions, etc.
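
A minimal sketch of this pattern, assuming a hypothetical wrapper class MyClass<T> (the class and variable names are illustrative, not from the original answer):

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class GenericDatasetSketch {
    // Hypothetical generic wrapper, standing in for the answer's MyClassProcessor<T1>.
    public static class MyClass<T> implements Serializable {
        public T payload;
        public MyClass() { }
        public MyClass(T payload) { this.payload = payload; }
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("GenericDatasetSketch")
                .master("local[2]")
                .getOrCreate();

        // Start from a Dataset of the plain type; the generic class is
        // introduced later via map, as described above.
        Dataset<Integer> ints = spark.createDataset(Arrays.asList(1, 2, 3), Encoders.INT());

        // Encoders.bean() cannot express the type parameter, so a Kryo encoder
        // is used. Note the class-literal workaround: MyClass.class is raw, so
        // it has to be cast to the parameterized Class object.
        Encoder<MyClass<Integer>> encoder =
                Encoders.kryo((Class<MyClass<Integer>>) (Class<?>) MyClass.class);

        Dataset<MyClass<Integer>> wrapped = ints.map(
                (MapFunction<Integer, MyClass<Integer>>) MyClass::new, encoder);

        for (MyClass<Integer> m : wrapped.collectAsList()) {
            System.out.println(m.payload);
        }

        spark.stop();
    }
}

With Encoders.kryo() the Dataset is stored as a single binary column, so wrapped.show() prints serialized bytes rather than named fields; that is also the root of the join limitations mentioned above.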

Answer 2

Score: -1

Yes, you can have a Dataset of your own class. It would look like Dataset<MyOwnClass>.

In the code below I have tried to read a file's content and put it in a Dataset of the class that we have created. Please check the snippet below.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;

public class FileDataset {
    public static class Employee implements Serializable {
        // Encoders.bean() relies on JavaBean getters/setters, and the fields
        // are long because Spark's JSON reader infers whole numbers as bigint.
        private long key;
        private long value;

        public long getKey() { return key; }
        public void setKey(long key) { this.key = key; }
        public long getValue() { return value; }
        public void setValue(long value) { this.value = value; }
    }

    public static void main(String[] args) {
        // configure Spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Reading JSON File into DataSet")
                .master("local[2]")
                .getOrCreate();

        final Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);

        final String jsonPath = "/Users/ajaychoudhary/Documents/student.txt";

        // read the JSON-lines file into a typed Dataset
        Dataset<Employee> ds = spark.read()
                .json(jsonPath)
                .as(employeeEncoder);
        ds.show();
    }
}

The content of my student.txt file is:

{ "key": 1, "value": 2 }
{ "key": 3, "value": 4 }
{ "key": 5, "value": 6 }

It produces the following output on the console:

+---+-----+
|key|value|
+---+-----+
|  1|    2|
|  3|    4|
|  5|    6|
+---+-----+

I hope this gives you an initial idea of how you can have a Dataset of your own custom class.
