Apache Spark having Dataset of a parameterised/generic class in Java

Question

I've always wondered if having a Dataset of a parameterised/generic class is possible in Java.
To be more clear, what I am looking to achieve is something like this:

Dataset<MyClass<Integer>> myClassInteger;
Dataset<MyClass<String>> myClassString;

Please let me know if this is possible. If you could also show me how to achieve this, I would be very appreciative. Thanks!

Answer 1

Score: 1

Sorry this question is old, but I wanted to put some notes down, since I was able to work with generic/parameterized classes for Datasets in Java. I did this by creating a generic class that takes a type parameter and then putting methods inside that parameterized class, i.e., class MyClassProcessor<T1>, where T1 could be Integer or String.

Unfortunately, you will not enjoy the full benefits of generic types in this case, and you will have to perform some workarounds:

  • I had to use Encoders.kryo(); otherwise the generic types became Object in some operations and could not be cast correctly to the generic type (see the sketch after this list).
    • This introduces some other annoyances, e.g. you can't join directly. I had to use tricks like Tuples to allow for some join operations.
  • I haven't tried reading generic types from a source; my parameterized classes were introduced later using map. For example, I first read TypeA and later worked with Dataset<MyClass<TypeA>>.
  • I was able to use more complex, custom types in the generics, not just Integer, String, etc.
  • There were some annoying details, like having to pass along Class literals, i.e. TypeA.class, and using raw types for certain map functions.
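
A minimal sketch of this approach, assuming a hypothetical wrapper class MyClass<T1> with a single payload field (the class, field, and variable names here are illustrative, not from the original code):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;
import java.util.Arrays;

public class GenericDatasetSketch {
    // Hypothetical generic wrapper class.
    public static class MyClass<T1> implements Serializable {
        public T1 payload;
        public MyClass() { }
        public MyClass(T1 payload) { this.payload = payload; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Generic Dataset sketch")
                .master("local[2]")
                .getOrCreate();

        // Encoders.kryo() only sees the raw class, so type erasure forces
        // an unchecked cast to recover the parameterised encoder type.
        @SuppressWarnings("unchecked")
        Encoder<MyClass<Integer>> encoder =
                (Encoder<MyClass<Integer>>) (Encoder<?>) Encoders.kryo(MyClass.class);

        // Read or create a plain Dataset first, then introduce the
        // generic type via map, as described above.
        Dataset<Integer> numbers =
                spark.createDataset(Arrays.asList(1, 2, 3), Encoders.INT());

        Dataset<MyClass<Integer>> wrapped = numbers.map(
                (MapFunction<Integer, MyClass<Integer>>) MyClass::new,
                encoder);

        // Kryo-encoded rows display as a single opaque binary column.
        wrapped.show();

        spark.stop();
    }
}

For the join annoyance mentioned above, one workaround in the same spirit is to map each side to a Tuple2 of (join key, wrapped value) using Encoders.tuple(...), join on the key column, and then map back to the generic type.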

Answer 2

Score: -1

Yes, you can have a Dataset of your own class. It would look like Dataset<MyOwnClass>.

In the code below, I have tried to read a file's contents and put them into a Dataset of a class that we have created. Please check the snippet below.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;

public class FileDataset {
    public static class Employee implements Serializable {
        private int key;
        private int value;

        // Encoders.bean() discovers columns through JavaBean getters and
        // setters, so the fields need accessors rather than being bare
        // public fields.
        public int getKey() { return key; }
        public void setKey(int key) { this.key = key; }
        public int getValue() { return value; }
        public void setValue(int value) { this.value = value; }
    }

    public static void main(String[] args) {
        // configure spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Reading JSON File into DataSet")
                .master("local[2]")
                .getOrCreate();

        final Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);

        final String jsonPath = "/Users/ajaychoudhary/Documents/student.txt";

        // read the JSON-lines file and cast each row to an Employee
        Dataset<Employee> ds = spark.read()
                .json(jsonPath)
                .as(employeeEncoder);
        ds.show();
    }
}

The content of my student.txt file is:

{ "key": 1, "value": 2 }
{ "key": 3, "value": 4 }
{ "key": 5, "value": 6 }

It produces the following output on the console:

+---+-----+
|key|value|
+---+-----+
|  1|    2|
|  3|    4|
|  5|    6|
+---+-----+

I hope this gives you an initial idea of how you can have a Dataset of your own custom class.
