Interoperability: sharing Datasets of objects or Rows between Java and Scala, both ways. I put a Scala dataset operation in the middle of Java ones

Question


Currently, my main application is built with Java Spring Boot, and this won't change because it's convenient.
@Autowired service beans implement, for example:

  • Enterprise and establishment datasets. The first one is also able to return a list of Enterprise objects that carry a Map of their establishments.
    So the service returns: Dataset<Enterprise>, Dataset<Establishment>, Dataset<Row>
  • Associations: Dataset<Row>
  • Cities: Dataset<Commune> or Dataset<Row>
  • Local authorities: Dataset<Row>

Many use-case functions are calls of this kind:
> What are associations(year=2020)?

My application then forwards to datasetAssociation(2020), which operates on the enterprise and establishment datasets, as well as on the city and local-authority ones, to provide a useful result.

Many people have advised me to take full advantage of Scala's capabilities. With this in mind, I'm considering an operation that involves several datasets:

  • Some made of Row,
  • Some carrying concrete objects.

I have this operation to do, in terms of the datasets reached/involved:
associations.enterprises.establishments.cities.localautorities

Will I be able to write the middle part of that chain (enterprises and establishments) in Scala? This means that (see the sketch after this list):

  1. A Dataset<Row> built with Java code is sent to a Scala function to be completed.

  2. Scala creates a new dataset with Enterprise and Establishment objects.
    a) If the source of an object is written in Scala, I don't have to recreate a new source for it in Java.
    b) Conversely, if the source of an object is written in Java, I don't have to recreate a new source in Scala.
    c) I can use a Scala object returned by this dataset directly on the Java side.

  3. Scala will have to call functions whose implementation stays in Java, passing them the underlying dataset it is building (for example, to complete it with city information).
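
To make the intent concrete, here is a minimal sketch of what that Scala piece could look like. All names are illustrative assumptions (Enterprise, CityService, enrichWithEnterprises, the siren and name columns), not an existing API:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Illustrative model; it could just as well be the existing Java bean.
case class Enterprise(siren: String, name: String)

// A Scala trait compiles to a plain JVM interface, so the implementation of
// this service can stay in Java (point 3).
trait CityService {
  def completeWithCities(df: Dataset[Row]): Dataset[Row]
}

object AssociationEnrichment {

  // Point 1: called from Java with a Dataset<Row> built on the Java side.
  // Point 2: returns a typed dataset, seen as Dataset<Enterprise> from Java.
  def enrichWithEnterprises(spark: SparkSession, associations: Dataset[Row]): Dataset[Enterprise] = {
    import spark.implicits._ // implicit encoders for Scala case classes
    associations.select($"siren", $"name").as[Enterprise]
  }

  // Point 3: Scala calls back into a Java-implemented service,
  // handing it the dataset it is building.
  def enrichWithCities(enterprises: Dataset[Enterprise], cities: CityService): Dataset[Row] =
    cities.completeWithCities(enterprises.toDF())
}
```

Since enrichWithEnterprises lives on a top-level object with no companion class, Scala emits static forwarders, so the Java side could call it as AssociationEnrichment.enrichWithEnterprises(spark, rows) with no extra glue.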

Java calls Scala methods at any time,
and Scala calls Java methods at any time too:

an operation could follow a
Java -> Scala -> Scala -> Java -> Scala -> Java -> Java
path if wished, in terms of the native language of each method called,
because I don't know in advance which parts I will find useful to port to Scala and which not.

If these three points are covered, I will consider that Java and Scala are interoperable both ways and that each can benefit from the other.

But can I achieve this goal (in Spark 2.4.x, or more probably in Spark 3.0.0)?

Summarizing, are Java and Scala interoperable both ways, in a manner that:

  • does not make the source code too clumsy on one side or the other, or worse, duplicated;
  • does not degrade performance strongly (having to recreate a whole dataset, or to convert every object it contains, on one side or the other, would for example be prohibitive)?

Answer 1

Score: 2


As Jasper-M wrote, Scala and Java code are perfectly interoperable:

  • they both compile into .class files that are executed the same way by the JVM;
  • the Spark Java and Scala APIs work together, with a couple of specifics:
    • both use the same Dataset class, so there is no issue there;
    • however, SparkContext and RDD (and all RDD variants) have a Scala API that isn't practical from Java, mainly because the Scala methods take Scala types as input that are not the ones you use in Java. Both have Java wrappers, though (JavaSparkContext, JavaRDD); coding in Java, you have probably seen those wrappers already.

Now, as many have recommended, since Spark is a Scala library first and the Scala language is more powerful than Java (*), writing Spark code in Scala will be much easier. You will also find many more code examples in Scala; it is often difficult to find Java examples for complex Dataset manipulations.

So, I think the two main issues you should take care of are:

  1. (Not Spark-related, but necessary.) Have a project that compiles both languages and allows two-way interoperability. I think sbt provides this out of the box; with Maven you need to use the scala plugin and (from my experience) to put both the Java and the Scala files in the java source folder. Otherwise one language can call the other but not the opposite (Scala can call Java but Java cannot call Scala, or the other way around).
  2. Be careful about the encoder used each time you create a typed Dataset (i.e. a Dataset[YourClass], not a Dataset<Row>). In Java, and for Java model classes, you need to use Encoders.bean(YourClass.class) explicitly. In Scala, by default, Spark finds the encoder implicitly, and those implicit encoders are built for Scala case classes ("Product types") and Scala standard collections, so just be mindful of which encoders are used. For example, if you create a Dataset of YourJavaClass in Scala, you will probably have to pass Encoders.bean(YourJavaClass.class) explicitly for it to work without serialization issues, as sketched below.
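
As an illustration of point 2, a minimal sketch of passing the bean encoder explicitly from Scala might look like this. YourJavaClass stands in for one of your existing Java model classes (declared here in Scala with @BeanProperty only to keep the snippet self-contained), and the column names are assumptions:

```scala
import scala.beans.BeanProperty
import org.apache.spark.sql.{Dataset, Encoders, Row}

// Stand-in for an existing Java model class: no-arg constructor plus getters/setters.
class YourJavaClass {
  @BeanProperty var siren: String = _
  @BeanProperty var name: String = _
}

object BeanEncoderSketch {
  // spark.implicits._ derives encoders for case classes, primitives and standard
  // Scala collections, but not for Java beans, so pass the bean encoder explicitly.
  // The Row columns are assumed to match the bean properties (siren, name).
  def toTyped(rows: Dataset[Row]): Dataset[YourJavaClass] =
    rows.as(Encoders.bean(classOf[YourJavaClass]))
}
```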

One last note: you wrote that you use Java Spring Boot. So:

  • Be aware that Spring's design goes completely against recommended Scala/functional practice, using null and mutable state all over the place. You can still use Spring, but it may feel strange in Scala, and the community will probably not accept it easily.
  • You can call Spark code from a Spring context, but you should not use the Spring context from Spark, especially inside methods distributed by Spark, such as in rdd.map. That would attempt to create a Spring context on each worker, which is very slow and can easily fail (see the sketch below).
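
A sketch of that second point, with ReferentialService and the headcount field as illustrative names: read what you need from the Spring bean on the driver, and capture only plain, serializable values in the closures that Spark distributes.

```scala
import org.apache.spark.sql.Dataset

// Illustrative stand-ins for a Spring-managed service and a model class.
trait ReferentialService { def minimumHeadcount(): Int }
case class EnterpriseRecord(siren: String, headcount: Int)

object SpringAndSparkSketch {
  def keepLargeEnterprises(enterprises: Dataset[EnterpriseRecord],
                           referential: ReferentialService): Dataset[EnterpriseRecord] = {
    // Evaluated once, on the driver, inside the Spring context.
    val threshold = referential.minimumHeadcount()
    // Only the plain Int is captured by the closure shipped to the executors;
    // the Spring bean itself never leaves the driver.
    enterprises.filter(_.headcount >= threshold)
  }
}
```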

(*) About "Scala being more powerful than Java": I don't mean that Scala is better than Java (well, I do think so, but it is a matter of taste :). What I mean is that the Scala language is much more expressive than Java; basically, it does more with less code. The main differences are:

  • implicits, which are heavily used by the Spark API;
  • monads and for-comprehensions;
  • and of course the powerful type system (read about covariant types, for example: a List[Dog] is a subtype of List[Animal] in Scala, but not in Java; a tiny example follows).
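
A tiny example of the covariance point:

```scala
object CovarianceSketch {
  class Animal
  class Dog extends Animal

  // scala.collection.immutable.List is declared List[+A], i.e. covariant in A:
  val dogs: List[Dog] = List(new Dog, new Dog)
  val animals: List[Animal] = dogs // compiles: a List[Dog] is a List[Animal]
  // In Java, a List<Dog> is not a List<Animal>; you would need List<? extends Animal>.
}
```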

Answer 2

Score: 0


Yes, it is possible without performance degradation or overly clumsy extra code.

Scala and Java are almost perfectly interoperable, and moreover the Spark Dataset API is shared between Java and Scala. The Dataset class is exactly the same whether you are using Java or Scala. As you can see in the javadoc and the scaladoc (note that they differ only in layout, not in content), Java and Scala code is perfectly interchangeable. At most, the Scala code will be a bit more succinct, as in the small comparison below.
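
For instance, a small comparison on an untyped Dataset<Row> (the column name year is only an assumption): apart from Scala's type inference, the call is the same in both languages.

```scala
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col

object SharedApiSketch {
  // Scala version:
  def recent(associations: Dataset[Row]): Dataset[Row] =
    associations.filter(col("year").equalTo(2020))

  // Java equivalent, for comparison:
  //   Dataset<Row> recent(Dataset<Row> associations) {
  //     return associations.filter(col("year").equalTo(2020));
  //   }
}
```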
