Delta encoders: Using a Java library in Scala
Question
I have to compare, using Spark-based big data analysis, data sets (text files) that are very similar (>98%) but very large in size. After doing some research, I found that the most efficient way could be to use delta encoders. With these I can keep one reference text and store the others as delta increments. However, I use Scala, which does not have support for delta encoders, and I am not at all conversant with Java. But since Scala is interoperable with Java, I know it is possible to get a Java library to work in Scala.
I found promising implementations in xdelta, vcdiff-java and bsdiff. With a bit more searching, I found the most interesting library, dez. The link also gives benchmarks in which it seems to perform very well, the code is free to use, and it looks lightweight.
At this point, I am stuck on using this library in Scala (via sbt). I would appreciate any suggestions or references for navigating this barrier, whether specific to this issue (delta encoders), to the library, or to working with Java APIs in general from Scala. Specifically, my questions are:

- Is there a Scala library for delta encoders that I can use directly? (If not)
- Can I place the class files/notzed.dez.jar in the project and let sbt provide the APIs in the Scala code?
I am kind of stuck in this quagmire and any way out would be greatly appreciated.
Answer 1

Score: 1
There are several details to take into account. There is no problem in using Java libraries directly from Scala, either as managed dependencies in sbt or as unmanaged dependencies (see https://www.scala-sbt.org/1.x/docs/Library-Dependencies.html: "Dependencies in lib go on all the classpaths (for compile, test, run, and console)"). You can create a fat jar with your code and dependencies using https://github.com/sbt/sbt-native-packager and distribute it with spark-submit.
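As a rough sketch of the unmanaged-dependency route (the jar name comes from the question; the project name, Scala version and Spark version below are assumptions, not requirements):

```scala
// build.sbt -- minimal sketch; adjust names and versions to your environment
name := "delta-compare"
scalaVersion := "2.12.18"

// Any jar dropped into lib/, e.g. lib/notzed.dez.jar, is an unmanaged
// dependency and is placed on the compile, test, run and console classpaths.

// Spark is marked "provided" because spark-submit supplies it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.2" % "provided"
```

At submit time the library jar can also be shipped to the executors explicitly with something like `spark-submit --jars lib/notzed.dez.jar ...`.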
The point here is to use these frameworks in Spark. To take advantage of Spark you would need to split your files into blocks so that the algorithm for a single file can be distributed across the cluster. Or, if your files are compressed and each of them sits in one HDFS partition, you would need to adjust the size of the HDFS blocks, etc.
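The simpler per-file variant of that idea might look like the sketch below, which broadcasts one reference file and encodes every other file against it. Here `encodeDelta` is a placeholder for whatever delta library ends up being used (it is not the dez API), and the paths are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object DeltaCompare {
  // Placeholder for the chosen delta encoder (dez, xdelta bindings, ...);
  // this is not a real library call.
  def encodeDelta(reference: Array[Byte], target: Array[Byte]): Array[Byte] = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("delta-compare").getOrCreate()
    val sc = spark.sparkContext

    // Ship the reference text to every executor once.
    val reference = sc.broadcast(sc.binaryFiles("/data/reference.txt").first()._2.toArray())

    // One delta per similar file; each file is encoded on an executor.
    val deltas = sc.binaryFiles("/data/similar/*")
      .mapValues(stream => encodeDelta(reference.value, stream.toArray()))

    deltas.saveAsObjectFile("/data/deltas")
    spark.stop()
  }
}
```

Note that this only parallelizes across files; distributing the encoding of one very large file would additionally require chunking it into blocks, as described above.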
You can also use C modules, include them in your project, and call them via JNI, much as deep learning frameworks call into native linear algebra routines. So, in essence, there is a lot to discuss about how to implement these delta algorithms in Spark.
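On the Scala side a JNI binding is only a native declaration plus a library load; the names `dezjni` and `nativeDelta` below are purely illustrative, and the matching C implementation would have to be compiled for, and made visible on `java.library.path` of, every worker node:

```scala
object NativeDelta {
  // Expects libdezjni.so (or dezjni.dll) to be resolvable on every executor.
  System.loadLibrary("dezjni")

  // Declared here, implemented on the native (C) side.
  @native def nativeDelta(reference: Array[Byte], target: Array[Byte]): Array[Byte]
}
```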