February 24, 2023, 04:15:34

Java Spark withColumn algebra by example

Question

Spark (spark-core_2.13:3.3.2) and Java 11 here (very important, I need Java solutions please, not Scala!).

I am reading an Excel spreadsheet into a Dataset like so:

public void runSpark(MyDataJob dataJob, JavaSparkContext sparkContext) {

    SparkSession session = SparkSession.builder().sparkContext(sparkContext.sc()).getOrCreate();

    Dataset<Row> dataset = session.read()
        .format("com.crealytics.spark.excel")
        .option("dataAddress", "'My Sheet'!B3:C35")
        .option("header", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("setErrorCellsToFallbackValues", "true")
        .option("usePlainNumberFormat", "false")
        .option("inferSchema", "true")
        .option("addColorColumns", "true")
        .option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
        .option("maxRowsInMemory", 100)
        .option("maxByteArraySize", 2147483647)
        .option("tempFileThreshold", 10000000)
        .option("excerptSize", 10)
        .load(dataJob.getFileName());

    dataset.withColumn("CountDiff", ???);

}

The spreadsheet has 2 columns in it, NumFizz and NumBuzz, and hence, I'm guessing the Dataset has these columns as well. I need to add a new column that is the difference of these values in each row, meaning if a row's NumFizz value is 17, and its NumBuzz is 10, then its value in the new column should be 7. Unfortunately, since literally all the withColumn examples appear to be in Scala, I can't figure out how to do this in Java. I am also open to using Spark SQL if there's a simple solution using that as well. I just need a new CountDiff column added to my Dataset that has the difference of these two columns.

Can anyone nudge me in the right direction?

I tried to import the col functions and pass them in as args but the lack of viable Java examples is blocking me from making headway.

Answer 1

Score: 0

How about this:

// withColumn returns a new Dataset, so keep the result
dataset = dataset.withColumn("CountDiff", dataset.col("NumFizz").minus(dataset.col("NumBuzz")));

The Column.minus docs show both Scala and Java usage: Scala lets you write the - operator directly, while Java needs an explicit method call:

// Scala: The following selects the difference between people's height and their weight.
people.select( people("height") - people("weight") )

// Java:
people.select( people.col("height").minus(people.col("weight")) );
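To try the Java form without the Excel file, here is a minimal sketch that builds a two-row Dataset in memory and adds the CountDiff column. The class name CountDiffExample, the helper addCountDiff, the local[1] master, and the integer column types are assumptions made for illustration (the spreadsheet reader may infer a different numeric type), and the sketch assumes the spark-sql artifact is on the classpath alongside spark-core.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

// Hypothetical class name, used only for this illustration.
public class CountDiffExample {

    // Returns a new Dataset with CountDiff = NumFizz - NumBuzz appended.
    // The input Dataset is not modified; Datasets are immutable.
    static Dataset<Row> addCountDiff(Dataset<Row> dataset) {
        return dataset.withColumn("CountDiff", col("NumFizz").minus(col("NumBuzz")));
    }

    public static void main(String[] args) {
        // Local single-threaded session, assumed here so the sketch runs standalone.
        SparkSession session = SparkSession.builder()
            .appName("CountDiffExample")
            .master("local[1]")
            .getOrCreate();

        // Stand-in for the spreadsheet: two integer columns with the same names.
        StructType schema = new StructType()
            .add("NumFizz", DataTypes.IntegerType)
            .add("NumBuzz", DataTypes.IntegerType);

        List<Row> rows = Arrays.asList(
            RowFactory.create(17, 10),
            RowFactory.create(4, 9));

        Dataset<Row> dataset = session.createDataFrame(rows, schema);

        addCountDiff(dataset).show();

        session.stop();
    }
}
```

The first row mirrors the example from the question, so its CountDiff should be 7. The statically imported col function and dataset.col are interchangeable here because only one Dataset is involved.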

Alternatively, use SQL

You can also use SQL to express all the transformations you need:

dataset.createOrReplaceTempView("my_sheet");

// Java 11 has no multi-line string literals, so the query is concatenated
session.sql(
    "SELECT *, "
    + "NumFizz - NumBuzz AS CountDiff "
    // ... other calculations you may need
    + "FROM my_sheet"
).show();

When working with SQL, you may find the docs for the SQL functions available in Spark useful.
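If registering a temp view feels heavy for a single calculation, SQL expressions can also be passed straight to the Dataset API. The sketch below is not part of the original answer; it uses functions.expr and Dataset.selectExpr, and the class name CountDiffSqlExpr, the helper names, and the in-memory sample data are assumptions made for illustration. It also assumes spark-sql is on the classpath.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.expr;

// Hypothetical class name, used only for this illustration.
public class CountDiffSqlExpr {

    // Adds CountDiff by parsing a SQL expression string into a Column.
    static Dataset<Row> withExpr(Dataset<Row> dataset) {
        return dataset.withColumn("CountDiff", expr("NumFizz - NumBuzz"));
    }

    // Same result via selectExpr: keep every column and append the aliased difference.
    static Dataset<Row> withSelectExpr(Dataset<Row> dataset) {
        return dataset.selectExpr("*", "NumFizz - NumBuzz AS CountDiff");
    }

    public static void main(String[] args) {
        // Local single-threaded session, assumed here so the sketch runs standalone.
        SparkSession session = SparkSession.builder()
            .appName("CountDiffSqlExpr")
            .master("local[1]")
            .getOrCreate();

        // Stand-in for the spreadsheet data.
        StructType schema = new StructType()
            .add("NumFizz", DataTypes.IntegerType)
            .add("NumBuzz", DataTypes.IntegerType);

        List<Row> rows = Arrays.asList(
            RowFactory.create(17, 10),
            RowFactory.create(4, 9));

        Dataset<Row> dataset = session.createDataFrame(rows, schema);

        withExpr(dataset).show();
        withSelectExpr(dataset).show();

        session.stop();
    }
}
```

Both helpers return a new Dataset with the same three columns; which one reads better is mostly a matter of taste.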
