Java Spark withColumn algebra by example

Question

Spark (spark-core_2.13:3.3.2) and Java 11 here (very important, I need Java solutions please, not Scala!).

I am reading an Excel spreadsheet into a Dataset like so:

public void runSpark(MyDataJob dataJob, JavaSparkContext sparkContext) {

    SparkSession session = SparkSession.builder().sparkContext(sparkContext.sc()).getOrCreate();

    Dataset<Row> dataset = session.read()
        .format("com.crealytics.spark.excel")
        .option("dataAddress", "'My Sheet'!B3:C35")
        .option("header", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("setErrorCellsToFallbackValues", "true")
        .option("usePlainNumberFormat", "false")
        .option("inferSchema", "true")
        .option("addColorColumns", "true")
        .option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
        .option("maxRowsInMemory", 100)
        .option("maxByteArraySize", 2147483647)
        .option("tempFileThreshold", 10000000)
        .option("excerptSize", 10)
        .load(dataJob.getFileName());

    dataset.withColumn("CountDiff", ???);

}

The spreadsheet has 2 columns in it, NumFizz and NumBuzz, and hence, I'm guessing the Dataset has these columns as well. I need to add a new column that is the difference of these values in each row, meaning if a row's NumFizz value is 17, and its NumBuzz is 10, then its value in the new column should be 7. Unfortunately, since literally all the withColumn examples appear to be in Scala, I can't figure out how to do this in Java. I am also open to using Spark SQL if there's a simple solution using that as well. I just need a new CountDiff column added to my Dataset that has the difference of these two columns.

Can anyone nudge me in the right direction?

I tried to import the col functions and pass them in as args but the lack of viable Java examples is blocking me from making headway.

Answer 1

Score: 0

How about this:

// withColumn returns a new Dataset, so keep the result
dataset = dataset.withColumn("CountDiff", dataset.col("NumFizz").minus(dataset.col("NumBuzz")));

The Column.minus docs show examples of both Scala and Java usage: Scala can use the - operator directly (it is an ordinary method on Column), while Java needs an explicit method call:

// Scala: The following selects the difference between people's height and their weight.
people.select( people("height") - people("weight") )

// Java:
people.select( people.col("height").minus(people.col("weight")) );
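
To see the pieces together, here is a minimal self-contained sketch of the same idea. The class name CountDiffExample, the helper addCountDiff, the local[1] session, and the tiny in-memory dataset with integer columns are all assumptions made for illustration; with the Excel reader and inferSchema the columns may well come back as doubles instead.

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CountDiffExample {

    // Returns a new Dataset with CountDiff = NumFizz - NumBuzz.
    // Datasets are immutable, so the input itself is left unchanged.
    static Dataset<Row> addCountDiff(Dataset<Row> dataset) {
        return dataset.withColumn("CountDiff", col("NumFizz").minus(col("NumBuzz")));
    }

    public static void main(String[] args) {
        // Local session for illustration only; a real job would reuse its own session.
        SparkSession session = SparkSession.builder()
            .master("local[1]")
            .appName("CountDiffExample")
            .getOrCreate();

        // Hypothetical stand-in for the spreadsheet (integer columns assumed).
        StructType schema = new StructType()
            .add("NumFizz", DataTypes.IntegerType)
            .add("NumBuzz", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(RowFactory.create(17, 10), RowFactory.create(4, 9));
        Dataset<Row> dataset = session.createDataFrame(rows, schema);

        // Print the original columns plus the new CountDiff column.
        addCountDiff(dataset).show();

        session.stop();
    }
}
```

The static import of functions.col is just a shorter way to refer to a column by name; dataset.col("NumFizz") works equally well and ties the column to that specific Dataset.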

Alternatively, use SQL

You can also use SQL to express all the transformations you need:

dataset.createOrReplaceTempView("my_sheet");

// Java 11 has no multi-line string literals, so the query is concatenated
session.sql(
    "SELECT *, "
    + "NumFizz - NumBuzz AS CountDiff "
    // ... add other calculations you may need to the select list
    + "FROM my_sheet"
).show();

When working with SQL, you may find the docs for the SQL functions available in Spark useful.
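
If registering a temp view feels like overkill for a single calculation, the same SQL expression can probably be used inline instead. The sketch below is one way to do that with functions.expr and Dataset.selectExpr; the class name CountDiffSql, its two helper methods, and the one-row sample data are assumptions made for illustration, not part of the original code.

```java
import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CountDiffSql {

    // Adds CountDiff using a SQL expression string instead of Column method calls.
    static Dataset<Row> withExpr(Dataset<Row> dataset) {
        return dataset.withColumn("CountDiff", expr("NumFizz - NumBuzz"));
    }

    // Equivalent projection: keep every existing column and append the computed one.
    static Dataset<Row> withSelectExpr(Dataset<Row> dataset) {
        return dataset.selectExpr("*", "NumFizz - NumBuzz AS CountDiff");
    }

    public static void main(String[] args) {
        // Local session for illustration only.
        SparkSession session = SparkSession.builder()
            .master("local[1]")
            .appName("CountDiffSql")
            .getOrCreate();

        // Hypothetical one-row stand-in for the spreadsheet data.
        Dataset<Row> dataset = session.sql("SELECT 17 AS NumFizz, 10 AS NumBuzz");

        withExpr(dataset).show();
        withSelectExpr(dataset).show();

        session.stop();
    }
}
```

Both helpers should produce the same columns; which one reads better mostly depends on whether you are adding one column or reshaping the whole projection.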

huangapple
  • Posted on 2023-02-24 04:15:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75549888.html