使用Spark DataFrame找到列之间的最小值。

huangapple go评论85阅读模式
英文:

Find the least value in between the columns using Spark DataFrame

问题

我有一个如下所示的数据框,并需要找到除零以外的最小值,并将其添加到一个新列中,列名为'Least'。

Column1 Column2 Column3
100.0 120.0 150.0
200.0 0.0 0.0
0.0 20.0 100.0

我尝试使用least()函数,但没有得到预期的输出。

预期的输出应该如下所示。

Column1 Column2 Column3 Least
100.0 120.0 150.0 100.0
200.0 0.0 0.0 200.0
0.0 20.0 100.0 20.0
英文:

I have a dataframe like below and need to find the least value except zeros and add it in a new column as 'Least'.

Column1 Column2 Column3
100.0 120.0 150.0
200.0 0.0 0.0
0.0 20.0 100.0

I tried with least() function but I didn't get the expected output.

expected output would be like below.

Column1 Column2 Column3 Least
100.0 120.0 150.0 100.0
200.0 0.0 0.0 200.0
0.0 20.0 100.0 20.0

答案1

得分: 0

你可以像这样获取最小值:

import sparkSession.implicits._

val df = List(
  (100.0, 120.0, 150.0),
  (200.0, 0.0, 0.0),
  (0.0, 20.0, 100.0)
).toDF("column1", "column2", "column3")

val columns = df.columns.toSeq

val leastRow = least(
  columns.map(col): _*
).alias("min")

df.select($"*", leastRow).show

尝试改进**leastRow**方法以忽略零值考虑将零值替换为在您的用例中可能的最大浮点值通常为**Double.PositiveInfinity**
不要犹豫发布您的工作并确保您将获得帮助
祝你好运
英文:

You can do something like this to get the least values

import sparkSession.implicits._

      val df = List(
        (100.0, 120.0, 150.0),
        (200.0, 0.0, 0.0),
        (0.0, 20.0, 100.0)
      ).toDF("column1", "column2", "column3")
      
      val columns = df.columns.toSeq
      
      
      val leastRow = least(
        columns map col: _*
      ).alias("min")

      df.select($"*", leastRow).show

Try to improve the leastRow method to ignore the zero values. think about replacing the zero values with the maximum possible float value in your use case, Double.PositiveInfinity in general ect..
Do not hesitate to post your work and be sure that you'll get help !
Good luck.

答案2

得分: 0

以下是翻译好的部分:

我找到了两种解决方案:

  1. 在计算最小值时,您可以将0交换为无穷大以跳过零值
  2. 您可以使用用户定义函数(UDF)来跳过零值

选项1)

import org.apache.spark.sql.functions._

val data = Seq((100.0, 120.0, 150.0), (200.0, 0.0, 0.0), (0.0, 20.0, 100.0))
val columns = Seq("Column1", "Column2", "Column3")
val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)

// 计算最小值,将所有零值替换为正无穷大
val leastValueExpr = least(
  df.columns.map(colName =>
    when(col(colName) =!= 0.0, col(colName)).otherwise(Double.PositiveInfinity)
  ): _*
)
val resultDF = df.withColumn("Least", leastValueExpr)

resultDF.show()

选项2

// 定义一个UDF来计算最小值,排除零值
val leastNonZero: UserDefinedFunction =
  udf((cols: Seq[Double]) => cols.filter(_ != 0.0).min)

// 应用UDF来计算最小值,排除零值
val dfWithLeast: DataFrame =
  df.withColumn("Least", leastNonZero(array(df.columns.map(col): _*)))

// 显示结果
dfWithLeast.show()

输出:

+-------+-------+-------+-----+
|Column1|Column2|Column3|Least|
+-------+-------+-------+-----+
|  100.0|  120.0|  150.0|100.0|
|  200.0|    0.0|    0.0|200.0|
|    0.0|   20.0|  100.0| 20.0|
+-------+-------+-------+-----+
英文:

I found two solutions:

  1. You can swap 0 to inf when calculating least to skip zeroes
  2. You can use udf which will skip zeroes

Option 1)

import org.apache.spark.sql.functions._

val data = Seq((100.0, 120.0, 150.0), (200.0, 0.0, 0.0), (0.0, 20.0, 100.0))
val columns = Seq("Column1", "Column2", "Column3")
val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)

// Calculate the least value, swaping all zeroes to inf
val leastValueExpr = least(
  df.columns.map(colName =>
    when(col(colName) =!= 0.0, col(colName)).otherwise(Double.PositiveInfinity)
  ): _*
)
val resultDF = df.withColumn("Least", leastValueExpr)

resultDF.show()

Option 2

// Define a UDF to calculate the least value, excluding zero values
val leastNonZero: UserDefinedFunction =
  udf((cols: Seq[Double]) => cols.filter(_ != 0.0).min)

// Apply the UDF to calculate the least value, excluding zero values
val dfWithLeast: DataFrame =
  df.withColumn("Least", leastNonZero(array(df.columns.map(col): _*)))

// Show the result
dfWithLeast.show()

Output:

+-------+-------+-------+-----+
|Column1|Column2|Column3|Least|
+-------+-------+-------+-----+
|  100.0|  120.0|  150.0|100.0|
|  200.0|    0.0|    0.0|200.0|
|    0.0|   20.0|  100.0| 20.0|
+-------+-------+-------+-----+

huangapple
  • 本文由 发表于 2023年7月27日 21:20:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/76780175.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定