Data frame: How to round the values in some columns in Databricks PySpark

Question

PySpark (Python) in Databricks.

I have the following DataFrame already created in Databricks. The values in some columns should be rounded to the nearest integer, meaning 4.5 rounds to 5 and 2.3 rounds to 2.

df1:

| Name | Number | Location | Quantity | UoM |
| ---- | ------ | -------- | -------- | --- |
| A    | 1.236  | USA      | 10.558   | PC  |
| B    | 2.8988 | France   | 58.12999 | PC  |

The values in the Number and Quantity columns need to be rounded; the other columns should be kept the same.

I expect df1 to be changed to the following:

| Name | Number | Location | Quantity | UoM |
| ---- | ------ | -------- | -------- | --- |
| A    | 1      | USA      | 11       | PC  |
| B    | 3      | France   | 58       | PC  |

BTW: 2.5 should be rounded to 3, not 2.

Thanks.

Answer 1

Score: 1

You can use the round function to achieve the desired result:

from pyspark.sql.functions import round

df = spark.createDataFrame(
    [["A", 1.236, "USA", 10.558, "PC"], ["B", 2.8988, "France", 58.12999, "PC"]],
    ["Name", "Number", "Location", "Quantity", "UoM"]
)

df.withColumn("Number", round("Number")).withColumn("Quantity", round("Quantity")).show()

Spark's round uses HALF_UP rounding, so 2.5 rounds to 3 as the question requires. This outputs the columns as floating-point values, but you can cast them to integer if required.

+----+------+--------+--------+---+
|Name|Number|Location|Quantity|UoM|
+----+------+--------+--------+---+
|   A|   1.0|     USA|    11.0| PC|
|   B|   3.0|  France|    58.0| PC|
+----+------+--------+--------+---+
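If integer columns are needed, here is a minimal sketch of the cast (reusing the df above; df_int is just an illustrative name):

# Cast the rounded columns from double to integer
df_int = (
    df.withColumn("Number", round("Number").cast("int"))
      .withColumn("Quantity", round("Quantity").cast("int"))
)
df_int.show()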

Answer 2

Score: 0

Assuming, based on the Databricks documentation, that you are referring to a pandas DataFrame, you can use apply on the DataFrame.

import pandas as pd

df = pd.DataFrame([[1.1, "a", 2.22222222222], [1.5, "b", 5.5555555555], [1.7, "c", 6.666666]], columns=["0", "1", "2"])
print(df)

"""
     0  1         2
0  1.1  a  2.222222
1  1.5  b  5.555556
2  1.7  c  6.666666
"""

df[["0", "2"]] = df[["0", "2"]].apply(round).astype("int32")
print(df)

"""
   0  1  2
0  1  a  2
1  2  b  6
2  2  c  7
"""

On top of using .apply on the specific columns you want to change, notice that I also call .astype on the result. Even though round with no ndigits conceptually returns an integer, the DataFrame keeps those columns registered as float, so the rounded values come back as floats. Converting them with .astype("int32") turns them back into integer values. You can change the type depending on need, such as int64 for very large values.
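Applied to the question's data, the same pattern might look like this (a sketch; df1 below is reconstructed from the question's table):

import pandas as pd

# Reconstructed from the question's df1
df1 = pd.DataFrame(
    [["A", 1.236, "USA", 10.558, "PC"], ["B", 2.8988, "France", 58.12999, "PC"]],
    columns=["Name", "Number", "Location", "Quantity", "UoM"],
)

# Round only the numeric columns, then cast back to integer
df1[["Number", "Quantity"]] = df1[["Number", "Quantity"]].apply(round).astype("int32")
print(df1)

#   Name  Number Location  Quantity UoM
# 0    A       1      USA        11  PC
# 1    B       3   France        58  PC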

Answer 3

Score: 0

I am not sure why the previous post talked about pandas.

For PySpark, you can create a custom function and then use it:

from pyspark.sql import functions as F

def custom_round(column):
    # Round half up: take the ceiling when the fractional part is >= 0.5, the floor otherwise
    return F.when(
        F.col(column) - F.floor(F.col(column)) >= 0.5, F.ceil(F.col(column))
    ).otherwise(F.floor(F.col(column)))

df = df.withColumn("RoundedQuantity", custom_round("Quantity"))
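A minimal usage sketch for rounding both columns in place, assuming a DataFrame shaped like the question's df1 and the custom_round defined above:

df = (
    df.withColumn("Number", custom_round("Number"))
      .withColumn("Quantity", custom_round("Quantity"))
)
df.show()

Since F.floor and F.ceil already return integral (bigint) columns, no extra cast is needed here.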
