DataFrame: How to round the values in some columns in Databricks PySpark

Question
I have the following DataFrame already created in Databricks (PySpark).
The values in some columns should be rounded to integers only, which means 4.5 rounds to 5 and 2.3 rounds to 2.

df1:

| Name | Number | Location | Quantity | UoM |
| ---- | ------ | -------- | -------- | --- |
| A | 1.236 | USA | 10.558 | PC |
| B | 2.8988 | France | 58.12999 | PC |

The values in the Number and Quantity columns need to be rounded; the other columns should stay the same.

I expect df1 to be changed to the following:

| Name | Number | Location | Quantity | UoM |
| ---- | ------ | -------- | -------- | --- |
| A | 1 | USA | 11 | PC |
| B | 3 | France | 58 | PC |

By the way: 2.5 should be rounded to 3, not 2.

Thanks.
Answer 1

Score: 1

You can use the round function to achieve the desired result.
from pyspark.sql.functions import round

df = spark.createDataFrame(
    [["A", 1.236, "USA", 10.558, "PC"], ["B", 2.8988, "France", 58.12999, "PC"]],
    ["Name", "Number", "Location", "Quantity", "UoM"],
)

df.withColumn("Number", round("Number")).withColumn("Quantity", round("Quantity")).show()
This outputs the columns as floating-point numbers, but you can cast them to integers if required.
+----+------+--------+--------+---+
|Name|Number|Location|Quantity|UoM|
+----+------+--------+--------+---+
|   A|   1.0|     USA|    11.0| PC|
|   B|   3.0|  France|    58.0| PC|
+----+------+--------+--------+---+
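For reference, a minimal sketch of that cast (assuming the df built in the snippet above; "int" is the Spark SQL name for IntegerType):

from pyspark.sql.functions import round

# Round first, then cast the resulting double columns to integers.
df_int = (
    df.withColumn("Number", round("Number").cast("int"))
      .withColumn("Quantity", round("Quantity").cast("int"))
)
df_int.show()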
Answer 2

Score: 0

Assuming, based on the Databricks documentation, that you are referring to a pandas DataFrame, you can use apply on the DataFrame.
import pandas as pd

df = pd.DataFrame(
    [[1.1, "a", 2.22222222222], [1.5, "b", 5.5555555555], [1.7, "c", 6.666666]],
    columns=["0", "1", "2"],
)
print(df)
"""
     0  1         2
0  1.1  a  2.222222
1  1.5  b  5.555556
2  1.7  c  6.666666
"""

df[["0", "2"]] = df[["0", "2"]].apply(round).astype("int32")
print(df)
"""
   0  1  2
0  1  a  2
1  2  b  6
2  2  c  7
"""
On top of using .apply on the specific columns you want to change, you will notice that I also call .astype on the result. This is because, even though round returns an integer when no ndigits is supplied, the DataFrame has those columns registered as float and would keep the values cast as float. Converting them with .astype("int32") turns them back into integer values. You can also change the type depending on your needs, such as int64 for very large values.
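Applied to the question's columns, the same idea might look like the sketch below (df1 is rebuilt here as a hypothetical pandas DataFrame from the data in the question):

import pandas as pd

# Hypothetical reconstruction of the question's df1 as a pandas DataFrame.
df1 = pd.DataFrame(
    [["A", 1.236, "USA", 10.558, "PC"], ["B", 2.8988, "France", 58.12999, "PC"]],
    columns=["Name", "Number", "Location", "Quantity", "UoM"],
)

# Round only the numeric columns and cast them back to integers.
df1[["Number", "Quantity"]] = df1[["Number", "Quantity"]].apply(round).astype("int64")
print(df1)  # Number becomes 1 and 3, Quantity becomes 11 and 58.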
Answer 3

Score: 0

I am not sure why the previous post talks about pandas.

For PySpark, you can create a custom function and then use it:
import pyspark.sql.functions as F

def custom_round(column):
    # Round half up: fractional parts >= 0.5 go up to the ceiling, the rest down to the floor.
    return F.when(F.col(column) - F.floor(F.col(column)) >= 0.5, F.ceil(F.col(column))).otherwise(F.floor(F.col(column)))

df = df.withColumn("RoundedQuantity", custom_round("Quantity"))
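Since the question wants the original columns replaced rather than a new one added, the same helper could also be applied in place (a sketch, assuming df holds the question's df1):

df = (
    df.withColumn("Number", custom_round("Number"))
      .withColumn("Quantity", custom_round("Quantity"))
)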