Data frame: How to round the values in some columns in Databricks PySpark


Question

PySpark (Python) in Databricks.

I have the following dataframe already created in Databricks. The values in some columns should be rounded to the nearest integer, meaning 4.5 rounds to 5 and 2.3 rounds to 2.

df1:

| Name | Number | Location | Quantity | UoM |
| ---- | ------ | -------- | -------- | --- |
| A    | 1.236  | USA      | 10.558   | PC  |
| B    | 2.8988 | France   | 58.12999 | PC  |

The values in the Number and Quantity columns need to be rounded; the other columns should be kept the same.

I expect df1 to become:

| Name | Number | Location | Quantity | UoM |
| ---- | ------ | -------- | -------- | --- |
| A    | 1      | USA      | 11       | PC  |
| B    | 3      | France   | 58       | PC  |

BTW: 2.5 should be rounded to 3, not 2.

Thanks.

Answer 1

Score: 1

You can use the `round` function to achieve the desired result.

```python
from pyspark.sql.functions import round

df = spark.createDataFrame(
    [["A", 1.236, "USA", 10.558, "PC"], ["B", 2.8988, "France", 58.12999, "PC"]],
    ["Name", "Number", "Location", "Quantity", "UoM"],
)
df.withColumn("Number", round("Number")).withColumn("Quantity", round("Quantity")).show()
```

This outputs the columns as floating-point numbers, but you can cast them to integers if required.

```
+----+------+--------+--------+---+
|Name|Number|Location|Quantity|UoM|
+----+------+--------+--------+---+
|   A|   1.0|     USA|    11.0| PC|
|   B|   3.0|  France|    58.0| PC|
+----+------+--------+--------+---+
```
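
Spark's `round` uses half-up rounding (`bround` is the half-even variant), so 2.5 rounds to 3 as the question requires. Below is a minimal sketch of the integer cast mentioned above, assuming the same `df`:

```python
from pyspark.sql.functions import round

# Round to zero decimal places, then cast the double result to an integer type
df_int = (
    df.withColumn("Number", round("Number").cast("int"))
      .withColumn("Quantity", round("Quantity").cast("int"))
)
df_int.show()
```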

Answer 2

Score: 0

Assuming, based on the Databricks documentation, that you are referring to a pandas DataFrame, you can use `apply` on the DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    [[1.1, "a", 2.22222222222], [1.5, "b", 5.5555555555], [1.7, "c", 6.666666]],
    columns=["0", "1", "2"],
)
print(df)
"""
     0  1         2
0  1.1  a  2.222222
1  1.5  b  5.555556
2  1.7  c  6.666666
"""
df[["0", "2"]] = df[["0", "2"]].apply(round).astype("int32")
print(df)
"""
   0  1  2
0  1  a  2
1  2  b  6
2  2  c  7
"""
```

On top of using `.apply` on the specific columns you want to change, you will notice that I also call `.astype` on the result. This is because even though `round` returns an integer when no `ndigits` is supplied, the DataFrame has those columns registered as floats and would cast the values back to float. Converting with `.astype("int32")` turns them back into integer values. You can change the type depending on need as well, such as `int64` for very large values.
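
A quick sketch of the dtype behaviour (assuming the `df` above, before the in-place assignment): the rounded values keep the `float64` dtype until `.astype` is applied. Note also that `round` here follows Python's half-to-even rule (`round(1.5) == 2`, but `round(2.5) == 2`), which differs from the half-up behaviour the question asks for.

```python
rounded = df[["0", "2"]].apply(round)
print(rounded.dtypes)                   # both columns are still float64
print(rounded.astype("int32").dtypes)   # both columns are now int32
```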


Answer 3

Score: 0

I am not sure why the previous posts talked about pandas.

For PySpark, you can create a custom function and then use it:

```python
from pyspark.sql import functions as F

def custom_round(column):
    # Half-up rounding: if the fractional part is at least 0.5, round up to
    # the ceiling, otherwise round down to the floor
    return F.when(
        F.col(column) - F.floor(F.col(column)) >= 0.5, F.ceil(F.col(column))
    ).otherwise(F.floor(F.col(column)))

df = df.withColumn("RoundedQuantity", custom_round("Quantity"))
```
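
Because `F.floor` and `F.ceil` return long-typed columns, the result is already an integer column, with 0.5 fractions rounded up as the question asks. A quick usage sketch, reusing the `df` from Answer 1:

```python
df = (
    df.withColumn("Number", custom_round("Number"))
      .withColumn("Quantity", custom_round("Quantity"))
)
df.show()
# Number becomes 1 and 3; Quantity becomes 11 and 58
```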
