Data frame: How to round the values in some columns in Databricks PySpark


Question

PySpark (Python) in Databricks.

I have the following dataframe already created in Databricks. The values in some columns should be rounded to the nearest integer, meaning 4.5 rounds to 5 and 2.3 rounds to 2.

df1:

| Name | Number | Location | Quantity | UoM |
| ---- | ------ | -------- | -------- | --- |
| A    | 1.236  | USA      | 10.558   | PC  |
| B    | 2.8988 | France   | 58.12999 | PC  |

The values in the Number and Quantity columns need to be rounded; the other columns should be kept the same.

I expect df1 to become:

| Name | Number | Location | Quantity | UoM |
| ---- | ------ | -------- | -------- | --- |
| A    | 1      | USA      | 11       | PC  |
| B    | 3      | France   | 58       | PC  |

BTW: 2.5 should be rounded to 3, not 2.

Thanks.

Answer 1

Score: 1

You can use the `round` function to achieve the desired result.

```python
from pyspark.sql.functions import round

df = spark.createDataFrame(
    [["A", 1.236, "USA", 10.558, "PC"], ["B", 2.8988, "France", 58.12999, "PC"]],
    ["Name", "Number", "Location", "Quantity", "UoM"],
)
df.withColumn("Number", round("Number")).withColumn("Quantity", round("Quantity")).show()
```

This outputs the columns as floating-point numbers, but you can cast them to integers if required.

```
+----+------+--------+--------+---+
|Name|Number|Location|Quantity|UoM|
+----+------+--------+--------+---+
|   A|   1.0|     USA|    11.0| PC|
|   B|   3.0|  France|    58.0| PC|
+----+------+--------+--------+---+
```
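
Spark's `round` uses half-up rounding (`bround` is the half-even variant), so 2.5 rounds to 3 as the question requires. Below is a minimal sketch of the integer cast mentioned above, assuming the same `df`:

```python
from pyspark.sql.functions import round

# Round to zero decimal places, then cast the double result to an integer type
df_int = (
    df.withColumn("Number", round("Number").cast("int"))
      .withColumn("Quantity", round("Quantity").cast("int"))
)
df_int.show()
```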

Answer 2

Score: 0

Assuming, based on the Databricks documentation, that you are referring to a pandas DataFrame, you can use `apply` on the DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    [[1.1, "a", 2.22222222222], [1.5, "b", 5.5555555555], [1.7, "c", 6.666666]],
    columns=["0", "1", "2"],
)
print(df)
"""
     0  1         2
0  1.1  a  2.222222
1  1.5  b  5.555556
2  1.7  c  6.666666
"""
df[["0", "2"]] = df[["0", "2"]].apply(round).astype("int32")
print(df)
"""
   0  1  2
0  1  a  2
1  2  b  6
2  2  c  7
"""
```

On top of using `.apply` on the specific columns you want to change, you will notice that I also call `.astype` on the result. This is because even though `round` returns an integer when no `ndigits` is supplied, the DataFrame has those columns registered as floats and would cast the values back to float. Converting with `.astype("int32")` turns them back into integer values. You can change the type depending on need as well, such as `int64` for very large values.
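
A quick sketch of the dtype behaviour (assuming the `df` above, before the in-place assignment): the rounded values keep the `float64` dtype until `.astype` is applied. Note also that `round` here follows Python's half-to-even rule (`round(1.5) == 2`, but `round(2.5) == 2`), which differs from the half-up behaviour the question asks for.

```python
rounded = df[["0", "2"]].apply(round)
print(rounded.dtypes)                   # both columns are still float64
print(rounded.astype("int32").dtypes)   # both columns are now int32
```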


Answer 3

Score: 0

I am not sure why the previous posts talked about pandas.

For PySpark, you can create a custom function and then use it:

```python
from pyspark.sql import functions as F

def custom_round(column):
    # Half-up rounding: if the fractional part is at least 0.5, round up to
    # the ceiling, otherwise round down to the floor
    return F.when(
        F.col(column) - F.floor(F.col(column)) >= 0.5, F.ceil(F.col(column))
    ).otherwise(F.floor(F.col(column)))

df = df.withColumn("RoundedQuantity", custom_round("Quantity"))
```
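
Because `F.floor` and `F.ceil` return long-typed columns, the result is already an integer column, with 0.5 fractions rounded up as the question asks. A quick usage sketch, reusing the `df` from Answer 1:

```python
df = (
    df.withColumn("Number", custom_round("Number"))
      .withColumn("Quantity", custom_round("Quantity"))
)
df.show()
# Number becomes 1 and 3; Quantity becomes 11 and 58
```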
