Is there an alternative in PySpark to countAll and countAllDistinct in Azure Data Flow?

Question

In the Azure Data Flow Aggregate transformation I have:

count --> counts values, nulls excluded
countAll --> counts values, nulls included

and similarly countDistinct and countAllDistinct.

But in PySpark there are only count() and countDistinct(), so how can countAll() and countAllDistinct() be achieved in PySpark?
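To make the four counting semantics concrete before the answers, here is a minimal plain-Python sketch, with a list standing in for a single column and None standing in for SQL NULL (the sample values are made up for illustration):

```python
# Plain-Python sketch of the four counting semantics; None stands in
# for SQL NULL and the sample values are hypothetical.
values = [10, None, 20, 10, None]

count_ = sum(1 for v in values if v is not None)            # nulls excluded -> 3
count_all = len(values)                                      # nulls included -> 5
count_distinct = len({v for v in values if v is not None})   # distinct non-null -> 2
count_all_distinct = len(set(values))                        # None counts once -> 3

print(count_, count_all, count_distinct, count_all_distinct)  # -> 3 5 2 3
```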

Answer 1

Score: 0

As you know, there are no built-in countAll and countAllDistinct functions; you can work around this as follows.

count() on a column only counts non-null values, so to count all rows in a PySpark DataFrame including nulls, use count("*"), which never skips rows.

If you have a DataFrame in PySpark:

from pyspark.sql.functions import count, col, countDistinct

# count all rows excluding nulls (count on a column skips NULLs)
df.select(count(col("column name"))).show()

# count all rows including nulls (count("*") never skips rows)
df.select(count("*")).show()

# count all distinct values excluding null
df.select(countDistinct(col("column name"))).show()

# count all distinct values including null (distinct() keeps NULL as one value)
print(df.select("column name").distinct().count())

Execution and OUTPUT: (screenshot of the results omitted)

If you register the DataFrame as a temporary view (e.g. df.createOrReplaceTempView("sampleView")), the same counts can be written in Spark SQL:

-- count all rows, including nulls
SELECT COUNT(*) AS total_count FROM sampleView;

-- count all distinct rows, including null
SELECT COUNT(*) AS distinct_total_count FROM (SELECT DISTINCT value FROM sampleView);
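The NULL behavior these queries rely on can be checked with a quick sketch against SQLite from the Python standard library; COUNT(*), COUNT(column), and DISTINCT treat NULLs the same way there as in Spark SQL. The table and sample rows below are made up to mirror the hypothetical sampleView:

```python
import sqlite3

# In-memory SQLite check of the COUNT semantics above; the table and
# sample rows mirror the hypothetical sampleView.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sampleView (value INTEGER)")
con.executemany("INSERT INTO sampleView VALUES (?)",
                [(10,), (None,), (20,), (10,), (None,)])

total = con.execute("SELECT COUNT(*) FROM sampleView").fetchone()[0]
non_null = con.execute("SELECT COUNT(value) FROM sampleView").fetchone()[0]
distinct_total = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT value FROM sampleView)"
).fetchone()[0]

print(total, non_null, distinct_total)  # -> 5 3 3 (NULL counted once by DISTINCT)
```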

Execution and OUTPUT: (screenshot of the results omitted)

huangapple
  • Published on 2023-06-22 19:20:31
  • Please keep this link when reposting: https://go.coder-hub.com/76531358.html