Is there an alternative in PySpark to countAll and countAllDistinct in Azure Data Flow?

Question

In the Azure Data Flow Aggregate transformation I have:

count --> counts values, nulls excluded
countAll --> counts values, nulls included

and similarly countDistinct and countAllDistinct.

But in PySpark there are only count() and countDistinct(), so how can countAll() and countAllDistinct() be achieved in PySpark?
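To make the four counting semantics concrete before the answers, here is a minimal plain-Python sketch, with a list standing in for a single column and None standing in for SQL NULL (the sample values are made up for illustration):

```python
# Plain-Python sketch of the four counting semantics; None stands in
# for SQL NULL and the sample values are hypothetical.
values = [10, None, 20, 10, None]

count_ = sum(1 for v in values if v is not None)            # nulls excluded -> 3
count_all = len(values)                                      # nulls included -> 5
count_distinct = len({v for v in values if v is not None})   # distinct non-null -> 2
count_all_distinct = len(set(values))                        # None counts once -> 3

print(count_, count_all, count_distinct, count_all_distinct)  # -> 3 5 2 3
```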

Answer 1

Score: 0

As you know, there are no built-in countAll and countAllDistinct functions; you can work around this as follows.

count() on a column only counts non-null values, so to count all rows in a PySpark DataFrame including nulls, use count("*"), which never skips rows.

If you have a DataFrame in PySpark:

from pyspark.sql.functions import count, col, countDistinct

# count all rows excluding nulls (count on a column skips NULLs)
df.select(count(col("column name"))).show()

# count all rows including nulls (count("*") never skips rows)
df.select(count("*")).show()

# count all distinct values excluding null
df.select(countDistinct(col("column name"))).show()

# count all distinct values including null (distinct() keeps NULL as one value)
print(df.select("column name").distinct().count())

Execution and OUTPUT: (screenshot of the results omitted)

If you register the DataFrame as a temporary view (e.g. df.createOrReplaceTempView("sampleView")), the same counts can be written in Spark SQL:

-- count all rows, including nulls
SELECT COUNT(*) AS total_count FROM sampleView;

-- count all distinct rows, including null
SELECT COUNT(*) AS distinct_total_count FROM (SELECT DISTINCT value FROM sampleView);
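The NULL behavior these queries rely on can be checked with a quick sketch against SQLite from the Python standard library; COUNT(*), COUNT(column), and DISTINCT treat NULLs the same way there as in Spark SQL. The table and sample rows below are made up to mirror the hypothetical sampleView:

```python
import sqlite3

# In-memory SQLite check of the COUNT semantics above; the table and
# sample rows mirror the hypothetical sampleView.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sampleView (value INTEGER)")
con.executemany("INSERT INTO sampleView VALUES (?)",
                [(10,), (None,), (20,), (10,), (None,)])

total = con.execute("SELECT COUNT(*) FROM sampleView").fetchone()[0]
non_null = con.execute("SELECT COUNT(value) FROM sampleView").fetchone()[0]
distinct_total = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT value FROM sampleView)"
).fetchone()[0]

print(total, non_null, distinct_total)  # -> 5 3 3 (NULL counted once by DISTINCT)
```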

Execution and OUTPUT: (screenshot of the results omitted)

huangapple
  • Published on 2023-06-22 19:20:31
  • Please keep this link when reposting: https://go.coder-hub.com/76531358.html