Count number of occurrences in column grouped by another column

Question


I have a dataframe with multiple columns:

+-------------+--------+
|         x   |      y |
+-------------+--------+
|            a|     one| 
|            a|     one|
|            a|     two|
|            b|     one|
|            b|     two|  
|            c|     one| 
+-------------+--------+

I would like to group by x and, for each group of x, count the number of times "one" occurs.

Something like:

df.groupBy(x).agg(countDistinct("one")).collect()

The output would be 2, 1, 1, since "one" occurs twice for group a and once for each of groups b and c.

Answer 1

Score: 1


Try using a filter before the group by:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("a", "one"),
        ("a", "one"),
        ("a", "two"),
        ("b", "one"),
        ("b", "two"),
        ("c", "one")]
# Create original df
df = spark.createDataFrame(data, ["x", "y"])

# Using filter before group_by
result_df = (df.filter(col('y') == 'one')
                .groupBy('x')
                .agg(count('*').alias('count_of_one')))

result_df.show()
# +---+------------+
# |  x|count_of_one|
# +---+------------+
# |  a|           2|
# |  b|           1|
# |  c|           1|
# +---+------------+

If you also want groups with no matching value to appear (with a count of 0), you can use count with a condition, as below:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("a", "one"),
        ("a", "one"),
        ("a", "two"),
        ("b", "one"),
        ("b", "two"),
        ("c", "one"),
        ("d", "three"),
        ("d", "four"),]

# Create original df
df = spark.createDataFrame(data, ["x", "y"])

# Using count with a condition (when) ~ COUNT(CASE WHEN ...) in SQL
# Note that count(col("y") == "one") would not work: count only skips nulls,
# and the boolean expression is never null, so every row would be counted
result_df = df.groupBy("x").agg(count(when(col("y") == "one", True)).alias("count_one"))

result_df.show()
# +---+---------+
# |  x|count_one|
# +---+---------+
# |  a|        2|
# |  b|        1|
# |  c|        1|
# |  d|        0|
# +---+---------+

huangapple
  • Published 2023-03-09 15:40:04
  • Please retain this link when republishing: https://go.coder-hub.com/75681655.html