Count number of occurrences in column grouped by another column

Question
I have a dataframe with multiple columns:

+---+-----+
|  x|    y|
+---+-----+
|  a|  one|
|  a|  one|
|  a|  two|
|  b|  one|
|  b|  two|
|  c|  one|
+---+-----+

I would like to group by x and, for each group of x, count the number of times "one" occurs. Something like:

df.groupBy("x").agg(countDistinct("one")).collect()

The output would be: 2, 1, 1, since "one" occurs twice for group a and once each for groups b and c.
Answer 1

Score: 1
Try using filter before the group by:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("a", "one"),
        ("a", "one"),
        ("a", "two"),
        ("b", "one"),
        ("b", "two"),
        ("c", "one")]

# Create the original dataframe
df = spark.createDataFrame(data, ["x", "y"])

# Filter down to the matching rows, then group and count
result_df = (df.filter(col('y') == 'one')
             .groupBy('x')
             .agg(count('*').alias('count_of_one')))

result_df.show()
# +---+------------+
# |  x|count_of_one|
# +---+------------+
# |  a|           2|
# |  b|           1|
# |  c|           1|
# +---+------------+
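For comparison, the same filter-then-group logic can be written in Spark SQL. This is a sketch not present in the original answer; it assumes the dataframe has been registered as a temporary view named t:

df.createOrReplaceTempView("t")

spark.sql("""
    SELECT x, COUNT(*) AS count_of_one
    FROM t
    WHERE y = 'one'
    GROUP BY x
""").show()

Filtering first also tends to be efficient, since non-matching rows are dropped before the shuffle that groupBy triggers.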
If you also want groups with no matching value to appear with a count of 0, you can use count with a condition, as below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("a", "one"),
        ("a", "one"),
        ("a", "two"),
        ("b", "one"),
        ("b", "two"),
        ("c", "one"),
        ("d", "three"),
        ("d", "four")]

# Create the original dataframe
df = spark.createDataFrame(data, ["x", "y"])

# Conditional count with when() ~ COUNT(CASE WHEN ...) in SQL.
# Note that count(col("y") == "one") would not work: the comparison yields
# False (non-null) for non-matching rows, and count counts every non-null
# value. when() without an otherwise() returns null for non-matches, and
# count skips nulls, so only matching rows are counted.
result_df = df.groupBy("x").agg(count(when(col("y") == "one", True)).alias("count_one"))

result_df.show()
# +---+---------+
# |  x|count_one|
# +---+---------+
# |  a|        2|
# |  b|        1|
# |  c|        1|
# |  d|        0|
# +---+---------+
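An equivalent formulation, not part of the original answer but a common alternative, sums a 0/1 flag per row instead of relying on count skipping nulls:

from pyspark.sql.functions import col, when, sum as spark_sum

# Each matching row contributes 1, each non-matching row 0,
# so groups with no matches naturally get a count of 0
result_df = df.groupBy("x").agg(
    spark_sum(when(col("y") == "one", 1).otherwise(0)).alias("count_one")
)
result_df.show()

Both versions produce the same counts, including the 0 for group d.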