Count number of occurrences in column grouped by another column

Question
I have a dataframe with multiple columns:

+---+-----+
|  x|    y|
+---+-----+
|  a|  one|
|  a|  one|
|  a|  two|
|  b|  one|
|  b|  two|
|  c|  one|
+---+-----+

I would like to group by x and, for each group of x, count the number of times "one" occurs. Something like:

df.groupBy("x").agg(countDistinct("one")).collect()

The output would be: 2, 1, 1, since "one" occurs twice for group a and once each for groups b and c.
Answer 1

Score: 1
Try using filter before the group by:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("a", "one"),
        ("a", "one"),
        ("a", "two"),
        ("b", "one"),
        ("b", "two"),
        ("c", "one")]

# Create the original dataframe
df = spark.createDataFrame(data, ["x", "y"])

# Filter down to the matching rows, then group and count
result_df = (df.filter(col('y') == 'one')
             .groupBy('x')
             .agg(count('*').alias('count_of_one')))

result_df.show()
# +---+------------+
# |  x|count_of_one|
# +---+------------+
# |  a|           2|
# |  b|           1|
# |  c|           1|
# +---+------------+
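For comparison, the same filter-then-group logic can be written in Spark SQL. This is a sketch not present in the original answer; it assumes the dataframe has been registered as a temporary view named t:

df.createOrReplaceTempView("t")

spark.sql("""
    SELECT x, COUNT(*) AS count_of_one
    FROM t
    WHERE y = 'one'
    GROUP BY x
""").show()

Filtering first also tends to be efficient, since non-matching rows are dropped before the shuffle that groupBy triggers.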
If you also want groups with no matching value to appear with a count of 0, you can use count with a condition, as below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("a", "one"),
        ("a", "one"),
        ("a", "two"),
        ("b", "one"),
        ("b", "two"),
        ("c", "one"),
        ("d", "three"),
        ("d", "four")]

# Create the original dataframe
df = spark.createDataFrame(data, ["x", "y"])

# Conditional count with when() ~ COUNT(CASE WHEN ...) in SQL.
# Note that count(col("y") == "one") would not work: the comparison yields
# False (non-null) for non-matching rows, and count counts every non-null
# value. when() without an otherwise() returns null for non-matches, and
# count skips nulls, so only matching rows are counted.
result_df = df.groupBy("x").agg(count(when(col("y") == "one", True)).alias("count_one"))

result_df.show()
# +---+---------+
# |  x|count_one|
# +---+---------+
# |  a|        2|
# |  b|        1|
# |  c|        1|
# |  d|        0|
# +---+---------+
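An equivalent formulation, not part of the original answer but a common alternative, sums a 0/1 flag per row instead of relying on count skipping nulls:

from pyspark.sql.functions import col, when, sum as spark_sum

# Each matching row contributes 1, each non-matching row 0,
# so groups with no matches naturally get a count of 0
result_df = df.groupBy("x").agg(
    spark_sum(when(col("y") == "one", 1).otherwise(0)).alias("count_one")
)
result_df.show()

Both versions produce the same counts, including the 0 for group d.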