July 7, 2023 01:34:44

PySpark - How to get the count of a particular element in an array without exploding?

Question

Input dataframe:

| Name | action |
|------|--------|
| a | [walk, run, sit, walk, run, sit] |
| b | [walk, sit, run, walk, sleep, wait] |

Calculate how many times walk and run appear in each action array, without exploding the array, to produce the output dataframe below.

| Name | action | walk_count | run_count |
|------|--------|------------|-----------|
| a | [walk, run, sit, walk, run, sit] | 2 | 2 |
| b | [walk, sit, run, walk, sleep, wait] | 2 | 1 |

Answer 1

Score: 2

Try the higher-order filter function in PySpark.

size(filter(action, n -> n == 'walk')): filter visits each item of the array and keeps only the elements equal to 'walk'; size then returns the length of the filtered array, which is the count.

Example:

from pyspark.sql.functions import expr

# Assumes an active SparkSession named spark
df = spark.createDataFrame([('a', ['walk', 'run', 'sit', 'walk', 'run', 'sit'])], ['name', 'action'])

# filter keeps the matching elements; size counts them
df.withColumn("walk_count", expr("size(filter(action, n -> n == 'walk'))")) \
    .withColumn("run_count", expr("size(filter(action, n -> n == 'run'))")) \
    .show(10, False)
#+----+--------------------------------+----------+---------+
#|name|action                          |walk_count|run_count|
#+----+--------------------------------+----------+---------+
#|a   |[walk, run, sit, walk, run, sit]|2         |2        |
#+----+--------------------------------+----------+---------+
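
To make the per-row logic concrete, here is a plain-Python sketch of what size(filter(action, n -> n == 'walk')) computes for a single row. It does not use Spark at all; the helper name count_matches and the rows list are hypothetical, chosen only to mirror the input in the question.

```python
# Plain-Python model of size(filter(action, n -> n == target)) for one row.
# count_matches and rows are illustrative names, not part of PySpark.

def count_matches(action, target):
    """Keep only the elements equal to target, then count what is left."""
    kept = [n for n in action if n == target]  # the filter step
    return len(kept)                           # the size step


rows = [
    ("a", ["walk", "run", "sit", "walk", "run", "sit"]),
    ("b", ["walk", "sit", "run", "walk", "sleep", "wait"]),
]

for name, action in rows:
    print(name, count_matches(action, "walk"), count_matches(action, "run"))
```

In this model row b gets a run count of 1, because run appears once in its array.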

Answer 2

Score: 1

You can use the aggregate function as follows:

from pyspark.sql import functions as f

# df is the dataframe with the name and action columns
agg_cols = ['walk', 'run']
for col in agg_cols:
    # Fold over the array: start at 0 and add 1 for each element equal to col
    df = df.withColumn(f'{col}_count', f.expr(f"aggregate(action, 0, (acc, x) -> if(x = '{col}', acc + 1, acc))"))
df.show(truncate=False)
#+----+--------------------------------+----------+---------+
#|name|action                          |walk_count|run_count|
#+----+--------------------------------+----------+---------+
#|a   |[walk, run, sit, walk, run, sit]|2         |2        |
#+----+--------------------------------+----------+---------+
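
The aggregate expression above is a left fold: it starts from 0 and adds 1 each time an element equals the target. A minimal plain-Python sketch of that fold, using functools.reduce, may help when reading the lambda. The helper name count_by_fold is hypothetical, and this models only the logic, not how Spark executes it.

```python
from functools import reduce


def count_by_fold(action, target):
    # Mirrors aggregate(action, 0, (acc, x) -> if(x = target, acc + 1, acc)):
    # the accumulator starts at 0 and grows by 1 on every match.
    return reduce(lambda acc, x: acc + 1 if x == target else acc, action, 0)


action = ["walk", "run", "sit", "walk", "run", "sit"]

# One count per target, keyed the same way as the new columns
counts = {f"{target}_count": count_by_fold(action, target) for target in ["walk", "run"]}
print(counts)
```

Compared with the filter-and-size approach, the fold never builds an intermediate array; it carries only a running integer.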
