PySpark: How to get the count of a particular element in an array without exploding?
Question
Input dataframe:
| Name | action |
|---|---|
| a | [walk,run,sit,walk,run,sit] |
| b | [walk,sit,run,walk,sleep,wait] |
Calculate how many times walk and run occur in each action array, without exploding the array, to produce the output dataframe below.
| Name | action | walk_count | run_count |
|---|---|---|---|
| a | [walk,run,sit,walk,run,sit] | 2 | 2 |
| b | [walk,sit,run,walk,sleep,wait] | 2 | 1 |
Answer 1
Score: 2
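Try the higher-order filter function, which Spark SQL provides and PySpark can call through expr:
size(filter(action, n -> n == 'walk'))
filter visits each element of the array and keeps only those equal to 'walk'; size then returns the length of the filtered array, which is the count. No explode is needed.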
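Example:
from pyspark.sql.functions import expr
# Assumes an existing SparkSession named `spark`.
df = spark.createDataFrame([('a', ['walk', 'run', 'sit', 'walk', 'run', 'sit'])], ['name', 'action'])
# For each row, keep only the matching elements and take the size of the result.
(df.withColumn("walk_count", expr("size(filter(action, n -> n == 'walk'))"))
   .withColumn("run_count", expr("size(filter(action, n -> n == 'run'))"))
   .show(10, False))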
#+----+--------------------------------+----------+---------+
#|name|action |walk_count|run_count|
#+----+--------------------------------+----------+---------+
#|a |[walk, run, sit, walk, run, sit]|2 |2 |
#+----+--------------------------------+----------+---------+
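To make the per-row logic concrete, here is a minimal plain-Python sketch of what size(filter(...)) computes for a single array. It does not use Spark at all; the function name count_action is hypothetical and only serves the illustration.

```python
def count_action(actions, target):
    """Plain-Python model of size(filter(action, n -> n == target)) for one row."""
    matching = [n for n in actions if n == target]  # the filter step
    return len(matching)                            # the size step


# The array from the example row above.
row_a = ["walk", "run", "sit", "walk", "run", "sit"]

print(count_action(row_a, "walk"))   # 2
print(count_action(row_a, "run"))    # 2
print(count_action(row_a, "sleep"))  # 0, an absent element yields an empty filtered array
```

On Spark 3.1 and later, the same expression can also be written with the DataFrame API instead of a SQL string, using pyspark.sql.functions.size and pyspark.sql.functions.filter with a Python lambda.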
Answer 2
Score: 1
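You can use the aggregate function, which folds over the array with an accumulator that starts at 0 and is incremented on every match:
from pyspark.sql import functions as f
# Assumes `df` is the dataframe with the array column `action`.
agg_cols = ['walk', 'run']
for action_name in agg_cols:
    # Add one <action>_count column per action.
    df = df.withColumn(f'{action_name}_count', f.expr(f"aggregate(action, 0, (acc, x) -> if(x = '{action_name}', acc + 1, acc))"))
df.show(truncate=False)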
+----+--------------------------------+----------+---------+
|name|action |walk_count|run_count|
+----+--------------------------------+----------+---------+
|a |[walk, run, sit, walk, run, sit]|2 |2 |
+----+--------------------------------+----------+---------+
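The aggregate expression above is a left fold. As a rough illustration only, the same per-row computation can be modelled in plain Python with functools.reduce; the function name count_action_fold is hypothetical and no Spark is involved.

```python
from functools import reduce


def count_action_fold(actions, target):
    """Plain-Python model of aggregate(action, 0, (acc, x) -> if(x = target, acc + 1, acc))."""
    # Start the accumulator at 0 and add 1 each time an element equals the target.
    return reduce(lambda acc, x: acc + 1 if x == target else acc, actions, 0)


row_a = ["walk", "run", "sit", "walk", "run", "sit"]

# Mirror the loop over agg_cols: one <action>_count entry per action.
counts = {f"{name}_count": count_action_fold(row_a, name) for name in ["walk", "run"]}
print(counts)  # {'walk_count': 2, 'run_count': 2}
```

The fold produces the same counts as the filter-then-size approach; it simply never materialises the intermediate filtered array.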