PySpark pivot table with multiple columns
Question
I am trying to pivot a dataframe with one key and multiple values spread across different columns. How do I do this in PySpark? I have used pivot with a single key-value pair before and am trying to figure this out.
Sample dataframe
id | test_id | test_status | key | score1 | score2 | score3 |
---|---|---|---|---|---|---|
ABC | 1 | complete | q1 | 1 | 2 | 3 |
ABC | 1 | complete | q2 | 4 | 5 | 6 |
ABC | 2 | complete | q1 | 1 | 6 | 7 |
ABC | 2 | complete | q2 | 5 | 6 | 7 |
Expected dataframe
id | test_id | test_status | q1_score1 | q1_score2 | q1_score3 | q2_score1 | q2_score2 | q2_score3 |
---|---|---|---|---|---|---|---|---|
ABC | 1 | complete | 1 | 2 | 3 | 4 | 5 | 6 |
ABC | 2 | complete | 1 | 6 | 7 | 5 | 6 | 7 |
Answer 1
Score: 1
You can pivot multiple value columns at once by passing several aggregations to `agg`:
from pyspark.sql import functions as F

# one aggregation per score column; pivoted columns come out as <key>_<alias>
df = (df.groupby('id', 'test_id', 'test_status')
        .pivot('key')
        .agg(*[F.first(x).alias(x) for x in ['score1', 'score2', 'score3']]))
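For context, here is a minimal runnable sketch that rebuilds the sample dataframe from the question and applies this pivot; the `SparkSession` setup is assumed boilerplate, not part of the original answer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('ABC', 1, 'complete', 'q1', 1, 2, 3),
     ('ABC', 1, 'complete', 'q2', 4, 5, 6),
     ('ABC', 2, 'complete', 'q1', 1, 6, 7),
     ('ABC', 2, 'complete', 'q2', 5, 6, 7)],
    ['id', 'test_id', 'test_status', 'key', 'score1', 'score2', 'score3'])

pivoted = (df.groupby('id', 'test_id', 'test_status')
             .pivot('key')
             .agg(*[F.first(x).alias(x) for x in ['score1', 'score2', 'score3']]))
pivoted.show()
# columns: id, test_id, test_status, q1_score1 ... q2_score3, matching the expected dataframe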
Answer 2
Score: 0
Try the **`pivot`** and **`first`** aggregate functions.
**Example:**
from pyspark.sql.functions import first, col

df = spark.createDataFrame([('ABC','1','c','q1','1','2','3'),('ABC','1','c','q2','4','5','6')],
                           ['id','test_id','test_status','key','score1','score2','score3'])
df.show(10,False)
df.groupBy("id","test_id","test_status").pivot("key").agg(first(col("score1")).alias("score1"),first(col("score2")).alias("score2"),first(col("score3")).alias("score3")).show(10,False)
#input
#+---+-------+-----------+---+------+------+------+
#|id |test_id|test_status|key|score1|score2|score3|
#+---+-------+-----------+---+------+------+------+
#|ABC|1 |c |q1 |1 |2 |3 |
#|ABC|1 |c |q2 |4 |5 |6 |
#+---+-------+-----------+---+------+------+------+
#output
#+---+-------+-----------+---------+---------+---------+---------+---------+---------+
#|id |test_id|test_status|q1_score1|q1_score2|q1_score3|q2_score1|q2_score2|q2_score3|
#+---+-------+-----------+---------+---------+---------+---------+---------+---------+
#|ABC|1 |c |1 |2 |3 |4 |5 |6 |
#+---+-------+-----------+---------+---------+---------+---------+---------+---------+
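Side note (an assumption beyond the original answer, but standard `pivot` behavior): when the distinct keys are known up front, they can be passed as a second argument to `pivot`, which spares Spark an extra job to collect them. A sketch of that variant, reusing `df` from above:

# listing the pivot values explicitly avoids a pass over the data to find distinct keys
df.groupBy("id","test_id","test_status") \
  .pivot("key", ["q1","q2"]) \
  .agg(first(col("score1")).alias("score1"),
       first(col("score2")).alias("score2"),
       first(col("score3")).alias("score3")) \
  .show(10,False)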