Transform tuple to matrix in Spark
Question
I have an RDD of tuples and values that looks like this. There are thousands of different pairings.
(A, B), 1
(B, C), 2
(C, D), 1
(A, D), 1
(D, A), 5
I want to transform the tuple-value pairs into a matrix that corresponds to the pairs. I didn't see any easy way to do this in Spark.
+---+------+------+------+------+
| | A | B | C | D |
+---+------+------+------+------+
| A | - | 1 | NULL | 1 |
| B | NULL | - | 2 | NULL |
| C | NULL | NULL | - | 1 |
| D | 5 | NULL | NULL | - |
+---+------+------+------+------+
Answer 1
Score: 1
Best effort, but you cannot get rid of the first column's header using Spark SQL (which you state you are using).
Just pivoting with natural ordering.
Try it; I added an extra tuple.
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDF(); `spark` and `sc` are predefined in spark-shell

// Not sure what the difference is between (("A", "B"), 1) and ("A", "B", 1)
val rdd = sc.parallelize(Seq((("A", "B"), 1), (("B", "C"), 2), (("C", "D"), 1), (("A", "D"), 1), (("D", "A"), 5), (("E", "Z"), 500)))
// In fact you can start from here: flatten ((row, col), value) to (row, col, value)
val rdd2 = rdd.map(x => (x._1._1, x._1._2, x._2))
val df = rdd2.toDF()
// Natural ordering, but you cannot get rid of the _1 column in a DF (Spark SQL)
df.groupBy("_1").pivot("_2").agg(first("_3"))
  .orderBy("_1")
  .show(false)
returns:
+---+----+----+----+----+----+
|_1 |A |B |C |D |Z |
+---+----+----+----+----+----+
|A |null|1 |null|1 |null|
|B |null|null|2 |null|null|
|C |null|null|null|1 |null|
|D |5 |null|null|null|null|
|E |null|null|null|null|500 |
+---+----+----+----+----+----+
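As a small follow-up sketch (not part of the original answer): naming the columns when calling toDF avoids the generic _1/_2/_3 headers, and passing an explicit value list to pivot lets Spark skip the extra job it otherwise runs to discover the pivot values. The names row, col and value below are illustrative choices, not anything the original code defines.

import org.apache.spark.sql.functions._
import spark.implicits._

// Same ((row, col), value) tuples as above; "row", "col" and "value" are
// hypothetical column names chosen for readability.
val named = sc.parallelize(Seq((("A", "B"), 1), (("B", "C"), 2), (("C", "D"), 1), (("A", "D"), 1), (("D", "A"), 5)))
  .map { case ((r, c), v) => (r, c, v) }
  .toDF("row", "col", "value")

// Listing the pivot values explicitly (here A..D) avoids an extra pass over
// the data to compute the distinct values of "col".
named.groupBy("row")
  .pivot("col", Seq("A", "B", "C", "D"))
  .agg(first("value"))
  .orderBy("row")
  .show(false)

The question's desired output also shows "-" on the diagonal; one way to get that, under the same assumptions, is to cast the values to string and union in a (label, label, "-") row for every distinct label before pivoting:

// Collect every label that appears as a row or column key.
val labels = named.select($"row").union(named.select($"col")).distinct
// Build the diagonal entries and union them with the stringified values.
val diag = labels.select($"row", $"row".as("col"), lit("-").as("value"))
val matrix = named.select($"row", $"col", $"value".cast("string").as("value")).union(diag)

matrix.groupBy("row").pivot("col").agg(first("value")).orderBy("row").show(false)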