Transform tuple to matrix in Spark


Question

I have an RDD of (tuple, value) pairs that looks like this. There are thousands of different pairings.

(A, B), 1
(B, C), 2
(C, D), 1
(A, D), 1
(D, A), 5

I want to transform these (tuple, value) pairs into a matrix whose rows and columns correspond to the tuple elements, but I didn't see any easy way to do this in Spark.

+---+------+------+------+------+
|   |  A   |  B   |  C   |  D   |
+---+------+------+------+------+
| A |  -   |  1   | NULL |  1   |
| B | NULL |  -   |  2   | NULL |
| C | NULL | NULL |  -   |  1   |
| D |  5   | NULL | NULL |  -   |
+---+------+------+------+------+
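
For concreteness, a minimal sketch (in Scala, assuming a spark-shell style SparkContext named sc; the name pairs is illustrative) of the RDD described above, where each element is a ((row, column), value) pair:

// Element type is ((String, String), Int): a (row, column) key with its count
val pairs: org.apache.spark.rdd.RDD[((String, String), Int)] =
  sc.parallelize(Seq((("A", "B"), 1), (("B", "C"), 2), (("C", "D"), 1), (("A", "D"), 1), (("D", "A"), 5)))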

Answer 1

Score: 1

Best effort, but you cannot get rid of the grouping column's name when using Spark SQL (which you state you are using). This simply pivots the pairs, with rows and columns in their natural order. Try it; an extra tuple has been added to show the general case.

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF() when not running in the spark-shell

// Not sure whether the input is keyed as ("A", "B"), 1 or as a flat "A", "B", 1 triple
val rdd = sc.parallelize(Seq((("A", "B"), 1), (("B", "C"), 2), (("C", "D"), 1), (("A", "D"), 1), (("D", "A"), 5), (("E", "Z"), 500)))

// Flatten the nested key into a (row, column, value) triple; in fact you could start from here
val rdd2 = rdd.map(x => (x._1._1, x._1._2, x._2))

val df = rdd2.toDF()

// Natural ordering, but the _1 grouping column cannot be dropped from the pivoted DataFrame (Spark SQL)
df.groupBy("_1").pivot("_2").agg(first("_3"))
  .orderBy("_1")
  .show(false)

returns:

+---+----+----+----+----+----+
|_1 |A   |B   |C   |D   |Z   |
+---+----+----+----+----+----+
|A  |null|1   |null|1   |null|
|B  |null|null|2   |null|null|
|C  |null|null|null|1   |null|
|D  |5   |null|null|null|null|
|E  |null|null|null|null|500 |
+---+----+----+----+----+----+
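
If the _1/_2/_3 column names are a concern, a small variation (a sketch only, reusing rdd2 and the functions import from the snippet above; the names row, col and value are illustrative, not from the answer) is to name the columns when converting to a DataFrame, so the pivoted result at least carries a meaningful row label while the pivot itself is unchanged:

// Name the triple's columns up front (illustrative names)
val named = rdd2.toDF("row", "col", "value")

named.groupBy("row").pivot("col").agg(first("value"))
  .orderBy("row")
  .show(false)

This prints the same matrix as above with the leading column labelled row instead of _1; the blank corner cell and the "-" diagonal from the desired layout would still need to be added outside the pivot.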
