How does reduceByKey() in pyspark know which column is the key and which one is the value?
Question
I'm a newbie to Pyspark and going through this.

How does reduceByKey() know whether it should consider the first column as the key and the second as the value, or vice versa? I don't see any column name or index mentioned in the code for reduceByKey(). Does reduceByKey() by default consider the first column as the key and the second as the value?

How do I perform reduceByKey() if there are multiple columns in the dataframe? I'm aware of df.select(col1, col2).reduceByKey(). I'm just looking to see whether there is any other way.
Answer 1
Score: 1
I'm not sure which version of Spark you are on, but it should not matter that much. I'll assume you are on version 3.4.1, which is the latest version at the time of writing.

If we take a look at the function signature of reduceByKey in the source code, we see this:
def reduceByKey(
    self: "RDD[Tuple[K, V]]",
    func: Callable[[V, V], V],
    numPartitions: Optional[int] = None,
    partitionFunc: Callable[[K], int] = portable_hash,
) -> "RDD[Tuple[K, V]]":
So indeed, this function expects your RDD to be of type Tuple[K, V], where K stands for key and V stands for value.
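In other words, there are no column names involved: whatever sits at position 0 of each tuple is treated as the key, and whatever sits at position 1 is the value. A minimal sketch of that behaviour (the data and variable names here are purely illustrative, not taken from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Each element is a 2-tuple: index 0 is the key, index 1 is the value.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# The lambda only ever sees two values that share the same key;
# the key itself is handled implicitly by reduceByKey.
summed = pairs.reduceByKey(lambda v1, v2: v1 + v2)
print(summed.collect())  # e.g. [('a', 4), ('b', 2)] (ordering may vary)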
Now, if you want to perform reduceByKey on an RDD that has more columns, you can just turn them into a single value column that is itself a tuple of values.
As an example, take the data from the website in your example, with an extra column added:
data = [
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
    ("Alice’s", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
]
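The original answer doesn't show how rdd is built from this list; presumably it is created roughly like this (a sketch, assuming an existing SparkSession named spark):

# Assumed setup (not shown in the original answer): distribute the list as an RDD.
rdd = spark.sparkContext.parallelize(data)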
Let's turn rdd
into an RDD with the correct shape (2 columns, a key and a value column):
rdd2 = rdd.map(lambda x: (x[0], (x[1], x[2])))
rdd2.collect()
[('Project', (1, 2)), ('Gutenberg’s', (1, 2)), ('Alice’s', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg’s', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg’s', (1, 2))]
And let's say we want to reduce the two value columns, but for the first we want a sum (like on your example site) and for the second a product:
rdd3 = rdd2.reduceByKey(lambda a, b: (a[0] + b[0], a[1] * b[1]))
rdd3.collect()
[('Alice’s', (1, 2)), ('in', (2, 4)), ('Project', (3, 8)), ('Gutenberg’s', (3, 8)), ('Adventures', (2, 4)), ('Wonderland', (2, 4))]
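Since the question mentions dataframes: if you need the result back as a DataFrame rather than an RDD of tuples, one option is to flatten the value tuple and call toDF (the column names below are just illustrative):

# Flatten (key, (sum, product)) into flat rows and convert back to a DataFrame.
df_result = rdd3.map(lambda kv: (kv[0], kv[1][0], kv[1][1])).toDF(["word", "total", "product"])
df_result.show()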
Comments