How does reduceByKey() in PySpark know which column is the key and which one is the value?

I'm a newbie to Pyspark and going through this.

How does reduceByKey() know whether it should consider the first column as the key and the second as the value, or vice versa?

I don't see any column name or index mentioned in the code for reduceByKey(). Does reduceByKey() by default consider the first column as the key and the second as the value?

How do you perform reduceByKey() if there are multiple columns in the dataframe?

I'm aware of df.select(col1, col2).reduceByKey(). I'm just wondering if there is any other way.

Answer 1

Score: 1


I'm not sure which version of Spark you are on, but it shouldn't matter much. I'll assume you are on version 3.4.1, the latest version as of the time of writing.

If we take a look at the function signature of reduceByKey in the source code, we see this:

    def reduceByKey(
        self: "RDD[Tuple[K, V]]",
        func: Callable[[V, V], V],
        numPartitions: Optional[int] = None,
        partitionFunc: Callable[[K], int] = portable_hash,
    ) -> "RDD[Tuple[K, V]]":

So indeed, this function expects your RDD to be of type Tuple[K, V] where K stands for key and V stands for value.
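To make that contract concrete, here is a pure-Python sketch (just an illustration of the semantics, not how Spark actually executes it) of what reduceByKey does with a collection of (key, value) tuples: it groups by the first element of each tuple and folds the second elements together with your function. The function name reduce_by_key_sketch is made up for this example.

```python
from functools import reduce

def reduce_by_key_sketch(pairs, func):
    """Group (key, value) tuples by key, then fold the values with func.

    Mimics reduceByKey's contract: the first tuple element is always
    the key, the second is always the value -- there are no column
    names or indices to configure.
    """
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return [(key, reduce(func, values)) for key, values in groups.items()]

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(reduce_by_key_sketch(pairs, lambda x, y: x + y))
# [('a', 4), ('b', 2)]
```

Spark distributes this work across partitions (combining locally before shuffling), but the key/value convention is exactly the one shown here.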

Now, if you want to perform reduceByKey on an RDD with more columns, you can turn the extra columns into a single value column that is itself a tuple of values.

As an example, take your example website's data with an extra column added:

data = [
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
    ("Alice’s", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
]

Let's create the RDD (assuming a SparkSession named spark) and turn it into one with the correct shape (2 columns: a key column and a value column):

rdd = spark.sparkContext.parallelize(data)
rdd2 = rdd.map(lambda x: (x[0], (x[1], x[2])))

rdd2.collect()
[('Project', (1, 2)), ('Gutenberg’s', (1, 2)), ('Alice’s', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg’s', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg’s', (1, 2))]

Now let's say we want to reduce both value columns, but for the first one we want a sum (like on your example site) and for the second one a product:

rdd3 = rdd2.reduceByKey(lambda a, b: (a[0] + b[0], a[1] * b[1]))

rdd3.collect()
[('Alice’s', (1, 2)), ('in', (2, 4)), ('Project', (3, 8)), ('Gutenberg’s', (3, 8)), ('Adventures', (2, 4)), ('Wonderland', (2, 4))]
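As a sanity check, the same result can be reproduced without Spark at all. This pure-Python sketch applies the same per-key reduction to the data above (illustration only; Spark would do this in parallel across partitions):

```python
data = [
    ("Project", 1, 2), ("Gutenberg’s", 1, 2), ("Alice’s", 1, 2),
    ("Adventures", 1, 2), ("in", 1, 2), ("Wonderland", 1, 2),
    ("Project", 1, 2), ("Gutenberg’s", 1, 2), ("Adventures", 1, 2),
    ("in", 1, 2), ("Wonderland", 1, 2), ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
]

# Reshape each row to key -> (v1, v2), then fold per key:
# sum the first values, multiply the second ones -- the same
# logic as the lambda passed to reduceByKey above.
result = {}
for key, v1, v2 in data:
    if key in result:
        acc = result[key]
        result[key] = (acc[0] + v1, acc[1] * v2)
    else:
        result[key] = (v1, v2)

print(result["Project"])     # (3, 8): 1+1+1 and 2*2*2
print(result["Wonderland"])  # (2, 4): 1+1 and 2*2
```

Only the per-key ordering differs from the Spark output, since reduceByKey shuffles keys across partitions before combining.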

huangapple
  • Posted on 2023-08-09 18:01:35
  • Please retain this link when reposting: https://go.coder-hub.com/76866634.html