June 26, 2023 01:25:26

How can I summarize all columns of a polars dataframe

Question

Pandas makes it easy to summarize columns of a dataframe with an arbitrary function using df.apply(my_func, axis=0).

How can I do the same in polars? Shown below is an MWE. I have a function (just an example; I would like to do this for arbitrary functions) that I can apply to entire columns. The function summarizes columns in pandas using the syntax I've shown.

What is the syntax to perform the same operation in polars?

import polars as pl
import pandas as pd
import numpy as np

# Toy Data
data = {'a': [1, 2, 3, 4, 5],
        'b': [2, 4, 6, 8, 10]}

# Pandas and polars copy
df = pd.DataFrame(data)
pdf = pl.DataFrame(data)

# Function I want to use to summarize my columns
my_func = lambda x: np.log(x.mean())

# How to do this in pandas
df.apply(my_func, axis=0)

# How do I do the same in polars?

Answer 1

Score: 2

See map. It must be used in the select context here; see the section in the book for more caveats.

pdf.select(pl.all().map(my_func))  # Expr.map was renamed map_batches in Polars 0.19

Answer 2

Score: 2

You really shouldn't use python functions when there are expressions in polars that can achieve your goal.

data = {'a': [1, 2, 3, 4, 5],
        'b': [2, 4, 6, 8, 10]}

df = pl.DataFrame(data)

df.select(
    pl.all().mean().log()
)

Every map or apply is a code smell and should be avoided unless it cannot be done differently.

Context

The idiomatic way to compute anything in Polars is to use expressions. They should be preferred for a number of reasons:

they run in parallel
they can be optimized
they are compiled in Rust

A Python function is opaque to Polars. It cannot be optimized because Polars knows neither what it does nor what its output is.

The OP says they want to run any arbitrary function. Expressions cover this: any expression can take a map or apply and accept a Python function as an escape hatch. For this reason, answering how to run an expression on all columns is a superset of answering how to run a Python function on all columns.
