July 11, 2023 04:24:28

Apply minmax_scale to all columns in polars data frame

Question

I am trying to follow the advice from this question:

import numpy as np
import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.select([pl.all().map(np.log2)])
shape: (3, 2)
┌──────────┬──────────┐
│ a        ┆ b        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.0      ┆ 2.0      │
│ 1.0      ┆ 2.321928 │
│ 1.584963 ┆ 2.584963 │
└──────────┴──────────┘

So far, so good. But:

from sklearn.preprocessing import minmax_scale
df.select(pl.all().map(minmax_scale))
shape: (1, 2)
┌─────────────────┬─────────────────┐
│ a               ┆ b               │
│ ---             ┆ ---             │
│ list[f64]       ┆ list[f64]       │
╞═════════════════╪═════════════════╡
│ [0.0, 0.5, 1.0] ┆ [0.0, 0.5, 1.0] │
└─────────────────┴─────────────────┘

I found a way to convert the pl.List columns back, but it seems strange that this step is needed.

df.select(pl.all().map(minmax_scale)).explode(pl.all())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 0.0 │
│ 0.5 ┆ 0.5 │
│ 1.0 ┆ 1.0 │
└─────┴─────┘

Both minmax_scale and np.log2 return arrays, so I would expect the behavior to be the same. What is the proper way of doing this?

Answer 1

Score: 2

Alternatively, why not do the scaling math yourself with native expressions instead of using map or apply, so that Polars can multithread it?

df.select(pl.all().log(2))

df.select((pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()))

Answer 2

Score: 1

Try this:

minmax_scale(df['a'])
# array([0. , 0.5, 1. ])

Now try:

np.log2(df['a'])
shape: (3,)
Series: 'a' [f64]
[
    0.0
    1.0
    1.584963
]

Notice that with log2 you get back a Series, not an array.
The difference is that np.log2 is a true NumPy ufunc, so the output remains a Series, whereas minmax_scale is an ordinary Python function that returns a NumPy array. You can see this directly with:

type(np.log2)
# numpy.ufunc

type(minmax_scale)
# function
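
The same check can be written with isinstance, which needs only NumPy. In this sketch, `scale_01` is a hypothetical stand-in for minmax_scale (not scikit-learn's implementation), included only so that the example runs without scikit-learn installed.

```python
import numpy as np


def scale_01(values):
    """Hypothetical stand-in for minmax_scale: a plain Python function."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.min()) / (arr.max() - arr.min())


# np.log2 is a ufunc object; scale_01 is an ordinary function.
log2_is_ufunc = isinstance(np.log2, np.ufunc)
scale_is_ufunc = isinstance(scale_01, np.ufunc)

print(log2_is_ufunc, scale_is_ufunc)

# A plain function like this hands back a NumPy array.
result = scale_01([1, 2, 3])
print(type(result))
```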

You can do this to avoid the explode:

df.select(pl.all().map(lambda x: pl.Series(minmax_scale(x))))

Or you can define your own function in one line:

my_minmax_scale = lambda x: pl.Series(minmax_scale(x))
df.select(pl.all().map(my_minmax_scale))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 0.0 │
│ 0.5 ┆ 0.5 │
│ 1.0 ┆ 1.0 │
└─────┴─────┘
