Apply minmax_scale to all columns in polars data frame
Question
I am trying to follow the advice from this question:
import numpy as np
import polars as pl
df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.select([pl.all().map(np.log2)])
shape: (3, 2)
┌──────────┬──────────┐
│ a ┆ b │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞══════════╪══════════╡
│ 0.0 ┆ 2.0 │
│ 1.0 ┆ 2.321928 │
│ 1.584963 ┆ 2.584963 │
└──────────┴──────────┘
So far, so good. But:
from sklearn.preprocessing import minmax_scale
df.select(pl.all().map(minmax_scale))
shape: (1, 2)
┌─────────────────┬─────────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ list[f64] ┆ list[f64] │
╞═════════════════╪═════════════════╡
│ [0.0, 0.5, 1.0] ┆ [0.0, 0.5, 1.0] │
└─────────────────┴─────────────────┘
I found a way to convert the pl.List columns back, but it seems strange that this step is needed.
df.select(pl.all().map(minmax_scale)).explode(pl.all())
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 0.0 │
│ 0.5 ┆ 0.5 │
│ 1.0 ┆ 1.0 │
└─────┴─────┘
Both minmax_scale and np.log2 return arrays, so I would expect the behavior to be the same. What is the proper way of doing this?
Answer 1
Score: 2
Alternatively: why not do the scaling math yourself instead of using map or apply, so that Polars can multithread it?
df.select(pl.all().log(2))
df.select((pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()))
Answer 2
Score: 1
Try to do this...
minmax_scale(df['a'])
# array([0. , 0.5, 1. ])
Now do...
np.log2(df['a'])
shape: (3,)
Series: 'a' [f64]
[
0.0
1.0
1.584963
]
Notice that with log2 you get back a Series, not an array.
The difference is that log2 is a true ufunc, so the output remains a Series, whereas minmax_scale is an ordinary function that returns a NumPy array. You can see this directly with:
type(np.log2)
# numpy.ufunc
type(minmax_scale)
# function
You can do this to avoid the explode:
df.select(pl.all().map(lambda x: pl.Series(minmax_scale(x))))
Or you can define your own function in one line:
my_minmax_scale = lambda x: pl.Series(minmax_scale(x))
df.select(pl.all().map(my_minmax_scale))
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 0.0 ┆ 0.0 │
│ 0.5 ┆ 0.5 │
│ 1.0 ┆ 1.0 │
└─────┴─────┘