How to use a NumPy function to add a Polars DataFrame column
Question
This is a continuation of my previous question.
User glebcom helped me convert coordinates from a string to a list of float64 values.
In the answer I found two methods for calculating the distance between coordinates:
- using the formula numpy.linalg.norm(a-b)
- using from scipy.spatial import distance: dst = distance.euclidean(a, b)
How can I apply one of these formulas to calculate the distance between the coordinates in columns c and d of a polars data frame?
import polars as pl
from scipy.spatial import distance
import numpy as np
pl.Config.set_fmt_str_lengths(2000)
data={"a": ["782.83 7363.51 6293 40 PD","850.68 7513.1 6262.17 40 PD"], "b": ["795.88 7462.65 6293 40 PD","1061.64 7486.08 6124.85 40 PD"]}
df=pl.DataFrame(data)
df=df.with_columns([
pl.col("a").str.replace_all(r" +", " ")\
.str.split(" ").arr.slice(0,3)\
.cast(pl.List(pl.Float64)).alias("c"),\
pl.col("b").str.replace_all(r" +", " ")\
.str.split(" ").arr.slice(0,3)\
.cast(pl.List(pl.Float64)).alias("d")\
])
print(df)
My attempts were:
df=df.with_columns(np.linalg.norm(pl.col("c")-pl.col("d")).alias("distance"))
or
df=df.with_columns(distance(pl.col("c"),pl.col("d")).alias("distance"))
but neither of them works.
Thanks in advance for your assistance.
Artur
Answer 1
Score: 3
You won't be able to call numpy.linalg.norm directly on your polars data frame. To get one distance per row, it needs a numpy array of shape (N, n), where N is the number of points and n is the number of dimensions (here 3).
You can prepare the data yourself, pass it to numpy, and put the results back into polars.
First, calculate the difference between the coordinates of your two points across all 3 dimensions:
diffs = df.select(
[
(pl.col("c").arr.get(i) - pl.col("d").arr.get(i)).alias(f"diff_{i}")
for i in range(3)
]
)
┌─────────┬────────┬────────┐
│ diff_0 ┆ diff_1 ┆ diff_2 │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═════════╪════════╪════════╡
│ -13.05 ┆ -99.14 ┆ 0.0 │
│ -210.96 ┆ 27.02 ┆ 137.32 │
└─────────┴────────┴────────┘
Then convert it to numpy and call the function:
import numpy.linalg
# axis=1 computes one norm per row, i.e. one distance per pair of points
distance = numpy.linalg.norm(diffs.to_numpy(), axis=1)
pl.Series(distance).alias("distance")
┌────────────┐
│ distance │
│ --- │
│ f64 │
╞════════════╡
│ 99.99521 │
│ 253.161973 │
└────────────┘
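To make the role of axis=1 concrete, here is a minimal NumPy-only sketch. It assumes the difference values from the table above are typed in by hand rather than taken from the data frame, and the names per_row and whole are illustrative only.

```python
import numpy as np

# Coordinate differences (c - d) for the two rows in the question,
# copied by hand from the diffs table (an assumption of this sketch).
diffs = np.array([
    [-13.05, -99.14, 0.0],
    [-210.96, 27.02, 137.32],
])

# axis=1 collapses the 3 coordinate columns, giving one norm per row.
per_row = np.linalg.norm(diffs, axis=1)

# Without axis, a 2-D input yields a single Frobenius norm for the whole matrix.
whole = np.linalg.norm(diffs)

print(per_row.shape)
print(per_row)
print(whole)
```

Leaving out axis=1 would therefore return a single number for the whole table instead of one distance per row.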
Alternatively, you can calculate the Euclidean distance yourself:
df.select(
[
(pl.col("c").arr.get(i) - pl.col("d").arr.get(i)).alias(f"diff_{i}") ** 2
for i in range(3)
]
).sum(axis=1).sqrt()
┌────────────┐
│ distance │
│ --- │
│ f64 │
╞════════════╡
│ 99.99521 │
│ 253.161973 │
└────────────┘
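For reference, the quantity computed by squaring, summing, and taking the square root is the usual Euclidean distance. The notation below (the name dist and the indices 0 to 2) is chosen for this sketch and does not come from the original answer; the worked numbers are those of the first row.

```latex
\operatorname{dist}(c, d) = \sqrt{\sum_{i=0}^{2} (c_i - d_i)^2}

% First row: c - d = (-13.05, -99.14, 0)
\sqrt{(-13.05)^2 + (-99.14)^2 + 0^2}
  = \sqrt{170.3025 + 9828.7396}
  = \sqrt{9999.0421}
  \approx 99.99521
```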
PS: scipy.spatial.distance.euclidean is not a good fit here because it handles only one pair of points at a time, so it would have to be called once per row, which would be very slow in polars.
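The difference between the two calling patterns can be sketched as follows. To keep the example dependent on NumPy and the standard library only, math.dist stands in for scipy.spatial.distance.euclidean; that swap is an assumption of this sketch, as are the hand-typed coordinate lists.

```python
import math
import numpy as np

# First three coordinates of columns "a" and "b" from the question,
# typed in by hand for this sketch.
c = [[782.83, 7363.51, 6293.0], [850.68, 7513.1, 6262.17]]
d = [[795.88, 7462.65, 6293.0], [1061.64, 7486.08, 6124.85]]

# Pair-at-a-time pattern: one Python-level call per row.
# math.dist is a stand-in for scipy.spatial.distance.euclidean.
looped = [math.dist(p, q) for p, q in zip(c, d)]

# Vectorized pattern: a single NumPy call covers every row.
vectorized = np.linalg.norm(np.array(c) - np.array(d), axis=1)

print(looped)
print(vectorized)
```

Both patterns give the same distances. The loop pays a Python function call for every row, which is what becomes costly on large frames.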
Answer 2
Score: 1
Solution with np.linalg.norm inside map:
def l2_norm(s: pl.Series) -> pl.Series:
# 1) difference: c-d
diff = s.struct.field("c").to_numpy() - s.struct.field("d").to_numpy()
# 2) apply np.linalg.norm()
return pl.Series(diff).apply(
lambda x: np.linalg.norm(np.array(x))
)
df.with_columns([
pl.struct(["c", "d"]).map(l2_norm).alias("distance")
])
┌────────────┐
│ distance │
│ --- │
│ f64 │
╞════════════╡
│ 99.99521 │
│ 253.161973 │
└────────────┘
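If every row holds the same number of coordinates, the per-row lambda could probably be replaced by a single vectorized call. The sketch below is NumPy-only and is not part of the original answer: the helper l2_norm_rows is hypothetical, and it assumes the two columns have already been turned into plain lists of lists (for example with a Series' to_list()).

```python
import numpy as np

def l2_norm_rows(c_rows, d_rows):
    """Row-wise Euclidean distance between two equally shaped point lists.

    Hypothetical helper for illustration; assumes rectangular input,
    i.e. every row has the same number of coordinates.
    """
    c = np.asarray(c_rows, dtype=np.float64)
    d = np.asarray(d_rows, dtype=np.float64)
    if c.ndim != 2 or c.shape != d.shape:
        raise ValueError("expected two 2-D arrays of the same shape")
    # One call computes the norm of every difference row.
    return np.linalg.norm(c - d, axis=1)

# Values of columns "c" and "d" from the question, typed in by hand.
c_rows = [[782.83, 7363.51, 6293.0], [850.68, 7513.1, 6262.17]]
d_rows = [[795.88, 7462.65, 6293.0], [1061.64, 7486.08, 6124.85]]

distances = l2_norm_rows(c_rows, d_rows)
print(distances)
```

The resulting array could then be wrapped in a pl.Series and attached to the data frame, as in Answer 1.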