How to use a NumPy function to add a Polars data frame column

Question

This is a continuation of my previous question.
User glebcom helped me convert coordinates from a string to a list of float64 values.
In the answer I found two ways to calculate the distance between coordinates:

  1. using the formula numpy.linalg.norm(a-b)
  2. using scipy: from scipy.spatial import distance, then dst = distance.euclidean(a, b)

How can I apply one of these formulas to calculate the distance between the coordinates in columns c and d of a polars data frame?

import polars as pl
from scipy.spatial import distance
import numpy as np

pl.Config.set_fmt_str_lengths(2000)
data = {
    "a": ["782.83    7363.51    6293    40   PD", "850.68    7513.1    6262.17    40   PD"],
    "b": ["795.88    7462.65    6293    40   PD", "1061.64    7486.08    6124.85    40   PD"],
}
df = pl.DataFrame(data)
df = df.with_columns([
    pl.col("a").str.replace_all(r" +", " ")
        .str.split(" ").arr.slice(0, 3)
        .cast(pl.List(pl.Float64)).alias("c"),
    pl.col("b").str.replace_all(r" +", " ")
        .str.split(" ").arr.slice(0, 3)
        .cast(pl.List(pl.Float64)).alias("d"),
])
print(df)

My attempts were:

df=df.with_columns(np.linalg.norm(pl.col("c")-pl.col("d")).alias("distance"))
or
df=df.with_columns(distance(pl.col("c"),pl.col("d")).alias("distance"))

but neither of them works.
Thanks in advance for your help.

Artur

Answer 1

Score: 3

You won't be able to call numpy.linalg.norm directly on your polars data frame. It expects a numpy array of shape (N, n), where N is the number of points and n is the number of dimensions (here 3).

You can prepare the data yourself, pass it to numpy, and put the results back into polars.
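
To make the expected shape concrete, here is a small NumPy-only sketch. The arrays c and d are copied by hand from the question's sample rows (the first three numbers of each string), so they are an assumption about what the columns hold rather than something read from the data frame.

```python
import numpy as np

# Coordinates copied by hand from the question's sample rows (illustrative).
c = np.array([[782.83, 7363.51, 6293.0], [850.68, 7513.1, 6262.17]])
d = np.array([[795.88, 7462.65, 6293.0], [1061.64, 7486.08, 6124.85]])

# c - d has shape (N, n) = (2, 3); axis=1 takes the norm of each row,
# giving one distance per pair of points.
row_distances = np.linalg.norm(c - d, axis=1)
print(row_distances.shape)  # (2,)
```

Without axis=1, norm would reduce the whole 2-D array to a single number (its Frobenius norm) instead of one distance per row.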

First, calculate the difference between the coordinates of your two points, across all 3 dimensions:

diffs = df.select(
    [
        (pl.col("c").arr.get(i) - pl.col("d").arr.get(i)).alias(f"diff_{i}")
        for i in range(3)
    ]
)
┌─────────┬────────┬────────┐
│ diff_0  ┆ diff_1 ┆ diff_2 │
│ ---     ┆ ---    ┆ ---    │
│ f64     ┆ f64    ┆ f64    │
╞═════════╪════════╪════════╡
│ -13.05  ┆ -99.14 ┆ 0.0    │
│ -210.96 ┆ 27.02  ┆ 137.32 │
└─────────┴────────┴────────┘

Then convert it to numpy and call the function:

import numpy.linalg
distance=numpy.linalg.norm(diffs.to_numpy(), axis=1)
pl.Series(distance).alias("distance")
┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘

Alternatively, you can calculate the Euclidean distance yourself:

df.select(
    [
        (pl.col("c").arr.get(i) - pl.col("d").arr.get(i)).alias(f"diff_{i}") ** 2
        for i in range(3)
    ]
).sum(axis=1).sqrt()
┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘

PS: scipy.spatial.distance.euclidean is not a good fit here because it handles only one pair of points at a time, which would make it very slow in polars.
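
To illustrate what "one pair at a time" means, the sketch below contrasts a per-pair loop with a single vectorized call. It uses math.dist from the standard library as a stand-in with the same calling pattern as distance.euclidean(a, b); that substitution is an assumption made only to keep the example dependency-light.

```python
import math

import numpy as np

# Coordinates copied by hand from the question's sample rows (illustrative).
c = [[782.83, 7363.51, 6293.0], [850.68, 7513.1, 6262.17]]
d = [[795.88, 7462.65, 6293.0], [1061.64, 7486.08, 6124.85]]

# One Python-level call per pair of points (stand-in for distance.euclidean).
one_at_a_time = [math.dist(a, b) for a, b in zip(c, d)]

# All pairs handled in a single vectorized NumPy call.
vectorized = np.linalg.norm(np.array(c) - np.array(d), axis=1)

print(one_at_a_time)
print(vectorized)
```

Both give the same distances; the difference is that the loop makes one Python call per row, which adds up on large frames.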

Answer 2

Score: 1

A solution with np.linalg.norm inside map:

def l2_norm(s: pl.Series) -> pl.Series:
    # 1) difference: c-d
    diff = s.struct.field("c").to_numpy() - s.struct.field("d").to_numpy()
    # 2) apply np.linalg.norm()
    return pl.Series(diff).apply(
        lambda x: np.linalg.norm(np.array(x))
    )

df.with_columns([
    pl.struct(["c", "d"]).map(l2_norm).alias("distance")
])
┌────────────┐
│ distance   │
│ ---        │
│ f64        │
╞════════════╡
│ 99.99521   │
│ 253.161973 │
└────────────┘

huangapple
  • Posted on February 7, 2023, 02:21:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75365174.html