Masking a pandas column based on another column with slightly different values

Question

So what I have is two Pandas dataframes in Python with a large number of xyz-coordinates. One of them will be used to mask/remove some coordinates in the other one, but the problem is that the coordinates are very slightly different so that I cannot simply remove duplicates. As an example, let's say they look like this:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=None, columns=['x', 'y', 'z'])
df1.x = [104245, 252355, 547364, 135152]
df1.y = [842714, 135812, 425328, 124912]
df1.z = [125125, 547574, 364343, 346372]

df2 = pd.DataFrame(data=None, columns=['x', 'y', 'z'])
df2.x = [104230, 547298]
df2.y = [842498, 424989]
df2.z = [124976, 364001]

What I then want is for the first and second xyz-rows in df2 to remove the first and third row in df1. My idea was to create new columns with rounded values, compare those, and remove based on those. It would look something like this:

df1['id'] = np.linspace(0,len(df1)-1,len(df1))
df2['id'] = np.linspace(0,len(df2)-1,len(df2))

df3 = df1.round({'x': -3, 'y': -3, 'z': -3})
df4 = df2.round({'x': -3, 'y': -3, 'z': -3})

df5 = df3.merge(df4, on=['x', 'y', 'z'], how='inner')
df6 = df1[~df1.index.isin(df5.id_x)]

This works fine to remove some of the values, but often two nearby values round to different places. I was hoping for help finding a better method to mask the values that are simply closest in all three coordinates. Maybe it could find the closest xyz-pairs between df1 and df2 and mask those pairs. If anyone has any ideas I would really appreciate it!
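
The rounding problem can be seen on the first row of each frame. The following is a minimal sketch using only the two y-values from the example above; `np.round` is used here in place of `DataFrame.round` purely to keep the snippet short.

```python
import numpy as np

# y-coordinates of the first row of df1 and the first row of df2
y1, y2 = 842714, 842498

# The two values differ by only 216 ...
diff = abs(y1 - y2)

# ... but they sit on opposite sides of a rounding boundary (842500),
# so rounding to the nearest 1000 puts them in different buckets.
r1, r2 = np.round([y1, y2], -3)

print(diff, r1, r2)  # 216 843000 842000
```

Because the rounded y-values differ, the inner merge on the rounded columns cannot pair these two rows, even though the points are close.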

Answer 1

Score: 2

Use KDTree:

from scipy.spatial import KDTree

# for each row of df1, compute the distance to its nearest neighbour in df2
distances, indices = KDTree(df2).query(df1, k=1)

# keep a row only if that distance is at least 3000
out = df1[distances >= 3000]

Output:

>>> out
        x       y       z
1  252355  135812  547574
3  135152  124912  346372
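
As a self-contained sketch of the same idea, the query can be wrapped in a small helper. The function name `drop_near` and its `cols` parameter are hypothetical and not part of the answer; restricting the tree to the coordinate columns is an assumption made so that extra columns (such as the `id` column from the question) do not enter the distance.

```python
import pandas as pd
from scipy.spatial import KDTree

def drop_near(df1, df2, radius, cols=("x", "y", "z")):
    """Return the rows of df1 whose nearest df2 point is at least `radius` away."""
    cols = list(cols)
    tree = KDTree(df2[cols].to_numpy())
    # distances[i] is the Euclidean distance from df1 row i to its nearest df2 row
    distances, _ = tree.query(df1[cols].to_numpy(), k=1)
    return df1[distances >= radius]

df1 = pd.DataFrame({"x": [104245, 252355, 547364, 135152],
                    "y": [842714, 135812, 425328, 124912],
                    "z": [125125, 547574, 364343, 346372]})
df2 = pd.DataFrame({"x": [104230, 547298],
                    "y": [842498, 424989],
                    "z": [124976, 364001]})

out = drop_near(df1, df2, radius=3000)
print(out)
```

With the example data this keeps the rows labelled 1 and 3, matching the output above.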

Answer 2

Score: 2

You can use numpy broadcasting to compare the coordinates individually, i.e. require the distance along each axis to be small:

import numpy as np

# convert DataFrames to numpy arrays
a1 = df1.to_numpy()
a2 = df2.to_numpy()

# define a distance below which the coordinates are considered equal
thresh = 500

# compute the per-axis distances, identify matches on all coordinates
matches = (abs(a1[:,None]-a2) <= thresh).all(axis=-1)

idx1, idx2 = np.where(matches)
# (array([0, 2]), array([0, 1]))

out = df1.drop(df1.index[idx1])
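
To make the broadcasting step more concrete, here is a small sketch that prints the intermediate shapes for the example data. The name `diff` is introduced only for illustration.

```python
import numpy as np

a1 = np.array([[104245, 842714, 125125],
               [252355, 135812, 547574],
               [547364, 425328, 364343],
               [135152, 124912, 346372]])
a2 = np.array([[104230, 842498, 124976],
               [547298, 424989, 364001]])

# a1[:, None] has shape (4, 1, 3); subtracting a2 (shape (2, 3)) broadcasts
# to (4, 2, 3): one per-axis difference for every (df1 row, df2 row) pair.
diff = np.abs(a1[:, None] - a2)

# A pair matches only if all three per-axis differences are within the threshold.
matches = (diff <= 500).all(axis=-1)  # shape (4, 2)

print(diff.shape, matches.shape)
print(diff[0, 0])  # per-axis differences between df1 row 0 and df2 row 0
```

Note that the intermediate array holds `len(df1) * len(df2) * 3` values, so for very large frames memory use may become a concern.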

To consider the euclidean distance between the points (taking into account all coordinates simultaneously), use scipy.spatial.distance.cdist:

from scipy.spatial.distance import cdist

thresh = 1000

matches = cdist(a1, a2) <= thresh

idx1, idx2 = np.where(matches)

out = df1.drop(df1.index[idx1])

Output:

        x       y       z
1  252355  135812  547574
3  135152  124912  346372

To remove only the single point from df1 that is closest to each row of df2, and only if it lies within the threshold:

from scipy.spatial.distance import cdist

thresh = 1000

dist = cdist(a1, a2)

idx = np.argmin(dist, axis=0)

out = df1.drop(df1.index[idx[dist[idx, np.arange(len(a2))] <= thresh]])

If the distance doesn't matter and you only want to remove the closest point:

from scipy.spatial.distance import cdist

dist = cdist(a1, a2)

idx = np.argmin(dist, axis=0)

out = df1.drop(df1.index[idx])
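
The difference between the threshold variant and the nearest-point variant only shows up when several df1 points lie close to the same df2 point. The following sketch uses made-up coordinates (not the data from the question) to illustrate this; the names `out_all` and `out_nearest` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Made-up points: rows 0 and 1 of df1 both lie within 50 of the single df2 point.
df1 = pd.DataFrame({"x": [0, 10, 1000], "y": [0, 0, 0], "z": [0, 0, 0]})
df2 = pd.DataFrame({"x": [1], "y": [0], "z": [0]})
thresh = 50

dist = cdist(df1.to_numpy(), df2.to_numpy())  # shape (3, 1)

# Variant A: drop every df1 row within thresh of any df2 row.
idx_all, _ = np.where(dist <= thresh)
out_all = df1.drop(df1.index[idx_all])

# Variant B: for each df2 row, drop only its nearest df1 row, if within thresh.
idx = np.argmin(dist, axis=0)
near = dist[idx, np.arange(len(df2))] <= thresh
out_nearest = df1.drop(df1.index[idx[near]])

print(list(out_all.index), list(out_nearest.index))  # [2] [1, 2]
```

Variant A removes both nearby rows, while variant B removes only the closest one, so the choice depends on whether each df2 point should mask at most one df1 point.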

Answer 3

Score: 1

Another possible solution, which uses numpy broadcasting:

thresh = 1000
df1[~np.any(
    np.sqrt(np.sum((df1.values - df2.values[:, None])**2, axis=2)) < thresh,
    axis=0 )]

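If we instead want to remove the point of df1 closest to each row of df2, we can use the following (it mixes index labels with positions, so it relies on df1 having the default 0..n-1 index):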

df1.iloc[df1.index.difference(np.argmin(
    np.sqrt(np.sum((df1.values - df2.values[:, None])**2, axis=2)), axis=1)), :]

Output:

        x       y       z
1  252355  135812  547574
3  135152  124912  346372
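
For readers who find the one-liner hard to follow, the same computation can be written step by step. This is only a sketch: the intermediate names are hypothetical, and the custom string index is an assumption added to show that dropping by label via `df1.index[...]` does not depend on a default index.

```python
import numpy as np
import pandas as pd

# Same coordinates as in the question, but with a non-default index.
df1 = pd.DataFrame({"x": [104245, 252355, 547364, 135152],
                    "y": [842714, 135812, 425328, 124912],
                    "z": [125125, 547574, 364343, 346372]},
                   index=["a", "b", "c", "d"])
df2 = pd.DataFrame({"x": [104230, 547298],
                    "y": [842498, 424989],
                    "z": [124976, 364001]})

# (len(df2), len(df1), 3) array of coordinate differences
diff = df1.to_numpy() - df2.to_numpy()[:, None]
# (len(df2), len(df1)) matrix of Euclidean distances
dist = np.sqrt((diff ** 2).sum(axis=2))

# position in df1 of the nearest point for each df2 row
nearest = dist.argmin(axis=1)

# translate positions to labels before dropping
out = df1.drop(df1.index[nearest])
print(out)
```

Here the rows at positions 0 and 2 (labels "a" and "c") are removed, leaving "b" and "d".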

huangapple
  • Posted on 2023-06-01 19:42:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76381508.html