June 1, 2023 19:42:16

Masking a pandas column based on another column with slightly different values

Question

I have two pandas DataFrames in Python, each with a large number of xyz coordinates. One of them will be used to mask/remove some coordinates in the other, but the coordinates differ very slightly, so I cannot simply remove duplicates. As an example, let's say they look like this:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(data=None, columns=['x', 'y', 'z'])
    df1.x = [104245, 252355, 547364, 135152]
    df1.y = [842714, 135812, 425328, 124912]
    df1.z = [125125, 547574, 364343, 346372]

    df2 = pd.DataFrame(data=None, columns=['x', 'y', 'z'])
    df2.x = [104230, 547298]
    df2.y = [842498, 424989]
    df2.z = [124976, 364001]

What I then want is for the first and second xyz rows in df2 to remove the first and third rows in df1. My idea was to create new columns with rounded values, compare those, and remove rows based on the comparison. It would look something like this:

    df1['id'] = np.linspace(0, len(df1)-1, len(df1))
    df2['id'] = np.linspace(0, len(df2)-1, len(df2))

    df3 = df1.round({'x': -3, 'y': -3, 'z': -3})
    df4 = df2.round({'x': -3, 'y': -3, 'z': -3})

    df5 = df3.merge(df4, on=['x', 'y', 'z'], how='inner')
    df6 = df1[~df1.index.isin(df5.id_x)]

This works fine for removing some of the values, but often they round to different places. I was hoping for help finding a better method to mask the values that are simply closest in all three coordinates. Maybe something that finds the closest xyz pair between df1 and df2 and masks those pairs. If anyone has any ideas I would really appreciate it!
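To make the "round to different places" problem concrete, here is a small sketch using the sample data from the question. It rebuilds the two frames with a dict constructor (a convenience, not the asker's exact code) and shows that the first pair lands in different 1000-wide bins on the y axis, so only the third row of df1 is matched by the rounding approach.

```python
import pandas as pd

df1 = pd.DataFrame({
    "x": [104245, 252355, 547364, 135152],
    "y": [842714, 135812, 425328, 124912],
    "z": [125125, 547574, 364343, 346372],
})
df2 = pd.DataFrame({
    "x": [104230, 547298],
    "y": [842498, 424989],
    "z": [124976, 364001],
})

decimals = {"x": -3, "y": -3, "z": -3}  # round to the nearest thousand
r1 = df1.round(decimals)
r2 = df2.round(decimals)

# The first rows differ by only 216 in y, yet they round to different thousands
print(r1.iloc[0].tolist())
print(r2.iloc[0].tolist())

# Positions in df1 whose rounded coordinates agree with some df2 row on all axes
matched = r1.reset_index().merge(r2, on=["x", "y", "z"], how="inner")["index"].tolist()
print(matched)
```

Any fixed grid has this weakness: two points can be arbitrarily close and still sit on opposite sides of a bin edge, which is why the answers below compare distances directly.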

Answer 1

Score: 2

Use KDTree:

    from scipy.spatial import KDTree

    # distance from each row of df1 to its nearest neighbor in df2
    distances, indices = KDTree(df2).query(df1, k=1)

    # keep a row only if its nearest neighbor is at least 3000 away
    out = df1[distances >= 3000]

Output:

    >>> out
            x       y       z
    1  252355  135812  547574
    3  135152  124912  346372
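For reference, a self-contained sketch of the same KDTree idea on the sample data. Two things here are assumptions rather than part of the answer: the `coords` list, which restricts the tree to the coordinate columns so that a helper column such as the question's `id` does not enter the distance, and the `max_dist` value, which is only a placeholder tolerance.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

df1 = pd.DataFrame({
    "x": [104245, 252355, 547364, 135152],
    "y": [842714, 135812, 425328, 124912],
    "z": [125125, 547574, 364343, 346372],
})
df2 = pd.DataFrame({
    "x": [104230, 547298],
    "y": [842498, 424989],
    "z": [124976, 364001],
})

coords = ["x", "y", "z"]  # assumption: only these columns define a point
tree = KDTree(df2[coords].to_numpy())

# For every df1 row: Euclidean distance to, and position of, its nearest df2 row
distances, nearest = tree.query(df1[coords].to_numpy(), k=1)

max_dist = 1000  # hypothetical tolerance; tune it to your data
out = df1[distances > max_dist]

print(np.round(distances, 1))
print(out)
```

Inspecting `distances` before choosing the cutoff is worthwhile: on this sample the two intended matches are a few hundred units away from their partners, while the unrelated rows are far beyond any plausible tolerance.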

Answer 2

Score: 2

You can use numpy broadcasting to consider the individual distances between the coordinates:

    import numpy as np

    # convert DataFrames to numpy arrays
    a1 = df1.to_numpy()
    a2 = df2.to_numpy()

    # define a distance below which the coordinates are considered equal
    thresh = 500

    # compute the per-coordinate distances, identify rows that match on all coordinates
    matches = (abs(a1[:,None]-a2) <= thresh).all(axis=-1)

    idx1, idx2 = np.where(matches)
    # (array([0, 2]), array([0, 1]))

    out = df1.drop(df1.index[idx1])
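The per-coordinate test above and a Euclidean test are not interchangeable: the first accepts everything inside a cube of half-width `thresh`, the second everything inside a sphere of radius `thresh`. A minimal sketch with two made-up points (not from the question) shows a pair that passes one test and fails the other:

```python
import numpy as np

thresh = 500
p = np.array([[0, 0, 0]])
q = np.array([[400, 400, 400]])  # made-up point, 400 away on every axis

diff = p[:, None] - q  # shape (1, 1, 3) after broadcasting

# Per-coordinate check: every axis must be within thresh
per_axis = (np.abs(diff) <= thresh).all(axis=-1)

# Euclidean check: the straight-line distance must be within thresh
euclid = np.linalg.norm(diff, axis=-1) <= thresh

print(per_axis, euclid)
```

Here the straight-line distance is 400 times the square root of 3, roughly 693, so the pair counts as a match per coordinate but not by Euclidean distance. Which behavior is right depends on how the coordinate noise in your data is distributed.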

To consider the Euclidean distance between the points (taking into account all coordinates simultaneously), use scipy.spatial.distance.cdist:

    from scipy.spatial.distance import cdist

    thresh = 1000

    matches = cdist(a1, a2) <= thresh

    idx1, idx2 = np.where(matches)

    out = df1.drop(df1.index[idx1])

Output:

            x       y       z
    1  252355  135812  547574
    3  135152  124912  346372

Removing the single point from `df1` that is closest to each row of `df2` and below a threshold:

    from scipy.spatial.distance import cdist

    thresh = 1000

    dist = cdist(a1, a2)

    idx = np.argmin(dist, axis=0)

    out = df1.drop(df1.index[idx[dist[idx, np.arange(len(a2))] <= thresh]])
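The indexing in that last line is dense, so here is the same logic unpacked step by step on the sample data. The tight `thresh = 300` is a hypothetical value chosen only so that the threshold visibly filters one of the two candidates; the intermediate name `nearest` is also just for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df1 = pd.DataFrame({
    "x": [104245, 252355, 547364, 135152],
    "y": [842714, 135812, 425328, 124912],
    "z": [125125, 547574, 364343, 346372],
})
df2 = pd.DataFrame({
    "x": [104230, 547298],
    "y": [842498, 424989],
    "z": [124976, 364001],
})

dist = cdist(df1.to_numpy(), df2.to_numpy())  # shape (len(df1), len(df2))

# For each df2 row (column of dist): position of the closest df1 row
idx = np.argmin(dist, axis=0)

# Distance from each df2 row to that closest df1 row
nearest = dist[idx, np.arange(len(df2))]

thresh = 300  # hypothetical, deliberately tight
to_drop = idx[nearest <= thresh]  # keep only candidates that are close enough
out = df1.drop(df1.index[to_drop])

print(idx, np.round(nearest, 1))
print(out)
```

With this tolerance only the first row of df1 is removed; the third row is the closest point to the second df2 row but lies about 486 away, so it stays. Raising `thresh` to 1000 removes both, as in the answer.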

If the distance doesn't matter and you only want to remove the closest point:

    from scipy.spatial.distance import cdist

    dist = cdist(a1, a2)

    idx = np.argmin(dist, axis=0)

    out = df1.drop(df1.index[idx])

Answer 3

Score: 1

Another possible solution, which uses numpy broadcasting:

    thresh = 1000
    # Euclidean distance from every df2 row to every df1 row, shape (len(df2), len(df1))
    df1[~np.any(
        np.sqrt(np.sum((df1.values - df2.values[:, None])**2, axis=2)) < thresh,
        axis=0 )]

If we want to remove the closest points, we can use:

    # note: index labels are used as positions here, so df1 needs its default RangeIndex
    df1.iloc[df1.index.difference(np.argmin(
        np.sqrt(np.sum((df1.values - df2.values[:, None])**2, axis=2)), axis=1)), :]

Output:

            x       y       z
    1  252355  135812  547574
    3  135152  124912  346372
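The "closest points" variant above feeds index labels to `iloc`, which only lines up when df1 has the default 0..n-1 index. If your frame has been filtered or re-labelled, a label-safe sketch of the same idea might look like this; the string index is invented purely to show that labels and positions no longer coincide.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "x": [104245, 252355, 547364, 135152],
        "y": [842714, 135812, 425328, 124912],
        "z": [125125, 547574, 364343, 346372],
    },
    index=["a", "b", "c", "d"],  # hypothetical non-default labels
)
df2 = pd.DataFrame({
    "x": [104230, 547298],
    "y": [842498, 424989],
    "z": [124976, 364001],
})

# Euclidean distances via broadcasting, shape (len(df2), len(df1))
dist = np.sqrt(np.sum((df1.values - df2.values[:, None]) ** 2, axis=2))

# For each df2 row: position of the closest df1 row
closest = np.argmin(dist, axis=1)

# Translate positions to labels before dropping
out = df1.drop(df1.index[closest])
print(out)
```

Converting positions to labels with `df1.index[closest]` keeps the result correct regardless of how df1 is indexed.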
