Masking a pandas column based on another column with slightly different values
Question
I have two pandas DataFrames in Python, each holding a large number of xyz coordinates. One of them should be used to mask (remove) coordinates in the other, but the matching coordinates differ slightly, so I cannot simply drop duplicates. As an example, say they look like this:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=None, columns=['x', 'y', 'z'])
df1.x = [104245, 252355, 547364, 135152]
df1.y = [842714, 135812, 425328, 124912]
df1.z = [125125, 547574, 364343, 346372]
df2 = pd.DataFrame(data=None, columns=['x', 'y', 'z'])
df2.x = [104230, 547298]
df2.y = [842498, 424989]
df2.z = [124976, 364001]
What I want is for the first and second xyz rows in df2 to remove the first and third rows in df1. My idea was to create copies with rounded values, compare those, and remove rows based on the matches. It would look something like this:
df1['id'] = np.linspace(0,len(df1)-1,len(df1))
df2['id'] = np.linspace(0,len(df2)-1,len(df2))
df3 = df1.round({'x': -3, 'y': -3, 'z': -3})
df4 = df2.round({'x': -3, 'y': -3, 'z': -3})
df5 = df3.merge(df4, on=['x', 'y', 'z'], how='inner')
df6 = df1[~df1.index.isin(df5.id_x)]
This removes some of the values, but matching points often round to different values. Is there a better way to mask the points that are simply closest in all three coordinates? Perhaps something that finds the closest xyz pairs between df1 and df2 and masks those. If anyone has any ideas, I would really appreciate it!
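To illustrate why the rounding approach misses pairs, here is a minimal sketch using the first point from each frame in the question. The variable names are made up for the illustration. Two values that sit on opposite sides of a rounding boundary end up in different bins, however close they are.

```python
import numpy as np

# First row of df1 and first row of df2 from the question (x, y, z).
p1 = np.array([104245, 842714, 125125])
p2 = np.array([104230, 842498, 124976])

# Round to the nearest 1000, as in the attempt above.
r1 = np.round(p1, -3)
r2 = np.round(p2, -3)

# The y values differ by only 216, yet 842714 rounds up and 842498 rounds down.
same_bin = (r1 == r2)
print(r1, r2, same_bin)
```

Because the merge requires all three rounded coordinates to agree, a single mismatch like this is enough to lose the pair, which is why a distance-based comparison is more robust.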
Answer 1
Score: 2
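Use a KDTree to find, for each row of df1, the distance to its nearest point in df2:
from scipy.spatial import KDTree
# assumes df1 and df2 hold only the x, y, z columns
# distance from each df1 row to its nearest df2 point (and that point's position)
distances, indices = KDTree(df2).query(df1, k=1)
# keep a row only if its nearest df2 point is at least 3000 away
out = df1[distances >= 3000]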
Output:
>>> out
        x       y       z
1  252355  135812  547574
3  135152  124912  346372
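If the frames carry extra columns (such as the id column added in the question), those would be fed into the tree as well. A minimal sketch of a wrapper that selects the coordinate columns explicitly is shown below. The function name drop_near and its parameters are hypothetical, not part of SciPy or pandas.

```python
import pandas as pd
from scipy.spatial import KDTree

def drop_near(df1, df2, radius, cols=('x', 'y', 'z')):
    """Return the rows of df1 whose nearest df2 point is at least `radius` away.

    Hypothetical helper: only the columns in `cols` are used as coordinates.
    """
    cols = list(cols)
    tree = KDTree(df2[cols].to_numpy())
    # Euclidean distance from every df1 point to its nearest df2 point.
    distances, _ = tree.query(df1[cols].to_numpy(), k=1)
    return df1[distances >= radius]

df1 = pd.DataFrame({'x': [104245, 252355, 547364, 135152],
                    'y': [842714, 135812, 425328, 124912],
                    'z': [125125, 547574, 364343, 346372]})
df2 = pd.DataFrame({'x': [104230, 547298],
                    'y': [842498, 424989],
                    'z': [124976, 364001]})

out = drop_near(df1, df2, radius=3000)
print(out)
```

The radius of 3000 is taken from the answer above. It should be chosen from the expected noise in the data, since anything closer than that to a df2 point is removed.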
Answer 2
Score: 2
You can use numpy broadcasting to compare each coordinate separately, so that two points match only if every coordinate is within the threshold:
import numpy as np

# convert DataFrames to numpy arrays
a1 = df1.to_numpy()
a2 = df2.to_numpy()
# define a distance below which the coordinates are considered equal
thresh = 500
# compute the distances, identify matches on all coordinates
matches = (abs(a1[:,None]-a2) <= thresh).all(axis=-1)
idx1, idx2 = np.where(matches)
# (array([0, 2]), array([0, 1]))
out = df1.drop(df1.index[idx1])
To consider the Euclidean distance between the points (taking all coordinates into account simultaneously), use scipy.spatial.distance.cdist:
from scipy.spatial.distance import cdist
thresh = 1000
matches = cdist(a1, a2) <= thresh
idx1, idx2 = np.where(matches)
out = df1.drop(df1.index[idx1])
Output:
        x       y       z
1  252355  135812  547574
3  135152  124912  346372
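The two thresholds above are not interchangeable. The per-coordinate test is equivalent to a Chebyshev (maximum-coordinate) distance, which cdist also supports through metric='chebyshev', while the default metric is Euclidean. A minimal sketch with one made-up pair of points (not from the question) shows a case where the two disagree:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical points: each coordinate is off by 400.
p = np.array([[0, 0, 0]])
q = np.array([[400, 400, 400]])
thresh = 500

# Per-coordinate test via broadcasting, as in the first snippet.
per_axis = (np.abs(p[:, None] - q) <= thresh).all(axis=-1)

# Same test expressed as a Chebyshev distance.
chebyshev = cdist(p, q, metric='chebyshev') <= thresh

# Euclidean distance is sqrt(3) * 400, roughly 693, which exceeds the threshold.
euclidean = cdist(p, q) <= thresh

print(per_axis, chebyshev, euclidean)
```

Which one fits depends on how the noise behaves: an independent tolerance per axis suggests the per-coordinate test, a tolerance on overall displacement suggests the Euclidean one.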
To remove only the single point in df1 that is closest to each row of df2, and only if it lies within the threshold:
from scipy.spatial.distance import cdist
thresh = 1000
dist = cdist(a1, a2)
idx = np.argmin(dist, axis=0)
out = df1.drop(df1.index[idx[dist[idx, np.arange(len(a2))] <= thresh]])
If the distance doesn't matter and you only want to remove the closest point:
from scipy.spatial.distance import cdist
dist = cdist(a1, a2)
idx = np.argmin(dist, axis=0)
out = df1.drop(df1.index[idx])
Answer 3
Score: 1
Another possible solution, which uses numpy broadcasting:
thresh = 1000
df1[~np.any(
    np.sqrt(np.sum((df1.values - df2.values[:, None])**2, axis=2)) < thresh,
    axis=0)]
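If we want to remove the closest points instead, we can use:
# position in df1 of the point closest to each df2 row
closest = np.argmin(
    np.sqrt(np.sum((df1.values - df2.values[:, None])**2, axis=2)), axis=1)
# drop by label so this also works when df1 does not have a default integer index
df1.drop(df1.index[closest])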
Output:
        x       y       z
1  252355  135812  547574
3  135152  124912  346372