Dropping rows with duplicate indexes, based on criteria
Question
I'm working on my master's project and ran into a situation where I could only come up with a rather slow and inelegant solution. Thus, I'm reaching out for suggestions on how to improve it.
The core is an exploded pandas DataFrame (let us call it `df` for ease) with a new column for the original indexes. It essentially looks as follows:
```
   originalIndex  A   B
0              0  1  10
1              1  2   5
2              2  4   2
3              2  4   1
4              2  2   1
5              3  5   4
```
The goal is to drop all rows which have duplicates in `originalIndex`, except the one for which `abs(df['A'] - df['B'])` is smallest, i.e., minimal. Thus, the new DataFrame ought to look like:
```
   originalIndex  A   B
0              0  1  10
1              1  2   5
4              2  2   1
5              3  5   4
```
With the possibility of resetting the indexes afterwards (I'm familiar with that part).
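For anyone who wants to experiment, the input frame shown at the top can be rebuilt with a short snippet like the one below (the construction itself is not part of the original question; the values are taken from the table):

```python
import pandas as pd

# Rebuild the example input frame from the question
df = pd.DataFrame({
    "originalIndex": [0, 1, 2, 2, 2, 3],
    "A": [1, 2, 4, 4, 2, 5],
    "B": [10, 5, 2, 1, 1, 4],
})
```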
My current method is as follows:
```python
temp = df[df.originalIndex.duplicated(keep=False) == True]
for parInd in temp.originalIndex.unique():
    duplicate_df = temp[temp.originalIndex == parInd]
    duplicate_df["Diff"] = duplicate_df.apply(lambda x: abs(x.A - x.B), axis=1)
    minInd = duplicate_df["Diff"].argmin()
    test = duplicate_df.index.drop(duplicate_df.index[minInd])
    df.drop(test)
```
This takes ages for the large number of duplicates I have in one file alone, not to mention if I have to run it for more than one. Thus, I beseech you for something that improves on this ramshackle method.
Update: I've edited the example so that `df` is correct before and after.
# Answer 1
**Score**: 1
I think your example should have row 2 with values `(3, 1)` for `originalIndex == 2` instead of `(4, 1)` as you show - because this in fact has the largest difference. Your code would return `(3, 1)` and remove the other rows.
You should **always** avoid using `.apply` where possible. There are many options for computing your function that are far faster, some of which include:
```python
%timeit df.apply(lambda x: abs(x.A - x.B), axis=1)
# 2.77 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit abs(df.A - df.B)
# 226 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.A.sub(df.B).abs()
# 193 µs ± 9.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df["A"].sub(df["B"]).abs()
# 173 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Secondly, you should change `df.drop(test)` to `df.drop(test, inplace=True)`, so that the drop is kept. At the moment, your `df` is not actually changing.
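Put differently (a minimal illustration, not part of the original answer), either re-assign the result of `drop` or drop in place:

```python
# Option 1: drop() returns a new frame, so keep the result
df = df.drop(test)

# Option 2: modify df in place, as suggested above
df.drop(test, inplace=True)
```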
You could simply use the following instead:
```python
# create difference column
df["diff"] = df["A"].sub(df["B"]).abs()

# sort by the difference, and drop duplicates in originalIndex
# (keeping the first, i.e. the row with the lowest absolute difference),
# then sort by index again so that the rows are in the same order as before
df = df.sort_values("diff", ascending=True).drop_duplicates(
    subset="originalIndex", keep="first").sort_index()

# drop the "diff" column
df.drop("diff", axis=1, inplace=True)
```